
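Word Processing Control Aspose.Words Feature Demo: Extracting Text from Word Documents in Python

Translation | Tutorial | Editor: Hu Tao (胡濤) | 2022-05-16 15:27:59.647 | 240 views

Overview: This article covers how to dynamically extract the content between specific elements such as paragraphs and tables.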


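Extract Content from Word DOCX Documents in Python

Extracting text from Word documents is a common task in many scenarios, for example to analyze the text, or to pull out specific parts of a document and combine them into a single document. In this article, you will learn how to extract text from Word documents programmatically in Python. We will also cover how to dynamically extract the content between specific elements such as paragraphs and tables.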


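Info: If you need to produce a Word document from a PowerPoint presentation, you can use the Aspose Presentation to Word Document Converter.

Python Library for Extracting Text from Word Documents

Aspose.Words for Python is a powerful library that lets you create MS Word documents from scratch. It also lets you manipulate existing Word documents for encryption, conversion, text extraction, and more. We will use this library to extract text from Word DOCX or DOC documents. You can install it from PyPI with the following pip command.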

pip install aspose-words
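
As a quick sanity check after installation, the following minimal sketch tests whether a module can be imported. The helper name is_importable is illustrative and not part of Aspose.Words; the sketch assumes the package is imported as aspose.words, which is the import name the code in this article relies on (aw).

```python
import importlib


def is_importable(module_name: str) -> bool:
    """Return True if the module can be imported in the current environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is covered too.
        return False
    return True


if is_importable("aspose.words"):
    print("aspose.words is available")
else:
    print("aspose.words is missing; run: pip install aspose-words")
```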

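Text Extraction in Word Documents Using Python

An MS Word document is made up of various elements, including paragraphs, tables, images, and so on. The requirements for text extraction therefore vary from one scenario to another. For example, you may need to extract the text between paragraphs, bookmarks, comments, and so on.

Each type of element in a Word document is represented as a node, so processing a document means working with nodes. Let's look at how to extract text from Word documents in different scenarios.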
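
To make the node idea concrete, here is a small toy model of a document tree. It does not use Aspose.Words at all: ToyNode and walk are hypothetical names invented for this sketch, and the node type strings only imitate the kinds of elements described above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a document node; not part of Aspose.Words."""
    node_type: str
    text: str = ""
    children: List["ToyNode"] = field(default_factory=list)


def walk(node: ToyNode) -> Iterator[ToyNode]:
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


# A toy document: a body holding one paragraph (with two runs) and one table.
body = ToyNode("BODY", children=[
    ToyNode("PARAGRAPH", children=[ToyNode("RUN", "Hello, "), ToyNode("RUN", "world")]),
    ToyNode("TABLE", children=[ToyNode("ROW", children=[ToyNode("CELL")])]),
])

node_types = [n.node_type for n in walk(body)]
text = "".join(n.text for n in walk(body))
print(node_types)
print(text)
```

The point of the model is that block-level elements (paragraph, table) sit directly under the body, while inline elements (runs) are nested inside them. The extractor below relies on the same distinction.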

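Extract Text from a Word Document in Python

In this section, we will implement a Python text extractor for Word documents. The text extraction workflow is as follows:

First, we define the nodes to be included in the extraction process.
Then, we extract the content between the specified nodes (with or without the start and end nodes themselves).
Finally, we use clones of the extracted nodes, for example to create a new Word document containing the extracted content.

Now let's write a method named extract_content, to which we pass the nodes and a few other parameters to perform the extraction. This method parses the document and clones the nodes. The parameters we pass to it are described below.

startNode and endNode are the start and end points of the content extraction, respectively. They can be block-level nodes (Paragraph, Table) or inline-level nodes (for example Run, FieldStart, BookmarkStart).
1. To pass a field, pass the corresponding FieldStart object.
2. To pass a bookmark, pass the BookmarkStart and BookmarkEnd nodes.
3. For a comment, use the CommentRangeStart and CommentRangeEnd nodes.
isInclusive defines whether the markers themselves are included in the extraction. If this option is set to False and the same node or two consecutive nodes are passed, an empty list is returned.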
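
The effect of the inclusive flag can be illustrated with a toy model over a flat list of block names. This is only a sketch of the semantics described above, not the extractor itself: slice_between and the sample list are hypothetical and do not touch Aspose.Words.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_between(items: Sequence[T], start: T, end: T, is_inclusive: bool) -> List[T]:
    """Toy model: return the items between two markers in a flat sequence."""
    start_index = items.index(start)
    end_index = items.index(end)
    if start_index > end_index:
        raise ValueError("The end marker must not come before the start marker")
    if is_inclusive:
        # Keep both markers.
        return list(items[start_index:end_index + 1])
    # Drop both markers; same or adjacent markers leave nothing in between.
    return list(items[start_index + 1:end_index])


blocks = ["p0", "p1", "table", "p2"]
print(slice_between(blocks, "p0", "table", True))
print(slice_between(blocks, "p0", "table", False))
print(slice_between(blocks, "p0", "p1", False))
```

With the flag set to False, passing the same marker twice or two adjacent markers leaves nothing between them, which matches the empty-list behavior noted above.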

The following is the complete implementation of the extract_content method, which extracts the content between the nodes passed to it.

import aspose.words as aw


def extract_content(startNode: aw.Node, endNode: aw.Node, isInclusive: bool):

    # First, check that the nodes passed to this method are valid for use.
    verify_parameter_nodes(startNode, endNode)

    # Create a list to store the extracted nodes.
    nodes = []

    # If the end marker is a CommentRangeEnd node and the extraction is inclusive, move the pointer
    # forward to the Comment node found after the CommentRangeEnd node so the comment itself is included.
    if endNode.node_type == aw.NodeType.COMMENT_RANGE_END and isInclusive:
        node = find_next_node(aw.NodeType.COMMENT, endNode.next_sibling)
        if node is not None:
            endNode = node

    # Keep a record of the original nodes passed to this method to split marker nodes if needed.
    originalStartNode = startNode
    originalEndNode = endNode

    # Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    # We will split the first and last nodes' content, depending on whether the marker nodes are inline.
    startNode = get_ancestor_in_body(startNode)
    endNode = get_ancestor_in_body(endNode)

    isExtracting = True
    isStartingNode = True
    # The current node we are extracting from the document.
    currNode = startNode

    # Begin extracting content. Process all block-level nodes and specifically split the first
    # and last nodes when needed, so paragraph formatting is retained.
    # The method is a little more complicated than a regular extractor as we need to factor
    # in extracting using inline nodes, fields, bookmarks, etc. to make it useful.
    while isExtracting:

        # Clone the current node and its children to obtain a copy.
        cloneNode = currNode.clone(True)
        isEndingNode = currNode == endNode

        if isStartingNode or isEndingNode:

            # We need to process each marker separately, so pass it off to a separate method instead.
            # The end marker is processed first to keep node indexes valid.
            if isEndingNode:
                # not isStartingNode: don't add the node twice if the markers are the same node.
                process_marker(cloneNode, nodes, originalEndNode, currNode, isInclusive, False, not isStartingNode, False)
                isExtracting = False

            # This conditional needs to be separate, as the block-level start and end markers may be the same node.
            if isStartingNode:
                process_marker(cloneNode, nodes, originalStartNode, currNode, isInclusive, True, True, False)
                isStartingNode = False

        else:
            # Node is not a start or end marker, simply add the copy to the list.
            nodes.append(cloneNode)

        # Move to the next node and extract it. If the next node is None,
        # the rest of the content is found in a different section.
        if currNode.next_sibling is None and isExtracting:
            # Move to the next section.
            nextSection = currNode.get_ancestor(aw.NodeType.SECTION).next_sibling.as_section()
            currNode = nextSection.body.first_child

        else:
            # Move to the next node in the body.
            currNode = currNode.next_sibling

    # For compatibility with the inline-bookmark mode, add the next paragraph (empty).
    if isInclusive and originalEndNode == endNode and not originalEndNode.is_composite:
        include_next_paragraph(endNode, nodes)

    # Return the nodes between the node markers.
    return nodes

The extract_content method also needs a few helper methods to complete the extraction, as shown below.

def verify_parameter_nodes(start_node: aw.Node, end_node: aw.Node):

    # The order in which these checks are done is important.
    if start_node is None:
        raise ValueError("Start node cannot be None")
    if end_node is None:
        raise ValueError("End node cannot be None")

    if start_node.document != end_node.document:
        raise ValueError("Start node and end node must belong to the same document")

    if start_node.get_ancestor(aw.NodeType.BODY) is None or end_node.get_ancestor(aw.NodeType.BODY) is None:
        raise ValueError("Start node and end node must be a child or descendant of a body")

    # Check the end node is after the start node in the DOM tree.
    # First, check if they are in different sections, then if they're not,
    # check their position in the body of the same section.
    start_section = start_node.get_ancestor(aw.NodeType.SECTION).as_section()
    end_section = end_node.get_ancestor(aw.NodeType.SECTION).as_section()

    start_index = start_section.parent_node.index_of(start_section)
    end_index = end_section.parent_node.index_of(end_section)

    if start_index == end_index:

        if (start_section.body.index_of(get_ancestor_in_body(start_node)) >
                end_section.body.index_of(get_ancestor_in_body(end_node))):
            raise ValueError("The end node must be after the start node in the body")

    elif start_index > end_index:
        raise ValueError("The section of the end node must be after the section of the start node")


def find_next_node(node_type: aw.NodeType, from_node: aw.Node):

    if from_node is None or from_node.node_type == node_type:
        return from_node

    if from_node.is_composite:

        node = find_next_node(node_type, from_node.as_composite_node().first_child)
        if node is not None:
            return node

    return find_next_node(node_type, from_node.next_sibling)


def is_inline(node: aw.Node):

    # Test if the node is a descendant of a Paragraph or Table node and is not itself a paragraph
    # or a table (a paragraph inside a Comment that is a descendant of a paragraph is possible).
    return ((node.get_ancestor(aw.NodeType.PARAGRAPH) is not None or node.get_ancestor(aw.NodeType.TABLE) is not None) and
            not (node.node_type == aw.NodeType.PARAGRAPH or node.node_type == aw.NodeType.TABLE))


def process_marker(clone_node: aw.Node, nodes, node: aw.Node, block_level_ancestor: aw.Node,
                   is_inclusive: bool, is_start_marker: bool, can_add: bool, force_add: bool):

    # If we are dealing with a block-level node, see if it should be included and add it to the list.
    if node == block_level_ancestor:
        if can_add and is_inclusive:
            nodes.append(clone_node)
        return

    # clone_node is a clone of block_level_ancestor. If node != block_level_ancestor, then
    # block_level_ancestor is the node's ancestor, which means it is a composite node.
    assert clone_node.is_composite

    # If a marker is a FieldStart node, check if it's to be included or not.
    # We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
    if node.node_type == aw.NodeType.FIELD_START:
        # If the marker is a start node and is not included, skip to the end of the field.
        # If the marker is an end node and is to be included, move to the end of the field so the field is not removed.
        if (is_start_marker and not is_inclusive) or (not is_start_marker and is_inclusive):
            while node.next_sibling is not None and node.node_type != aw.NodeType.FIELD_END:
                node = node.next_sibling

    # Support the case where the marker node is on the third level of the document body or lower.
    node_branch = fill_self_and_parents(node, block_level_ancestor)

    # Process the corresponding node in our cloned node by index, walking from the
    # outermost ancestor down to the marker itself (hence the step of -1).
    current_clone_node = clone_node
    for i in range(len(node_branch) - 1, -1, -1):

        current_node = node_branch[i]
        node_index = current_node.parent_node.index_of(current_node)
        # Immediate children only; get_child_nodes is used here instead of the child_nodes property.
        current_clone_node = current_clone_node.as_composite_node().get_child_nodes(aw.NodeType.ANY, False)[node_index]

        remove_nodes_outside_of_range(current_clone_node, is_inclusive or (i > 0), is_start_marker)

    # After processing, the composite node may have become empty; add it only if it
    # still has children or if adding is forced.
    if can_add and (force_add or clone_node.as_composite_node().has_child_nodes):
        nodes.append(clone_node)


def remove_nodes_outside_of_range(marker_node: aw.Node, is_inclusive: bool, is_start_marker: bool):

    is_processing = True
    is_removing = is_start_marker
    next_node = marker_node.parent_node.first_child

    while is_processing and next_node is not None:

        current_node = next_node
        is_skip = False

        if current_node == marker_node:
            if is_start_marker:
                is_processing = False
                if is_inclusive:
                    is_removing = False
            else:
                is_removing = True
                if is_inclusive:
                    is_skip = True

        next_node = next_node.next_sibling
        if is_removing and not is_skip:
            current_node.remove()


def fill_self_and_parents(node: aw.Node, till_node: aw.Node):

    nodes = []
    current_node = node

    while current_node != till_node:
        nodes.append(current_node)
        current_node = current_node.parent_node

    return nodes


def include_next_paragraph(node: aw.Node, nodes):

    # find_next_node may return None, so check before converting to a paragraph.
    next_node = find_next_node(aw.NodeType.PARAGRAPH, node.next_sibling)
    if next_node is not None:
        paragraph = next_node.as_paragraph()

        # Move to the first child to include paragraphs without content.
        marker_node = paragraph.first_child if paragraph.has_child_nodes else paragraph
        root_node = get_ancestor_in_body(paragraph)

        process_marker(root_node.clone(True), nodes, marker_node, root_node,
                       marker_node == paragraph, False, True, True)


def get_ancestor_in_body(start_node: aw.Node):

    while start_node.parent_node.node_type != aw.NodeType.BODY:
        start_node = start_node.parent_node
    return start_node


def generate_document(src_doc: aw.Document, nodes):

    dst_doc = aw.Document()
    # Remove the first paragraph from the empty document.
    dst_doc.first_section.body.remove_all_children()

    # Import each node from the list into the new document. Keep the original formatting of the node.
    importer = aw.NodeImporter(src_doc, dst_doc, aw.ImportFormatMode.KEEP_SOURCE_FORMATTING)

    for node in nodes:
        import_node = importer.import_node(node, True)
        dst_doc.first_section.body.append_child(import_node)

    return dst_doc


def paragraphs_by_style_name(doc: aw.Document, style_name: str):

    paragraphs_with_style = []
    paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)

    for paragraph in paragraphs:
        paragraph = paragraph.as_paragraph()
        if paragraph.paragraph_format.style.name == style_name:
            paragraphs_with_style.append(paragraph)

    return paragraphs_with_style

We are now ready to use these methods to extract text from a Word document.
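
The helpers above hand back cloned nodes rather than strings. If plain text is the goal, one possible approach is to join the text of each extracted node, as in the sketch below. It assumes each node exposes a get_text() method that may return trailing control characters; nodes_to_text and FakeNode are hypothetical names, and FakeNode exists only so the sketch runs without Aspose.Words.

```python
from typing import Iterable, List


def nodes_to_text(nodes: Iterable) -> str:
    """Join the text of the given nodes, one non-empty entry per line.

    Assumption: every node offers get_text() returning a string.
    """
    lines: List[str] = []
    for node in nodes:
        # Block-level text may end with control characters such as "\r"; strip them.
        text = node.get_text().strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


class FakeNode:
    """Hypothetical stand-in used only to keep this sketch self-contained."""

    def __init__(self, text: str):
        self._text = text

    def get_text(self) -> str:
        return self._text


sample = [FakeNode("First paragraph.\r"), FakeNode("\r"), FakeNode("Second paragraph.\r")]
print(nodes_to_text(sample))
```

In real use, the list returned by extract_content would be passed in place of sample, provided the assumption about get_text() holds for the installed library version.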

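Extract Text Between Paragraphs in a Word Document

Let's see how to extract the content between two paragraphs in a Word DOCX document. The steps to do this in Python are as follows.

First, load the Word document using the Document class.
Get references to the starting and ending paragraphs in two objects using Document.first_section.body.get_child(NodeType.PARAGRAPH, int, bool).as_paragraph().
Call extract_content(startPara, endPara, True) to extract the nodes into an object.
Call the generate_document(Document, extractedNodes) helper method to create a document containing the extracted content.
Finally, save the returned document using the Document.save(string) method.

The following code sample shows how to extract the text between the 7th and 11th paragraphs of a Word document in Python. Paragraph indexes are zero-based, so these are indexes 6 and 10.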

# Load document.
doc = aw.Document("Extract content.docx")

# Define starting and ending paragraphs.
startPara = doc.first_section.body.get_child(aw.NodeType.PARAGRAPH, 6, True).as_paragraph()
endPara = doc.first_section.body.get_child(aw.NodeType.PARAGRAPH, 10, True).as_paragraph()

# Extract the content between these paragraphs in the document. Include these markers in the extraction.
extractedNodes = extract_content(startPara, endPara, True)

# Generate document containing extracted content.
dstDoc = generate_document(doc, extractedNodes)

# Save document.
dstDoc.save("extract_content_between_paragraphs.docx")
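
The sample above assumes the document has at least eleven paragraphs. A defensive wrapper such as the following sketch can turn a missing paragraph into a clear error instead of a failure on the later as_paragraph() call. It assumes get_child yields None when nothing exists at the requested index; get_child_or_raise and FakeBody are hypothetical names, and FakeBody only imitates that assumed behavior.

```python
class FakeBody:
    """Hypothetical stand-in for a document body so the sketch runs without Aspose.Words."""

    def __init__(self, paragraphs):
        self._paragraphs = list(paragraphs)

    def get_child(self, node_type, index, is_deep):
        # Assumption: a lookup past the last child yields None rather than raising.
        if 0 <= index < len(self._paragraphs):
            return self._paragraphs[index]
        return None


def get_child_or_raise(parent, node_type, index, is_deep=True):
    """Return the requested child, or raise IndexError instead of handing back None."""
    node = parent.get_child(node_type, index, is_deep)
    if node is None:
        raise IndexError(f"no {node_type} node at index {index}")
    return node


body = FakeBody(f"paragraph {n}" for n in range(1, 9))  # eight paragraphs only
print(get_child_or_raise(body, "PARAGRAPH", 6))  # index 6 is the 7th paragraph

try:
    get_child_or_raise(body, "PARAGRAPH", 10)  # there is no 11th paragraph here
except IndexError as error:
    print("lookup failed:", error)
```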

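Extract Text Between Different Types of Nodes in a Word Document

You can also extract content between nodes of different types. To demonstrate, let's extract the content between a paragraph and a table and save it to a new Word document. The steps are as follows.

Load the Word document using the Document class.
Get references to the starting and ending nodes in two objects using the get_child(NodeType, int, bool) method of a section (the sample below uses Document.last_section).
Call extract_content(start_para, end_table, True) to extract the nodes into an object.
Call the generate_document(Document, extracted_nodes) helper method to create a document containing the extracted content.
Save the returned document using the Document.save(string) method.

The following code sample shows how to extract the text between a paragraph and a table in Python.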

# Load document
doc = aw.Document("Extract content.docx")

# Define starting and ending nodes.
start_para = doc.last_section.get_child(aw.NodeType.PARAGRAPH, 2, True).as_paragraph()
end_table = doc.last_section.get_child(aw.NodeType.TABLE, 0, True).as_table()

# Extract the content between these nodes in the document. Include these markers in the extraction.
extracted_nodes = extract_content(start_para, end_table, True)

# Generate document containing extracted content.
dstDoc = generate_document(doc, extracted_nodes)

# Save document.
dstDoc.save("extract_content_between_nodes.docx")

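Extract Text Between Paragraphs Based on Styles

Now let's look at how to extract the content between paragraphs based on their styles. To demonstrate, we will extract the content between the first "Heading 1" and the first "Heading 3" in a Word document. The following steps show how to do this in Python.

First, load the Word document using the Document class.
Then, collect the paragraphs into one object using the paragraphs_by_style_name(Document, "Heading 1") helper method.
Collect the paragraphs into another object using the paragraphs_by_style_name(Document, "Heading 3") helper method.
Call extract_content(startPara, endPara, False), passing the first element of each of the two paragraph lists as the first and second arguments. Passing False leaves the heading paragraphs themselves out of the result.
Call the generate_document(Document, extractedNodes) helper method to create a document containing the extracted content.
Finally, save the returned document using the Document.save(string) method.

The following code sample shows how to extract the content between paragraphs based on styles.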

# Load document
doc = aw.Document("Extract content.docx")

# Gather a list of the paragraphs using the respective heading styles.
parasStyleHeading1 = paragraphs_by_style_name(doc, "Heading 1")
parasStyleHeading3 = paragraphs_by_style_name(doc, "Heading 3")

# Use the first instance of the paragraphs with those styles.
startPara1 = parasStyleHeading1[0]
endPara1 = parasStyleHeading3[0]

# Extract the content between these nodes in the document. Don't include these markers in the extraction.
extractedNodes = extract_content(startPara1, endPara1, False)

# Generate document containing extracted content.
dstDoc = generate_document(doc, extractedNodes)

# Save document.
dstDoc.save("extract_content_between_paragraphs_based_on-Styles.docx")
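
Indexing the style lists with [0] fails with a bare IndexError when the document contains no paragraph in the requested style. The sketch below shows one way to raise a more descriptive error instead. first_or_raise is a hypothetical helper, and the sample dictionary merely stands in for the lists that paragraphs_by_style_name would return.

```python
def first_or_raise(items, description: str):
    """Return the first item, or raise LookupError naming what was missing."""
    items = list(items)
    if not items:
        raise LookupError(f"no {description} found")
    return items[0]


# Stand-in data: paragraph texts grouped by style name.
paragraphs_by_style = {"Heading 1": ["Chapter 1", "Chapter 2"], "Heading 3": []}

print(first_or_raise(paragraphs_by_style["Heading 1"], "Heading 1 paragraph"))

try:
    first_or_raise(paragraphs_by_style["Heading 3"], "Heading 3 paragraph")
except LookupError as error:
    print("cannot extract:", error)
```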

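Conclusion

In this article, you learned how to extract text from MS Word documents using Python. You also saw how to programmatically extract the content between nodes of the same or different types in a Word document. With these building blocks, you can write your own MS Word text extractor in Python. You can explore other features of Aspose.Words for Python in its documentation. If you have any questions, feel free to let us know.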
