
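Get Started Quickly with the Aspose.OCR Optical Character Recognition Component: How to Extract Text from a PDF

Translation | Tutorial | Editor: 顏馨 | 2023-05-16 10:09:01.360 | 221 views

Overview: This article explains how to perform OCR on PDF documents and extract text from them in C#.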

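Aspose.OCR is a character recognition component that lets developers add OCR functionality to their ASP.NET web applications, web services, and Windows applications. It provides a simple set of classes for controlling character recognition. Aspose.OCR is aimed at developers who need to work with images (BMP and TIFF) in their own applications. It lets them extract text from images quickly and easily, saving the time and effort of building an OCR solution from scratch.

The Aspose APIs support processing of popular file formats and allow documents of many kinds to be exported or converted to fixed-layout file formats and the most commonly used image and multimedia formats.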


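PDF files are among the most common business documents. In some cases we need to read scanned PDF documents programmatically. Because text is hard to extract from scanned PDFs, tools have been developed that make it easier to read and retrieve text from such documents. Depending on a document's content, extracting its text can be useful for many reasons. In this article, we will learn how to perform OCR on PDF documents and extract text from them in C#.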

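C# API for PDF OCR to Text

We will use the Aspose.OCR for .NET API to perform OCR on PDF documents. It can recognize scanned images, smartphone photos, screenshots, and areas of images. The API returns the recognized text in the most popular document and data exchange formats. Besides converting images to text, the API can create searchable PDFs from scans. It can also automatically correct spelling mistakes in the recognized text.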

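The API provides the AsposeOcr class, which offers various methods for performing OCR operations. Its RecognizePdf(string, DocumentRecognitionSettings) method recognizes the text in the supplied PDF document. The DocumentRecognitionSettings class holds the settings for the PDF recognition process. The RecognitionResult class represents the result of recognition.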

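OCR a PDF and Extract Text in C#

We can perform OCR on a PDF document and extract the recognized text by following these steps:

First, create an instance of the AsposeOcr class.
Next, initialize an object of the DocumentRecognitionSettings class.
Then, specify the language to use for OCR.
After that, call the RecognizePdf() method to get the list of RecognitionResult objects. It takes the PDF file path and the document recognition settings object as arguments.
Finally, loop through the list of recognition results and display the recognized text.

The following code sample shows how to OCR a PDF document and extract the recognized text in C#.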

// This code example demonstrates how to OCR PDF documents and extract the recognized text.
using System;
using System.Collections.Generic;
using Aspose.OCR;

// Initialize the OCR engine
AsposeOcr recognitionEngine = new AsposeOcr();

// Initialize recognition settings
DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

// Specify language for OCR. Multi-language by default
recognitionSettings.Language = Language.Eng;

// Recognize text from PDF
List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

// Show the recognized text of each result
foreach (RecognitionResult result in results)
{
    Console.WriteLine(result.RecognitionText);
}
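
The loop above prints each result's text one after another, which makes it hard to tell where one result ends and the next begins. The sketch below shows one way to join the texts under numbered headers. It assumes that each RecognitionResult corresponds to one page and that its RecognitionText has already been copied into a list of strings. PageTextFormatter and Combine are hypothetical names introduced for illustration and are not part of Aspose.OCR.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextFormatter
{
    // Hypothetical helper (not part of Aspose.OCR): joins per-page texts
    // into one string, putting a "--- Page N ---" header before each page.
    public static string Combine(IReadOnlyList<string> pageTexts)
    {
        var builder = new StringBuilder();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            builder.Append("--- Page ").Append(i + 1).Append(" ---\n");
            // Trim trailing whitespace so every page ends with exactly one newline.
            builder.Append(pageTexts[i].TrimEnd()).Append('\n');
        }
        return builder.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in data; in the example above these strings would come from
        // result.RecognitionText for each RecognitionResult.
        var pageTexts = new List<string> { "First page text\n", "Second page text" };
        Console.Write(PageTextFormatter.Combine(pageTexts));
    }
}
```

Keeping the formatting in a plain string helper means it can be checked without running the OCR engine.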


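Perform OCR on a PDF and Save the Text in C#

We can perform OCR on a PDF document and save the recognized text by following these steps:

First, create an instance of the AsposeOcr class.
Next, initialize an object of the DocumentRecognitionSettings class.
Then, specify the language to use for OCR.
After that, call the RecognizePdf() method to get the list of RecognitionResult objects. It takes the PDF file path and the document recognition settings object as arguments.
Finally, save the text with the SaveMultipageDocument() method. It takes the output file path, a SaveFormat value, and the RecognitionResult list as arguments.

The following code sample shows how to OCR a PDF document and save the recognized text in C#.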

// This code example demonstrates how to OCR PDF documents and save the recognized text to a text file.
using System.Collections.Generic;
using Aspose.OCR;

// Initialize the OCR engine
AsposeOcr recognitionEngine = new AsposeOcr();

// Initialize recognition settings
DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

// Specify language for OCR. Multi-language by default
recognitionSettings.Language = Language.Eng;

// Recognize text from PDF
List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

// Save the recognized text as a plain text file
AsposeOcr.SaveMultipageDocument("C:\\Files\\OCR_result.txt", SaveFormat.Text, results);

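OCR a Scanned PDF and Convert It to Word in C#

We can perform OCR on a scanned PDF document and save the recognized text in a Word document by following the steps above. The only change is to specify SaveFormat.Docx in the last step.

The following code sample shows how to OCR a PDF and save the recognized text as a Word document in C#.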

// This code example demonstrates how to OCR PDF documents and save the recognized text as DOCX.
using System.Collections.Generic;
using Aspose.OCR;

// Initialize the OCR engine
AsposeOcr recognitionEngine = new AsposeOcr();

// Initialize recognition settings
DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

// Specify language for OCR. Multi-language by default
recognitionSettings.Language = Language.Eng;

// Recognize text from PDF
List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

// Save the recognized text as DOCX
AsposeOcr.SaveMultipageDocument("C:\\Files\\OCR_result.docx", SaveFormat.Docx, results);


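OCR a PDF and Convert It to JSON in C#

We can perform OCR on a PDF document and save the recognized text in a JSON file by following the steps above. The only change is to specify SaveFormat.Json in the last step.

The following code sample shows how to OCR a PDF and save the recognized text as a JSON file in C#.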

// This code example demonstrates how to OCR PDF documents and save the recognized text as JSON.
using System.Collections.Generic;
using Aspose.OCR;

// Initialize the OCR engine
AsposeOcr recognitionEngine = new AsposeOcr();

// Initialize recognition settings
DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

// Specify language for OCR. Multi-language by default
recognitionSettings.Language = Language.Eng;

// Recognize text from PDF
List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

// Save the recognized text as JSON
AsposeOcr.SaveMultipageDocument("C:\\Files\\OCR_result.json", SaveFormat.Json, results);

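Those are the detailed steps for performing OCR on PDF documents and extracting text from them in C#. We hope this helps. If you have other questions, you are welcome to join our technical discussion group or follow us.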


For more information, contact us or join the Aspose technical discussion group (761297826).