OCR PDF and extract plain text in C# and VB.NET

This sample shows how to recognize and extract text from non-searchable PDF documents using the Docotic.Pdf library and the Tesseract OCR engine.

Follow these steps to do OCR when a PDF page does not contain searchable text:

  1. Save the page as a high-resolution image using Docotic.Pdf. Higher resolution generally leads to better recognition quality.
  2. Recognize the text in the image using the Tesseract OCR engine.
  3. Use the recognized text.
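
The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not the sample's exact code. It assumes the Docotic.Pdf package and the Tesseract .NET wrapper are referenced, that a `tessdata` folder with `eng.traineddata` sits next to the executable, and that a valid Docotic.Pdf license is already set up. The file name `input.pdf`, the 300 dpi resolution, and the `page_N.png` image names are illustrative assumptions.

```csharp
using System;
using BitMiracle.Docotic.Pdf;
using Tesseract;

class OcrSketch
{
    static void Main()
    {
        // "input.pdf" is a placeholder file name (assumption).
        using (var pdf = new PdfDocument("input.pdf"))
        // "tessdata" must contain eng.traineddata for Tesseract 4.
        using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.LstmOnly))
        {
            for (int i = 0; i < pdf.PageCount; ++i)
            {
                PdfPage page = pdf.Pages[i];

                // Pages that already contain searchable text do not need OCR.
                string searchableText = page.GetText();
                if (!string.IsNullOrWhiteSpace(searchableText))
                {
                    Console.WriteLine(searchableText);
                    continue;
                }

                // Step 1: save the page as a high-resolution image.
                // 300 dpi is an assumed value; adjust it for your documents.
                PdfDrawOptions options = PdfDrawOptions.Create();
                options.BackgroundColor = new PdfRgbColor(255, 255, 255);
                options.HorizontalResolution = 300;
                options.VerticalResolution = 300;

                string pageImage = $"page_{i}.png";
                page.Save(pageImage, options);

                // Step 2: recognize the image with Tesseract.
                using (Pix img = Pix.LoadFromFile(pageImage))
                using (Page recognizedPage = engine.Process(img))
                {
                    // Step 3: use the recognized text.
                    Console.WriteLine(recognizedPage.GetText());
                }
            }
        }
    }
}
```

Checking `page.GetText()` first avoids running OCR on pages that are already searchable. Raising the resolution may improve recognition, at the cost of larger images and slower processing.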

If your documents contain text in languages other than English, provide the Language Data Files for Tesseract 4.00 for each of those languages.
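
As a hedged illustration, the language is selected when the engine is created. The sketch below assumes the Tesseract .NET wrapper and a `tessdata` folder that contains both `deu.traineddata` and `eng.traineddata`. German is only an example; substitute the language codes of your own documents.

```csharp
using System;
using Tesseract;

class LanguageSetupSketch
{
    static void Main()
    {
        // "deu+eng" asks Tesseract to load German and English together.
        // Both .traineddata files must be present in the tessdata folder (assumption).
        using (var engine = new TesseractEngine(@"tessdata", "deu+eng", EngineMode.LstmOnly))
        {
            Console.WriteLine("Tesseract engine created for German and English.");
        }
    }
}
```

If a requested language file is missing from the folder, engine creation is expected to fail, so it helps to verify the folder contents first.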

Also ensure that the Visual C++ runtimes for Visual Studio 2015-2019 (both x86 and x64) are installed.

See also