Extract PDF Text With OCR
Description
This Activity reads the contents of a PDF file, including headers, and extracts the text using an OCR engine.
Properties
Input
- From Page Number – Set the page extraction mode to "Range" and specify the page number at which to start the extraction.
- Image Format – Specify the image format in which to save the extracted images.
- Image Resize Percentage – Rescales the image by the specified percentage.
- OCR Engine – An instance of an OCR engine returned by one of the following activities.
- Page Extraction Mode – Set the page extraction mode to "All," "Single," or "Range" to continue the extraction.
- Password – Sets the password for the PDF file, if necessary.
- PDF File Path – The path of the PDF file from which you want to extract the text.
- Single Page Number – Set the page extraction mode to "Single" and specify the page number from which to extract the text.
- To Page Number – Set the page extraction mode to "Range" and specify the page number at which to end the extraction.
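The three page extraction modes can be read as choosing a set of page numbers before OCR runs. The sketch below illustrates that selection logic in Python. It is not the Activity's implementation: the helper name `select_pages`, the 1-based page numbering, and the inclusive range are all assumptions made for illustration.

```python
def select_pages(total_pages, mode="All", single_page=None,
                 from_page=None, to_page=None):
    """Return the 1-based page numbers to process (hypothetical helper)."""
    if mode == "All":
        # Every page in the document.
        return list(range(1, total_pages + 1))
    if mode == "Single":
        # Exactly one page, taken from Single Page Number.
        if single_page is None or not 1 <= single_page <= total_pages:
            raise ValueError("Single Page Number must be within the document")
        return [single_page]
    if mode == "Range":
        # From Page Number through To Page Number, assumed inclusive.
        if from_page is None or to_page is None:
            raise ValueError("Range mode needs From and To Page Number")
        if not 1 <= from_page <= to_page <= total_pages:
            raise ValueError("Page range must lie within the document")
        return list(range(from_page, to_page + 1))
    raise ValueError(f"Unknown page extraction mode: {mode!r}")


pages_all = select_pages(5)
pages_single = select_pages(5, "Single", single_page=3)
pages_range = select_pages(5, "Range", from_page=2, to_page=4)
```

Under these assumptions, "Single" uses only Single Page Number and "Range" uses only From Page Number and To Page Number; the other page properties are ignored in each mode.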
Misc
- DisplayName – Add a display name to your Activity.
- Private – By default, the Activity logs the values of your properties inside your workflow. If Private is selected, it stops logging them.
Optional
- Continue On Error – Specifies whether the automation should continue even when the Activity throws an error. If True, the Activity continues without throwing any exceptions. If False, the Activity throws an exception. The default value is False.
Note
If this Activity is inside a Try-Catch block and the value of this property is True, no error is caught, because the Activity does not throw one.
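The interaction between Continue On Error and an enclosing Try-Catch can be sketched in Python as follows. This is an analogy under stated assumptions, not the product's code: `run_activity` and `failing_extraction` are hypothetical names, and a Python `try`/`except` stands in for the Try-Catch block.

```python
def run_activity(action, continue_on_error=False):
    """Run `action`; suppress its exception when continue_on_error is True."""
    try:
        return action()
    except Exception:
        if continue_on_error:
            # Error is swallowed here, so an outer handler never sees it.
            return None
        raise


def failing_extraction():
    # Stand-in for an extraction that fails (hypothetical).
    raise RuntimeError("OCR failed")


caught = []

# Continue On Error = True: the outer handler is never triggered.
try:
    run_activity(failing_extraction, continue_on_error=True)
except RuntimeError:
    caught.append("continue_on_error=True")

# Continue On Error = False (the default): the outer handler catches the error.
try:
    run_activity(failing_extraction, continue_on_error=False)
except RuntimeError:
    caught.append("continue_on_error=False")
```

After this runs, `caught` contains only the entry for the False case, which mirrors the note above: with the property set to True, the Catch branch has nothing to handle.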
Tesseract
- Page Segmentation Mode – Set the page segmentation mode that Tesseract uses when extracting data.
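For background, the Tesseract command-line tool selects its page segmentation mode with the `--psm` option, which takes a number from 0 to 13. The sketch below builds such a command line in Python. How this Activity maps its own setting onto Tesseract is not documented here, so the helper name `build_tesseract_command` and the file name are assumptions, and only a few of the modes are listed.

```python
# A few of Tesseract's page segmentation modes (subset, for illustration).
PSM_MODES = {
    3: "Fully automatic page segmentation, no OSD (Tesseract default)",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
}


def build_tesseract_command(image_path, psm=3):
    """Build a Tesseract CLI argument list (hypothetical helper)."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation mode must be 0-13")
    # "stdout" asks Tesseract to print the recognized text instead of
    # writing it to an output file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


# "page_1.png" is a placeholder file name; the command is built, not run.
command = build_tesseract_command("page_1.png", psm=6)
```

The resulting list could be passed to `subprocess.run` on a machine where Tesseract is installed; the sketch stops at building the command so that it runs without that dependency.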
Output
- Result – The text extracted from the PDF file using the OCR engine.
Example
Download Example