Tesseract – OCR

This plugin provides recipes to perform Optical Character Recognition (OCR) using the Tesseract engine.

Plugin information

Version: 1.0.0
Author: Dataiku (Stanislas GUINEL)
Released: 2020-06
Last updated: 2020-06
License: Apache Software License
Source code: GitHub
Reporting issues: GitHub

How to set up

If you are a Dataiku admin user, follow the instructions in the README.md file on the plugin's GitHub page to install the required packages on the DSS instance machine.

If you are not an admin, you can forward this to your admin and/or scroll down to the How to use section.

Warning: You must first install Tesseract on the DSS instance machine!
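As a quick sanity check before using the recipes, an admin could verify that the `tesseract` executable is reachable from Python. The snippet below is only a minimal sketch: the helper name `tesseract_version` is hypothetical, and it assumes the binary is on the `PATH` of the user running DSS.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if it is not found."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the release, the banner may land on stdout or stderr,
    # so fall back to stderr when stdout is empty.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "Tesseract was not found on the PATH")
```

If the function returns `None`, Tesseract is either not installed or not visible to the current user.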

How to use

This plugin has multiple components: an Image conversion recipe, an Image processing recipe, a Text extraction recipe, and a notebook template.

Let’s assume that you have a Dataiku DSS project with a folder containing both images and PDFs.

In order to extract text from images and PDFs, you must go through the following steps:

Example of the complete flow (Conversion > Processing > Extraction)

Image conversion

Because the Text extraction recipe only works on greyscale JPG images, you may have to use the Image conversion recipe first.

The Image conversion recipe takes as input a folder of images (JPG/JPEG/PNG/TIFF) and PDFs. It converts them into greyscale JPG images. If a PDF has multiple pages, it creates a subfolder with one image per page.

 

You can also set some advanced parameters in the image conversion:

  • DPI (dots per inch): sets the resolution of images extracted from PDFs. It has no effect on other input images.
  • Quality: sets the JPEG quality of the output images (the `quality` parameter of the PIL package).
Image conversion parameters form
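To make the role of the DPI and Quality parameters concrete, here is a rough sketch of what such a conversion could look like outside DSS. It is not the plugin's actual code: it assumes Pillow for images and the `pdf2image` package (which needs poppler) for PDFs, and the helper names, the 1-based page numbering, the subfolder name, and the default values are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def page_image_name(pdf_stem: str, page_number: int) -> str:
    """Build a page file name following the <PDF_NAME>_pdf_page_XXXXX.jpg pattern."""
    return f"{pdf_stem}_pdf_page_{page_number:05d}.jpg"


def convert_image(src: Path, out_dir: Path, quality: int = 75) -> Path:
    """Convert one JPG/PNG/TIFF image into a greyscale JPG."""
    from PIL import Image  # Pillow, imported lazily

    target = out_dir / f"{src.stem}.jpg"
    with Image.open(src) as img:
        # Mode "L" is 8-bit greyscale; `quality` is Pillow's JPEG quality setting.
        img.convert("L").save(target, "JPEG", quality=quality)
    return target


def convert_pdf(src: Path, out_dir: Path, dpi: int = 200, quality: int = 75) -> List[Path]:
    """Render each PDF page at `dpi` and save it as a greyscale JPG."""
    from pdf2image import convert_from_path  # assumption: pdf2image + poppler installed

    pages = convert_from_path(str(src), dpi=dpi)
    if len(pages) == 1:
        # Assumption: a single-page PDF is written next to the other images.
        target = out_dir / f"{src.stem}.jpg"
        pages[0].convert("L").save(target, "JPEG", quality=quality)
        return [target]

    # Multi-page PDF: one subfolder (named after the PDF here) with one image per page.
    subfolder = out_dir / src.stem
    subfolder.mkdir(parents=True, exist_ok=True)
    targets = []
    for number, page in enumerate(pages, start=1):
        target = subfolder / page_image_name(src.stem, number)
        page.convert("L").save(target, "JPEG", quality=quality)
        targets.append(target)
    return targets
```

A higher DPI generally gives Tesseract more pixels per character at the cost of larger files, which is why it only matters for pages rendered from PDFs.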

Notebook template

You may want to process images before extracting text from them in order to get better results.

There is a notebook template where you can explore the effect of different image processing techniques.

Go to Notebooks (G+N) and create a new Python notebook. Select the template `Image processing for text extraction`, then check that the plugin code env is selected (you can set it via Kernel > Change kernel).

Choose the Image processing template when creating a new notebook

You can then use the predefined functions or write your own to explore different types of image processing. You need to enter the input folder ID manually in the notebook.

In the notebook, you can visualize the effect of image processing functions using the function display_images_before_after defined in the notebook, which displays an image before and after processing:

Visualize the effect of image processing functions

You can also look at the extracted text before and after image processing using the function text_extraction_before_after defined in the notebook:

Check the extracted text depending on image processing functions
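The idea behind the before/after text comparison can be sketched in a few lines. This is not the notebook's text_extraction_before_after function, whose signature is not documented here; `text_before_after` and `show_diff` are hypothetical helpers, and the sketch assumes the `pytesseract` package is available in the code env when no custom OCR function is passed.

```python
import difflib
from typing import Any, Callable, List, Optional, Tuple


def text_before_after(
    image: Any,
    process: Callable[[Any], Any],
    ocr: Optional[Callable[[Any], str]] = None,
) -> Tuple[str, str]:
    """Run OCR on the raw image and on the processed image."""
    if ocr is None:
        # Assumption: pytesseract is installed; image_to_string accepts
        # PIL images and numpy arrays.
        import pytesseract

        ocr = pytesseract.image_to_string
    return ocr(image), ocr(process(image))


def show_diff(before: str, after: str) -> List[str]:
    """Return a line-by-line unified diff of the two extracted texts."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

Passing the OCR function as a parameter keeps the comparison logic independent of Tesseract, so it can be tried out with a stub before pointing it at real images.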

Image processing

This recipe processes each greyscale JPG image in the input folder using the functions defined by the user in the recipe parameters form. Each function takes an image as a NumPy array and returns the processed image as a NumPy array.

For example, you can copy the functions you want from the notebook and paste them into the Image processing recipe form:
Image processing parameters form
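For illustration, functions of this shape (a NumPy array in, a NumPy array out) could look like the sketch below. The two functions are hypothetical examples rather than the ones shipped in the notebook template, and they assume an 8-bit greyscale image stored as a 2-D `uint8` array.

```python
import numpy as np


def binarize(image: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Map pixels at or above `threshold` to white (255) and the rest to black (0)."""
    return np.where(image >= threshold, 255, 0).astype(np.uint8)


def stretch_contrast(image: np.ndarray) -> np.ndarray:
    """Linearly rescale pixel values so they span the full 0-255 range."""
    low, high = int(image.min()), int(image.max())
    if high == low:
        # A flat image has no contrast to stretch; return it unchanged.
        return image.copy()
    scaled = (image.astype(np.float32) - low) * (255.0 / (high - low))
    return scaled.round().astype(np.uint8)
```

Simple global operations like these are a reasonable starting point; whether they actually improve the extracted text depends on the images, which is what the notebook is meant to help you explore.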

Text extraction

Finally, the Text extraction recipe takes as input a folder of greyscale JPG images and outputs a dataset with two columns: the filename and the text extracted by Tesseract.

If some images in the input folder were extracted from the same multi-page PDF by the Image conversion recipe (meaning that they are in the same subfolder and follow the name pattern <PDF_NAME>_pdf_page_XXXXX.jpg), you can choose to concatenate their extracted text.

You can also specify the language used by Tesseract by entering its code (languages must be installed beforehand; ask your admin).

Text extraction parameters form
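The extraction and concatenation steps can be approximated as follows. This is a sketch under stated assumptions, not the recipe's implementation: it assumes the `pytesseract` and Pillow packages, and the helper names, the newline separator, and the choice of the PDF name as the output key are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches the page file names described above: <PDF_NAME>_pdf_page_XXXXX.jpg
PAGE_PATTERN = re.compile(r"^(?P<pdf>.+)_pdf_page_(?P<page>\d+)\.jpg$")


def extract_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image; `lang` is a Tesseract language code."""
    import pytesseract  # assumption: installed in the code env
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def concatenate_pdf_pages(
    rows: Iterable[Tuple[str, str]], separator: str = "\n"
) -> List[Tuple[str, str]]:
    """Merge the text of pages that come from the same PDF, in page order.

    `rows` holds (filename, text) pairs; other images are passed through as-is.
    """
    grouped: Dict[str, List[Tuple[int, str]]] = {}
    for filename, text in rows:
        basename = filename.rsplit("/", 1)[-1]
        match = PAGE_PATTERN.match(basename)
        if match:
            key, page = match.group("pdf"), int(match.group("page"))
        else:
            key, page = filename, 0
        grouped.setdefault(key, []).append((page, text))
    return [
        (key, separator.join(text for _, text in sorted(pages, key=lambda p: p[0])))
        for key, pages in grouped.items()
    ]
```

For example, two rows named `report/report_pdf_page_00001.jpg` and `report/report_pdf_page_00002.jpg` would be merged into a single `report` row, while a standalone `scan.jpg` row would be left untouched.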
