Plugin information
Version | 1.0.2 |
---|---|
Author | Dataiku |
Released | 2020-06 |
Last updated | 2021-11 |
License | Apache Software License |
Source code | GitHub |
Reporting issues | GitHub |
How to set up
If you are a Dataiku admin user, follow the instructions in the README.md file on the plugin's GitHub page to install the required packages on the machine hosting the DSS instance.
If you are not an admin, forward these instructions to your admin or scroll down to the How to use section.
Warning: You must first install Tesseract on your machine!
How to use
This plugin has four components: an Image conversion recipe, an Image processing recipe, a Text extraction recipe and a notebook template.
Let’s assume that you have a Dataiku DSS project with a folder containing both images and PDFs.
In order to extract text from images and PDFs, you must go through the following steps:

Image conversion
Because the Text extraction recipe only works on greyscale JPG images, you may have to use the Image conversion recipe first.
The Image conversion recipe takes as input a folder of images (JPG/JPEG/PNG/TIFF) and PDFs. It converts them into greyscale JPG images. If a PDF has multiple pages, it creates a subfolder with one image per page.
You can also set some advanced parameters in the image conversion:
- DPI (Dots Per Inch): sets the resolution of the images extracted from PDFs. It has no effect on other input images.
- Quality: sets the quality of the output JPG images, using the quality parameter of the PIL package.
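
As an illustration of what the conversion and the Quality parameter amount to, here is a minimal sketch using Pillow (the maintained fork of PIL). It is not the plugin's actual code: the function name to_greyscale_jpeg and the default quality value are assumptions made for this example, and PDF rasterisation (where the DPI parameter applies) is not shown.

```python
import io

from PIL import Image


def to_greyscale_jpeg(image: Image.Image, quality: int = 75) -> bytes:
    """Encode an image as a greyscale JPEG and return the raw bytes.

    Hypothetical helper for illustration only; the plugin may proceed differently.
    """
    grey = image.convert("L")  # "L" is Pillow's 8-bit greyscale mode
    buffer = io.BytesIO()
    # `quality` is the standard Pillow JPEG save option
    grey.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()


# Usage: convert an in-memory RGB image and reload the result
source = Image.new("RGB", (8, 8), (200, 30, 30))
converted = Image.open(io.BytesIO(to_greyscale_jpeg(source, quality=90)))
```

A higher quality value generally preserves more detail at the cost of larger files, which can matter for small or faint text.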

Notebook template
You may want to process the images before extracting text from them in order to get better results.
The plugin provides a notebook template where you can explore the effect of different image processing techniques.



Image processing
This recipe processes each greyscale JPG image in the input folder using the functions defined by the user in the recipe's parameter form. Each function takes an image as a numpy array and returns an image as a numpy array.
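
To give an idea of what such functions can look like, the sketch below defines two simple array-in, array-out processing steps and chains them. The names binarize, invert and apply_all, and the threshold value, are assumptions made for this example. How the recipe expects functions to be named or registered is described in its parameter form, not here.

```python
import numpy as np


def binarize(image: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Map every pixel to pure black (0) or pure white (255)."""
    return np.where(image >= threshold, 255, 0).astype(np.uint8)


def invert(image: np.ndarray) -> np.ndarray:
    """Swap dark and light pixels of an 8-bit greyscale image."""
    return 255 - image


def apply_all(image: np.ndarray, functions) -> np.ndarray:
    """Apply each processing function in order (hypothetical helper)."""
    for function in functions:
        image = function(image)
    return image


# Usage on a tiny 2x2 greyscale image
pixels = np.array([[0, 127], [128, 255]], dtype=np.uint8)
result = apply_all(pixels, [binarize, invert])
```

Because every step takes and returns a numpy array, steps can be reordered or removed freely while experimenting in the notebook template.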

Text extraction
Finally, the Text extraction recipe takes as input a folder of greyscale JPG images and outputs a dataset with two columns: the filename and the text extracted by Tesseract.
If some images of the input folder were extracted from the same multiple-page PDF by the Image conversion recipe (meaning that they are in the same subfolder and follow the name pattern <PDF_NAME>_pdf_page_XXXXX.jpg), you can choose to concatenate their extracted text.
You can also specify the language to be used by Tesseract by entering its code (languages must be installed beforehand; ask your admin).
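
The concatenation of pages can be pictured with the following sketch, which groups (filename, text) pairs by PDF name and joins the page texts in page order. It is only an illustration: the function name concatenate_pages, the separator, and the assumption that XXXXX is a run of digits are not taken from the plugin.

```python
import posixpath
import re
from collections import defaultdict

# Assumption: XXXXX in <PDF_NAME>_pdf_page_XXXXX.jpg is a zero-padded page number
PAGE_PATTERN = re.compile(r"^(?P<pdf_name>.+)_pdf_page_(?P<page>\d+)\.jpg$")


def concatenate_pages(rows, separator="\n"):
    """Merge the text of pages from the same PDF; keep other files as-is.

    `rows` is an iterable of (filename, extracted_text) pairs.
    Returns a dict mapping a PDF name or a filename to its text.
    """
    pages = defaultdict(list)  # PDF name -> [(page number, text), ...]
    merged = {}
    for filename, text in rows:
        match = PAGE_PATTERN.match(posixpath.basename(filename))
        if match:
            pages[match["pdf_name"]].append((int(match["page"]), text))
        else:
            merged[filename] = text
    for pdf_name, page_texts in pages.items():
        ordered = sorted(page_texts, key=lambda item: item[0])
        merged[pdf_name] = separator.join(text for _, text in ordered)
    return merged


# Usage: two pages of one PDF (listed out of order) and a standalone image
rows = [
    ("report/report_pdf_page_00002.jpg", "second"),
    ("photo.jpg", "hello"),
    ("report/report_pdf_page_00001.jpg", "first"),
]
merged = concatenate_pages(rows)
```

Sorting on the numeric page number, rather than on the filename, keeps the pages in order even if the padding width varies.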
