Wednesday, March 5, 2014

Tesseract OCR Engine installation and configuration with Leptonica Library on Ubuntu 12.04 LTS

Hi Guys,

Today we will see how to install and configure the Tesseract OCR engine on an Ubuntu system.
But before we proceed, let me quickly introduce Tesseract.

Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test.

Here are the Tesseract versions and the minimum Leptonica version that each one requires:

Tesseract 3.01 requires at least v1.67 of Leptonica.
Tesseract 3.02 requires at least v1.69 of Leptonica. (Both available in Ubuntu 12.04 Precise Pangolin.)
Tesseract 3.03 requires at least v1.70 of Leptonica. (Both available in Ubuntu 14.04 Trusty Tahr.)
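
If you script your builds, the pairing above can be kept in a small lookup so the script can refuse a Leptonica that is too old. This is only a sketch: the function name min_leptonica_for is made up for illustration, and it knows only the three releases listed here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print the minimum Leptonica
# version for a given Tesseract release, based on the list above.
min_leptonica_for() {
    case "$1" in
        3.01*) echo "1.67" ;;
        3.02*) echo "1.69" ;;
        3.03*) echo "1.70" ;;
        *)     echo "unknown"; return 1 ;;   # release not covered by the list
    esac
}

# Example: min_leptonica_for 3.02.02
```

Any release outside the list makes the function return a non-zero status, so a calling script can stop instead of guessing.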


Here I am showing the steps for installing Tesseract 3.02 with Leptonica v1.70.
Follow these steps for installation and configuration:

1)
First of all, make sure that your system has the required packages.
If they are not already installed, install the following build tools and libraries (Ubuntu):
sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install g++
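
Before starting a long build, it can help to confirm that the build tools from this step are really on your PATH. Here is a minimal sketch; the function name missing_tools is my own, and it only looks for commands, so it says nothing about the -dev libraries.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH. Prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example with the build tools from this step
# (on Ubuntu 12.04 the libtool package ships the libtoolize command):
# missing_tools autoconf automake libtoolize g++
```

For the libraries, "dpkg -s libpng12-dev" (and the same for the other -dev packages) reports whether a package is installed.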
2) 
Now, you need to install the ImageMagick tool, which converts your scanned PDF files to TIFF files (the preferred input format for Tesseract).

You can install ImageMagick with the following command:

sudo apt-get install imagemagick

You can use ImageMagick from the command line like this:

convert -density 300 samplescanneddocument.pdf -depth 8 samplescanneddocument.tif

3)
Now, you need to download and configure the Leptonica Image Processing Library, which is required to compile Tesseract.
But before we go further, let me brief you about Leptonica:

Leptonica is a pedagogically-oriented open source site containing software that is broadly useful for image processing and image analysis applications.
Featured operations are
  • Rasterop (a.k.a. bitblt)
  • Affine transformations (scaling, translation, rotation, shear) on images of arbitrary pixel depth
  • Binary and grayscale morphology, rank order, and convolution
  • Seedfill and connected components
  • Image transformations combining changes in scale and pixel depth
  • Pixelwise masking, blending, enhancement, arithmetic ops, etc.
For more details: visit https://www.leptonica.com

Follow these commands to download and compile the Leptonica Library:
wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
tar -zxvf leptonica-1.70.tar.gz
cd leptonica-1.70/
./autobuild
./configure
make
sudo make install
sudo ldconfig

4)  
Now we can actually get and install Tesseract! But first, remember to go back up one directory from the Leptonica source folder.
cd .. 
5)
Now, let's download and configure Tesseract. Follow these steps:
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -zxvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig


6)
After a successful build, set the TESSDATA_PREFIX environment variable. It must point to the directory that contains your tessdata folder, which is /usr/local/share/ (not to the tessdata folder itself).

You can set it using the following command:
export TESSDATA_PREFIX=/usr/local/share/
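
Keep in mind that export only lasts for the current shell session. To keep the setting, you can add the line to a startup file. The sketch below does that without adding duplicates; the function name persist_tessdata_prefix is hypothetical, and I am assuming you use bash with ~/.bashrc.

```shell
#!/bin/sh
# Hypothetical helper: append the export line to a startup file,
# but only if that exact line is not already there.
persist_tessdata_prefix() {
    rc_file="$1"
    line='export TESSDATA_PREFIX=/usr/local/share/'
    touch "$rc_file"                                   # create the file if missing
    grep -qxF "$line" "$rc_file" || echo "$line" >> "$rc_file"
}

# Example: persist_tessdata_prefix "$HOME/.bashrc"
```

Running it twice is harmless, because the grep check skips the append when the line already exists.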

7)
Now, download the Tesseract English language data and copy it to the TESSDATA_PREFIX location.

cd ..
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
tar -xf tesseract-ocr-3.02.eng.tar.gz
sudo cp -r tesseract-ocr/tessdata $TESSDATA_PREFIX
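
A quick way to confirm the copy worked is to look for the language file where Tesseract expects it. This is a rough sketch with a made-up function name, has_language_data; it assumes the data file is named after the language code with a .traineddata extension, as it is in the English package above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if <prefix>/tessdata/<lang>.traineddata exists.
# $1 = value of TESSDATA_PREFIX, $2 = language code (e.g. eng).
has_language_data() {
    # ${1%/} drops one trailing slash so the path is not doubled up
    [ -f "${1%/}/tessdata/$2.traineddata" ]
}

# Example:
# has_language_data "$TESSDATA_PREFIX" eng && echo "English data found"
```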

8)
That's it!
To use Tesseract, go into the directory with your scanned PDF (or whatever it is). Here I generate both plain text and hocr output:

cd /home/dhaval/Downloads/
convert -density 300 scansmpl.pdf -depth 8 scansmpl.tif
tesseract scansmpl.tif outputtext
tesseract scansmpl.tif outputtext hocr
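
If you do this often, the three commands can be wrapped in one function. Treat this as a sketch only: ocr_pdf is a name I made up, it assumes the input file ends in .pdf, and it names the outputs after the PDF instead of "outputtext". Note that the command itself is lowercase tesseract.

```shell
#!/bin/sh
# Hypothetical wrapper: convert one scanned PDF to TIFF, then run Tesseract
# twice to get plain text and hocr output next to the PDF.
ocr_pdf() {
    pdf="$1"
    base="${pdf%.pdf}"                                  # strip the .pdf extension
    convert -density 300 "$pdf" -depth 8 "$base.tif" || return 1
    tesseract "$base.tif" "$base" || return 1           # plain text: $base.txt
    tesseract "$base.tif" "$base" hocr                  # hocr output file
}

# Example: ocr_pdf scansmpl.pdf
```

The function stops early if the conversion or the first OCR pass fails, so you do not end up running Tesseract on a missing TIFF.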

Here, hocr output records where each piece of recognized text sits on the page, so the text can be laid over the original image. You could use something like hocr2pdf ("sudo apt-get install exactimage") to merge the scan and the hocr output into searchable PDFs.
Verify your hocr configuration:

cd /usr/local/share/tessdata/configs/
sudo vi hocr

Verify that the following line is present, since it is what generates hocr output:

tessedit_create_hocr 1
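
If you would rather not open the file in an editor, the same check can be done with grep. A small sketch, where hocr_enabled is a hypothetical name and the pattern assumes the setting sits on its own line:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given config file enables hocr output,
# i.e. it contains a line reading "tessedit_create_hocr 1".
hocr_enabled() {
    grep -Eq '^[[:space:]]*tessedit_create_hocr[[:space:]]+1[[:space:]]*$' "$1"
}

# Example:
# hocr_enabled /usr/local/share/tessdata/configs/hocr && echo "hocr is enabled"
```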

Now, if you want to integrate Tesseract with Java, there is Tess4J, a Java JNA wrapper for the Tesseract OCR API.

The library provides optical character recognition (OCR) support for:
  • TIFF, JPEG, GIF, PNG, and BMP image formats
  • Multi-page TIFF images
  • PDF document format
So, we are done.
I hope this post is helpful to you.
Please let me know if you have any problems or if you are facing issues integrating Tess4J with Tesseract.

Happy Coding...:)