Hi Guys,
Today, we will see how to install and configure the Tesseract OCR engine on an Ubuntu system.
But before we proceed, let me quickly introduce Tesseract.
Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library, it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test.
For more details, visit https://code.google.com/p/tesseract-ocr/
Here are the Tesseract versions and the Leptonica version each one requires:
Tesseract 3.01 requires at least v1.67 of Leptonica.
Tesseract 3.02 requires at least v1.69 of Leptonica. (Both available in Ubuntu 12.04 Precise Pangolin.)
Tesseract 3.03 requires at least v1.70 of Leptonica. (Both available in Ubuntu 14.04 Trusty Tahr.)
In this post I show the steps for installing Tesseract 3.02 with Leptonica v1.70.
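If you script your builds, the version pairs above can be captured in a small lookup. This is only a sketch: the function name min_leptonica is made up for illustration, and it knows only the three releases listed in this post.

```shell
# Hypothetical helper: print the minimum Leptonica version for a Tesseract release.
# Covers only the three releases listed in this post.
min_leptonica() {
    case "$1" in
        3.01|3.01.*) echo "1.67" ;;
        3.02|3.02.*) echo "1.69" ;;
        3.03|3.03.*) echo "1.70" ;;
        *)           echo "unknown"; return 1 ;;
    esac
}

min_leptonica 3.02
```

Patch releases such as 3.02.02 fall under the same entry as 3.02, and anything else is reported as unknown rather than guessed.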
Now, follow the steps for installation and configuration:
1)
If they are not already installed, you need the following libraries (Ubuntu):
sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install g++
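Before moving on, you may want to confirm that the build tools really are on your PATH. Here is a rough sketch; missing_tools is a hypothetical helper name, and I am assuming that libtoolize is the command provided by the libtool package on your release.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: libtoolize is the command shipped by the libtool package.
missing_tools autoconf automake libtoolize g++
```

If this prints nothing, all four tools were found. Any name it prints still needs to be installed.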
2)
Now, you need to install the imagemagick tool, which helps you convert your scanned PDF files to tif files (the format that works best with Tesseract). You can install imagemagick with the following command:
sudo apt-get install imagemagick
You can use imagemagick from the command line like this:
convert -density 300 samplescanneddocument.pdf -depth 8 samplescanneddocument.tif
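If you have a whole folder of scans, the same convert call can be wrapped in a loop. This is a minimal sketch with the same -density and -depth settings as above; the function names tif_name and convert_all_pdfs are my own and not part of imagemagick.

```shell
# Hypothetical helper: derive the .tif name for a given .pdf path.
tif_name() {
    echo "${1%.pdf}.tif"
}

# Hypothetical helper: convert every PDF in a directory (default: current one).
convert_all_pdfs() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # no PDFs found: the glob stays unexpanded
        convert -density 300 "$pdf" -depth 8 "$(tif_name "$pdf")"
    done
}
```

Call it as convert_all_pdfs /path/to/scans, and each tif file is written next to its PDF.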
3)
Now, you need to download and configure the Leptonica Image Processing Library, which is required to compile Tesseract.
But before we go further, let me briefly introduce Leptonica:
Leptonica is a pedagogically-oriented open source site containing software that is broadly useful for image processing and image analysis applications.
Featured operations are:
- Rasterop (a.k.a. bitblt)
- Affine transformations (scaling, translation, rotation, shear) on images of arbitrary pixel depth
- Binary and grayscale morphology, rank order, and convolution
- Seedfill and connected components
- Image transformations combining changes in scale and pixel depth
- Pixelwise masking, blending, enhancement, arithmetic ops, etc.
Follow these commands to download and compile the Leptonica Library:
wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
tar -zxvf leptonica-1.70.tar.gz
cd leptonica-1.70/
./autobuild
./configure
make
sudo make install
sudo ldconfig
4)
Now we can actually get and install Tesseract. But remember to go back up one directory from the Leptonica build first.
cd ..
5)
Now, let's download and configure Tesseract. Follow these steps:
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -zxvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
6)
After the build and install succeed, set the TESSDATA_PREFIX environment variable to the directory that contains your tessdata folder, which is /usr/local/share/.
You can set it using the following command:
export TESSDATA_PREFIX=/usr/local/share/
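Note that export only affects the current shell session, so the variable is gone when you open a new terminal. One way to keep it is to append the line to your shell startup file. Here is a sketch, assuming you use bash and ~/.bashrc; the helper name persist_tessdata_prefix is made up for this post.

```shell
# Hypothetical helper: append the export line to a startup file, but only once.
persist_tessdata_prefix() {
    rc_file="${1:-$HOME/.bashrc}"
    line='export TESSDATA_PREFIX=/usr/local/share/'
    # -x matches the whole line, -F treats the pattern as a fixed string
    grep -qxF "$line" "$rc_file" 2>/dev/null || echo "$line" >> "$rc_file"
}
```

Run persist_tessdata_prefix once and then open a new terminal. Running it again does not add a second copy of the line.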
7)
Now, download the Tesseract English language data and copy it to the TESSDATA_PREFIX location.
cd ..
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
tar -xf tesseract-ocr-3.02.eng.tar.gz
sudo cp -r tesseract-ocr/tessdata $TESSDATA_PREFIX
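To make sure the copy landed where Tesseract will look, you can check for the language file under the tessdata folder. This is a small sketch; has_traineddata is a hypothetical helper, and it assumes the English data file is named eng.traineddata.

```shell
# Hypothetical helper: succeed if <prefix>/tessdata/<lang>.traineddata exists.
# Defaults: language "eng", prefix from TESSDATA_PREFIX or /usr/local/share/.
has_traineddata() {
    data_lang="${1:-eng}"
    prefix="${2:-${TESSDATA_PREFIX:-/usr/local/share/}}"
    # ${prefix%/} drops a trailing slash so the path has no double slash
    [ -f "${prefix%/}/tessdata/$data_lang.traineddata" ]
}
```

For example, has_traineddata eng && echo "found" prints "found" only when the English data is in place.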
8)
That's it!
To use Tesseract, go into the directory with your scanned PDF (or whatever it is). The commands below produce both plain text and hocr output:
cd /home/dhaval/Downloads/
convert -density 300 scansmpl.pdf -depth 8 scansmpl.tif
tesseract scansmpl.tif outputtext
tesseract scansmpl.tif outputtext hocr
Here, the hocr output lets us pinpoint where the recognized text sits on the original image. You could use something like hocr2pdf ("sudo apt-get install exactimage") to merge the scan and the hocr output back together into searchable PDFs.
Verify your hocr configuration:
cd /usr/local/share/tessdata/configs/
sudo vi hocr
Verify that the following line is present, since it is what enables hocr output:
tessedit_create_hocr 1
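If you would rather not open the file in an editor, the same check can be done with grep. This is a sketch, assuming the config lives at the path used above; hocr_enabled is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the config file sets tessedit_create_hocr to 1.
hocr_enabled() {
    cfg="${1:-/usr/local/share/tessdata/configs/hocr}"
    grep -Eq '^[[:space:]]*tessedit_create_hocr[[:space:]]+1([[:space:]]|$)' "$cfg" 2>/dev/null
}
```

For example, hocr_enabled && echo "hocr output is enabled" prints the message only when the line is there.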
Now, if you want to integrate Tesseract with Java, there is Tess4J, a Java JNA wrapper for the Tesseract OCR API.
The library provides optical character recognition (OCR) support for:
- TIFF, JPEG, GIF, PNG, and BMP image formats
- Multi-page TIFF images
- PDF document format
I hope this post is helpful to you.
Please let me know if you have any problems or if you run into issues integrating Tess4J with Tesseract.
Happy Coding...:)