Free OCR solution on OS X

Posted on October 30, 2013 • 2 minutes • 321 words

Tesseract OCR is probably the best open source OCR engine available. It allows you to convert text from an image.

Install tesseract

I suppose you already have homebrew installed. If not, copy and paste this into Terminal.

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

Once you got homebrew, installing tesseract is as simple as brew install tesseract

Automate the process

tesseract accepts input as image. If you input is PDF, you will have to convert it to image first, perferably TIFF format. You probably will want to automate this process with Automator.

Fire up Automator and create a new application.

Search for the action “Render PDF Pages as Images” and drag it to the right. Change the format to TIFF and increase DPI to 300 (The higher, the better accuracy rate)

Search and drag action “Rename Finder Items” to the workflow on the right and change it as the following image. Remember to tick “Make all number 4 digits long” just to ensure the numbering is right if you got lots of pages in the PDF file.

Add action “Move Finder Items” to the workflow. Select your desired folder as you see fit.

Now, this is the part where where OCR begins. We will convert each image to text by passing each image to tesseract. We then will append the output to our output.txt file. Remember to change tesseract path according to your tesseract version number. (It’s 3.02.02 as of this post)

The last step in the script is removing the images and the buffered text files. It’s optional.

rm ~/Desktop/out.txt
for f in "$@"
do
  /usr/local/Cellar/tesseract/3.02.02/bin/tesseract "$f" "$f"
  cat "$f.txt" >> ~/Desktop/out.txt
  rm "$f" "$f.txt"
done

The last step looks like this in Automator

Save this Automator workflow to Desktop or wherever you want.

Next time you need to OCR a PDF document, just drag and drop it over the application’s icon. The output text will be saved as output.txt on Desktop.