C-Command Software Forum

Mostly open-source OCR workflow with Tesseract

I thought folks might be interested in my new EagleFiler OCR workflow.

I decided to stop using PDFpen after it removed some functionality to coerce users into a paid “upgrade” and broke an important file in the process. Getting the open-source Tesseract engine to work is fairly complicated, but fortunately someone wrote a Python script, PyPDFOCR, that makes it much easier to run Tesseract on multipage PDFs on OS X. All of the dependencies can be installed with Homebrew:



brew install tesseract
brew install ghostscript
brew install poppler
brew install imagemagick

And then:

pip install pypdfocr
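
Before wiring this into any automation, it can help to confirm that everything actually ended up on your PATH. Here is a minimal sketch of such a check. The function name missing_tools is made up for illustration, and the executable names in the example call (gs for Ghostscript, pdftoppm for Poppler, convert for ImageMagick) are my assumptions about what those Homebrew packages install, so adjust them if your versions differ.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command passed as an
# argument that cannot be found on PATH. Prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (executable names are assumptions, see above):
# missing_tools tesseract gs pdftoppm convert pypdfocr
```

If the example call prints nothing, every listed command was found.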

Instead of using PyPDFOCR's folder-watch switch, I use Hazel to monitor a folder (~/Downloads/PDF) for new PDFs and trigger the Python script whenever it finds one, i.e.:

pypdfocr "$1" -f -c ~/pypdfocr-config.yaml

$1 is the file to be imported. The config file (pypdfocr-config.yaml) includes a line like this, which moves OCR’d PDFs to my EagleFiler “To Import” folder:

default_folder: "/PATH/TO/EAGLEFILER/LIBRARY/To\ Import\ (Library)/"
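
For the Hazel step, a slightly more defensive version of the embedded script might look like the sketch below. It is not what I originally ran. The function names is_pdf and ocr_and_file are made up for illustration, and the only pypdfocr options used are the -f and -c switches from the command above. The main points are quoting "$1" so that file names with spaces survive, and skipping anything that is not a PDF.

```shell
#!/bin/sh
# Hypothetical Hazel wrapper. Hazel passes the matched file as $1.
CONFIG="$HOME/pypdfocr-config.yaml"   # same config path as in the post

# Succeed only for an existing regular file whose name ends in .pdf
# (any letter case).
is_pdf() {
    [ -f "$1" ] || return 1
    case "$1" in
        *.[Pp][Dd][Ff]) return 0 ;;
        *) return 1 ;;
    esac
}

# OCR and file a single PDF. Quoting "$1" keeps paths with spaces intact.
ocr_and_file() {
    is_pdf "$1" || { echo "skipping non-PDF: $1" >&2; return 1; }
    pypdfocr "$1" -f -c "$CONFIG"
}

# In Hazel's embedded script, the call would then be:
# ocr_and_file "$1"
```

Keeping the check in its own function makes it easy to try out in Terminal before trusting it to Hazel.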

It took me a couple of years to arrive at this. Hope others find it useful.


Great solution

This works really, really well! Thank you.