I thought folks might be interested in my new EagleFiler OCR workflow.
I decided to stop using PDFpen after it removed some functionality to coerce users into a paid “upgrade,” and broke an important file in the process. Getting the open-source Tesseract engine to work is pretty complicated, but fortunately someone wrote a Python script that makes it much easier to run Tesseract on multipage PDFs on OS X. All the dependencies can be installed with Homebrew.
https://github.com/virantha/pypdfocr
So:
brew install tesseract
brew install ghostscript
brew install poppler
brew install imagemagick
And then:
pip install pypdfocr
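Before wiring this into Hazel, it may be worth confirming that everything actually landed on your PATH. Here is a minimal sketch; the `missing_tools` function is my own helper, and the binary names (`gs` for Ghostscript, `pdftoppm` for Poppler, `convert` for ImageMagick) are assumptions that may differ depending on the versions Homebrew installs:

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names: gs (Ghostscript), pdftoppm (Poppler), convert (ImageMagick).
missing=$(missing_tools tesseract gs pdftoppm convert pypdfocr)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All OCR dependencies found."
fi
```

If anything is reported missing, re-running the matching brew or pip command is the first thing to try.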
Instead of using PyPDFOCR's folder-watch switch, I use Hazel to monitor a folder (~/Downloads/PDF) for new PDFs and trigger the Python script when it finds one, i.e.:
pypdfocr "$1" -f -c ~/pypdfocr-config.yaml
$1 is the file to be imported (Hazel passes in the path of the matched PDF). The pypdfocr-config.yaml file includes a line like this, to move OCR’d PDFs to my EagleFiler “To Import” folder:
default_folder: "/PATH/TO/EAGLEFILER/LIBRARY/To\ Import\ (Library)/"
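For anyone who wants the Hazel step to be a bit more defensive, here is a rough sketch of how the embedded shell script could look. The `ocr_and_file` function and the `PYPDFOCR` and `CONFIG` variables are names I made up for illustration; only the pypdfocr command line itself comes from the setup above:

```shell
# Hypothetical helper for a Hazel "Run shell script" action.
# PYPDFOCR and CONFIG are illustrative variables, not part of pypdfocr itself.
PYPDFOCR="${PYPDFOCR:-pypdfocr}"
CONFIG="${CONFIG:-$HOME/pypdfocr-config.yaml}"

ocr_and_file() {
    pdf="$1"
    # Refuse to run on a missing file so the rule fails visibly.
    if [ ! -f "$pdf" ]; then
        echo "Not a file: $pdf" >&2
        return 1
    fi
    # Quoting "$pdf" keeps paths containing spaces intact.
    "$PYPDFOCR" "$pdf" -f -c "$CONFIG"
}

# In Hazel, the matched file arrives as $1:
# ocr_and_file "$1"
```

Hazel may run scripts with a more limited PATH than your Terminal session, so if the command is not found you might need to set PYPDFOCR to the full path that `which pypdfocr` prints.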
It took me a couple of years to arrive at this. Hope others find it useful.