I thought folks might be interested in my new EagleFiler OCR workflow.
I decided to stop using PDFpen after it removed some functionality to coerce users into a paid “upgrade” and broke an important file in the process. Getting the open-source Tesseract engine to work is pretty complicated, but fortunately someone wrote a Python script, PyPDFOCR, that makes it much easier to run Tesseract on multipage PDFs on OS X. All the dependencies can be installed with Homebrew:
brew install tesseract
brew install ghostscript
brew install poppler
brew install imagemagick
pip install pypdfocr
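Before wiring anything into Hazel, it can help to confirm the tools actually ended up on your PATH. This is only a rough sketch: missing_tools is a made-up helper name, and the command names in the usage line (gs from ghostscript, pdftoppm from poppler, convert from imagemagick) are the binaries I would expect those packages to install, so adjust them if your setup differs.

```shell
# Print the name of every command given as an argument that is NOT on PATH.
# (missing_tools is a hypothetical helper, not part of Homebrew or PyPDFOCR.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (prints nothing if everything is installed):
#   missing_tools tesseract gs pdftoppm convert pypdfocr
```

Anything it prints is a tool the shell could not find, which usually means that install step needs another look.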
Instead of using PyPDFOCR's folder-watch switch, I use Hazel to monitor a folder (~/Downloads/PDF) for new PDFs and trigger the Python script when it finds one, i.e.:
pypdfocr "$1" -f -c ~/pypdfocr-config.yaml
$1 is the file to be imported. The pypdfocr-config.yaml file includes a line like this, to move OCR’d PDFs to my EagleFiler “To Import” folder:
default_folder: "/PATH/TO/EAGLEFILER/LIBRARY/To\ Import\ (Library)/"
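If you want the Hazel action to be a bit more defensive, the one-liner can be wrapped in a small script. This is a minimal sketch, not something I have tested at length: ocr_and_file, OCR_CMD and OCR_CONFIG are names invented for illustration, and it assumes Hazel passes the matched file as $1 as described above. It skips anything that is not an existing .pdf file and quotes the path so filenames with spaces survive.

```shell
#!/bin/sh
# Hypothetical wrapper for a Hazel "run shell script" action.
# OCR_CMD and OCR_CONFIG are illustrative names; override them if needed.
OCR_CMD="${OCR_CMD:-pypdfocr}"
OCR_CONFIG="${OCR_CONFIG:-$HOME/pypdfocr-config.yaml}"

ocr_and_file() {
    pdf="$1"
    # Only handle files whose name ends in .pdf (either case).
    case "$pdf" in
        *.pdf|*.PDF) ;;
        *) echo "skipping non-PDF: $pdf" >&2; return 1 ;;
    esac
    # Make sure the file is really there before calling the OCR tool.
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    # Same flags as the one-liner: -f files the result, -c names the config.
    "$OCR_CMD" "$pdf" -f -c "$OCR_CONFIG"
}

# Run only when a file argument was supplied (Hazel supplies it as $1).
if [ "$#" -gt 0 ]; then
    ocr_and_file "$1"
fi
```

Setting OCR_CMD=echo is a cheap way to do a dry run and see exactly what command would be executed for a given file.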
It took me a couple of years to arrive at this. Hope others find it useful.