Is there any way to EXCLUDE documents from indexes to speed up searches?

Rbohn · March 25, 2014, 12:15pm

This is a rather strange request, driven by speed issues in searching. EF searches are very fast, but I am OCR’ing a number of old documents, and fear that searches are starting to slow down. I’m wondering if a solution is to not include certain documents in the indexes.

DB description: I have ~5000 old aircraft manuals and related documents in PDF format, totaling 45 GB (the large size is due to very inefficient scanning of some of them). Many manuals are several hundred pages. Because of poor image quality, the OCR of many of the old documents contains a lot of nonsense words and spelling errors. I wonder if this increases the size of the index, since presumably every pseudo-word has to be in the index. I am also indexing by phrases, which increases index size.
The result seems to be that as I OCR more of these old documents, searches are getting slower. The effect is not bad yet, but I’m thinking ahead.

One workaround would be to exclude these crappy old OCR’d documents from indexing. Another solution would be to not OCR them; but when I work with an individual document, I want to search it, copy from it, etc. A third solution would be to create a separate database for these old manuals, so that the rest of my database does not take a performance hit.

I don’t know any way to tell EF to not index certain documents. Does anyone have suggestions?
I realize this is a very special-purpose request, and creating another database is always a feasible solution.

Michael_Tsai · March 25, 2014, 2:17pm

Please see "How does indexing in EagleFiler work?" You could tag the files as ef_noindex (to prevent them from being indexed) and then rebuild the Record index (to clean the unwanted files out of the index).

You are right that long documents with lots of text will use a lot of space in the index, especially when using phrase indexing.
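To illustrate the point about index size, here is a minimal sketch of a toy index in Python. It is not how EagleFiler's index actually works; the functions `index_terms` and `add_ocr_noise`, the sample sentence, and the 20% error rate are all made up for the illustration, and phrase indexing is modeled simply as storing every adjacent word pair. It only shows the general effect: OCR misreadings add distinct keys, and phrase keys add more on top.

```python
import random
import string


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def index_terms(tokens, phrases=False):
    """Return the set of distinct index keys for one document.

    With phrases=True, every adjacent word pair is stored as an extra key.
    This is an assumed, simplified model of phrase indexing.
    """
    keys = set(tokens)
    if phrases:
        keys.update(zip(tokens, tokens[1:]))
    return keys


def add_ocr_noise(tokens, error_rate, rng):
    """Replace one character in a fraction of the tokens to mimic OCR errors."""
    noisy = []
    for tok in tokens:
        if rng.random() < error_rate:
            pos = rng.randrange(len(tok))
            tok = tok[:pos] + rng.choice(string.ascii_lowercase) + tok[pos + 1:]
        noisy.append(tok)
    return noisy


rng = random.Random(0)  # fixed seed so repeated runs give the same counts
page = tokenize("check the fuel pump pressure before starting the engine " * 200)

clean_words = len(index_terms(page))
noisy_page = add_ocr_noise(page, error_rate=0.2, rng=rng)
noisy_words = len(index_terms(noisy_page))
noisy_phrases = len(index_terms(noisy_page, phrases=True))

print("distinct words, clean text:      ", clean_words)
print("distinct words, noisy OCR text:  ", noisy_words)
print("distinct keys with phrase pairs: ", noisy_phrases)
```

The clean text repeats the same eight words, so its vocabulary stays tiny no matter how long the document is. Each misread word, on the other hand, is a new key that the index has to carry.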

Rbohn · June 18, 2014, 3:38pm

Follow-up
I am using both methods: I have tagged the largest files as ef_noindex, and I have turned off indexing by phrase.
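For anyone who wants to find the largest files before tagging them, a rough sketch in Python follows. It only lists the biggest PDFs under a folder; it does not talk to EagleFiler or apply tags, and the `largest_pdfs` helper and the library path are hypothetical. File size is also only a rough proxy for how much text ends up in the index, since inefficient scans can be large without containing much text.

```python
from pathlib import Path


def largest_pdfs(root, count=20):
    """Return (size_in_bytes, path) pairs for the `count` largest PDFs under root."""
    sizes = [
        (p.stat().st_size, p)
        for p in Path(root).rglob("*.pdf")
        if p.is_file()
    ]
    return sorted(sizes, key=lambda item: item[0], reverse=True)[:count]


if __name__ == "__main__":
    # Hypothetical location of the library's files; adjust to your own setup.
    library = Path.home() / "Documents" / "Manuals"
    if library.is_dir():
        for size, path in largest_pdfs(library):
            print(f"{size / 1e6:8.1f} MB  {path}")
```

The files it lists would still need to be tagged by hand in EagleFiler.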

I am also contemplating splitting my database into two parts, which should help a lot. And then one of the two may go onto Dropbox…