Corpus Bloated with Meaningless Word Fragments

steppinwolf · December 28, 2024, 2:17am

Hello. SpamSieve is working very well for me after correcting several spammy text strings that were auto-trained as good weeks or months ago but not noticed at the time. I manually retrained those mistakes as spam when possible or, if training was greyed out, pruned them from the Allowlist. For a couple of frequently appearing spam strings that often slipped through, I manually added Blocklist rules. Finally, I also nudged the Bayesian classifier one notch more aggressive.

The biggest remaining question for me is about the corpus getting bloated with meaningless strings. After several years of using SpamSieve, my corpus contains over a million words. The size doesn’t seem to be an issue so far. But I’ve noticed that hundreds of thousands of those words are meaningless strings that are marked as good (see screenshot).

Perhaps it’s best to leave it alone since everything is working well. And after years of careful training, I have no interest in restarting from scratch. However, I wish there were ways to avoid and/or remove some of this bloat.

In SpamSieve help, I see mention of the ability to delete unwanted records from the corpus. But I don’t seem to have this capability. Highlighting one or more records and pressing Del or Backspace does nothing.

Of course, pruning such large numbers of records would be a massive and potentially error-prone operation. But I can think of ways to safely filter a lot of it using regex searches.
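To illustrate the kind of regex filtering I have in mind, here is a rough Python sketch. It assumes the corpus words were available as a plain list of strings, which is purely hypothetical since I'm not aware of a supported way to export them. The patterns, thresholds, and the names `is_noise` and `split_corpus` are my own guesses for illustration, not anything SpamSieve uses.

```python
import re

# Hypothetical heuristics for spotting meaningless fragments.
# The patterns and length thresholds are assumptions and would need tuning.
NOISE_PATTERNS = [
    # Long runs of hex digits (hashes, message IDs, tracking tokens).
    re.compile(r"^[0-9a-f]{16,}$", re.IGNORECASE),
    # Long blobs mixing letters and digits (base64-like fragments).
    re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9+/=_-]{20,}$"),
    # Eight or more consonants in a row with no vowels at all.
    re.compile(r"^[b-df-hj-np-tv-z]{8,}$", re.IGNORECASE),
]


def is_noise(token: str) -> bool:
    """Return True if the token matches any of the noise heuristics."""
    return any(pattern.search(token) for pattern in NOISE_PATTERNS)


def split_corpus(words):
    """Split words into (keep, prune_candidates) without deleting anything."""
    keep, candidates = [], []
    for word in words:
        (candidates if is_noise(word) else keep).append(word)
    return keep, candidates


if __name__ == "__main__":
    sample = ["invoice", "3f9a1c0b7e2d4a6f8b1c", "xkcdqwrtpl", "meeting"]
    keep, candidates = split_corpus(sample)
    print("keep:", keep)
    print("review before pruning:", candidates)
```

The sketch only sorts words into two lists so the candidates can be reviewed by hand first. Short real words that happen to look like hex (for example "deadbeef") stay below the length thresholds and are kept.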

Michael_Tsai · December 28, 2024, 2:44am

If the entries are grayed out in the log (because the messages were received longer ago than your storage settings allow), you may still be able to find them in the Corpus window or in your mail client. That would allow you to fix the words in the corpus (and the training will also fix the allowlist).

The corpus size shouldn’t really cause problems because with SpamSieve 3 the corpus is no longer loaded in RAM. That said, I’m working on a feature to auto-prune words that have not been used much.

The mention of deleting records is a mistake in the help, as that feature was removed in SpamSieve 3. I think the auto-pruning will be easier, faster, and less error-prone.

steppinwolf · December 28, 2024, 4:08am

Thanks for the quick support! For some reason it didn’t occur to me to retrain from inside the corpus when something is no longer available to train in the log or email client. I’ll try that in the future.

An auto-pruning feature for the corpus sounds awesome. I agree that would be easier, faster and safer.