Hello. SpamSieve is working very well for me after correcting several spammy text strings that were auto-trained as good weeks or months ago but went unnoticed at the time. I manually retrained those mistakes as spam when possible, or, if training was greyed out, pruned them from the Allowlist. For a couple of frequently appearing spam strings that often slipped through, I manually added Blocklist rules. Finally, I also nudged the Bayesian classifier one notch more aggressive.
The biggest remaining question for me is about the corpus getting bloated with meaningless strings. After several years of using SpamSieve, my corpus contains over a million words. The size doesn’t seem to be an issue so far. But I’ve noticed that hundreds of thousands of those words are meaningless strings that are marked as good (see screenshot).
Perhaps it’s best to leave it alone since everything is working well. And after years of careful training, I have no interest in restarting from scratch. However, I wish there were ways to avoid and/or remove some of this bloat.
In the SpamSieve help, I see mention of the ability to delete unwanted records from the corpus. But I don’t seem to have this capability. Highlighting one or more records and pressing Del or Backspace does nothing.
Of course, pruning such large numbers of records would be a massive and potentially error-prone operation. But I can think of ways to safely filter a lot of it using regex searches.
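To illustrate the kind of regex filtering I have in mind, here is a rough sketch. It assumes the corpus words are available as a plain list of strings, which is purely hypothetical; I don't know whether SpamSieve offers such an export. The three patterns and the function names are my own illustrative guesses at what counts as "meaningless", not anything SpamSieve itself uses, and the result is only a list of candidates to review, not something to delete blindly.

```python
import re

# Illustrative heuristics only (assumptions, not SpamSieve behavior).
# 1. Long runs of hexadecimal characters, such as hashes or message IDs.
LONG_HEX = re.compile(r"^[0-9a-f]{16,}$", re.IGNORECASE)
# 2. Eight or more letters with no vowels at all (y is not counted as a consonant here).
NO_VOWELS = re.compile(r"^[b-df-hj-np-tv-xz]{8,}$", re.IGNORECASE)
# 3. Long tokens mixing letters and digits, such as base64 fragments.
MIXED_ALNUM = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9+/=_-]{20,}$")

PATTERNS = (LONG_HEX, NO_VOWELS, MIXED_ALNUM)


def looks_meaningless(word):
    """Return True if the word matches any of the heuristic patterns."""
    return any(p.match(word) for p in PATTERNS)


def split_candidates(words):
    """Split words into (keep, prune_candidates) without deleting anything."""
    keep, prune = [], []
    for w in words:
        (prune if looks_meaningless(w) else keep).append(w)
    return keep, prune


if __name__ == "__main__":
    # Hypothetical sample; a real list would come from the corpus.
    sample = ["invoice", "meeting2024", "xkqzwrtpl", "3f9a7c1e5b2d4f6a8c0e"]
    keep, prune = split_candidates(sample)
    print("keep:", keep)
    print("review for pruning:", prune)
```

The length thresholds are deliberately conservative so that ordinary words and short tokens like order numbers stay untouched. Anything flagged would still need a manual look before removal.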