C-Command Software Forum

Just installed -- do I need to reset my Corpus already?


I just installed SpamSieve, and I must have missed the part in the instructions where it said that you should train the program with only about 1000 messages, of which about 65% should be spam.

I have gotten many gigabytes of spam over the years, and I used to keep a copy of every spam message I ever got. But a few years back, that became too much hassle, and I stopped keeping track. At the moment, I have only about 600 Junk messages that are currently in my trash in Mail.app – all a month old or less. So, I figured that might be a good part of the training process. Mind you, these are only the messages that make it through the spam-filtering appliances at my employer and at the ISP I use for hosting my vanity domain. When I helped run the fleet of spam-filtering appliances at UT Austin, we had a spam block rate of 99% or better, with virtually no false positives. I don’t think that these appliances are as effective as the ones we had at UT Austin, but I have to believe that they’re still catching more than 90-95% of the incoming spam.

I think my big mistake was in selecting the messages for training as “Good”. I think either Mail.app or SpamSieve or maybe both died somewhere around the 24,000 mark, and I’ve now got almost 7000 whitelist rules, which I think may be way more than I want/need.

So, is it already time for me to dump my current corpus and re-train again, so that my set of whitelists is not excessively large?


7,000 whitelist rules is fine. But you should reset the corpus and re-train SpamSieve with a smaller number of messages, about 65% spam.
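
As a rough illustration of that mix, here is a small Python sketch that works out how many spam and good messages to select. It is not part of SpamSieve; the function name training_mix and its parameters are made up for this example, and the only inputs taken from this thread are the cap of about 1000 messages and the target of about 65% spam.

```python
def training_mix(spam_available, spam_percent=65, max_total=1000):
    """Return (spam_count, good_count) for a training set.

    Hypothetical helper: the 65% and 1000-message figures come from
    the advice in this thread, everything else is an assumption.
    """
    # Largest total the available spam can support at the target percentage,
    # capped at the recommended maximum size.
    total = min(max_total, spam_available * 100 // spam_percent)
    # Round to the nearest whole message using integer arithmetic.
    spam_count = min(spam_available, (total * spam_percent + 50) // 100)
    good_count = total - spam_count
    return spam_count, good_count


# With roughly 600 spam messages on hand:
print(training_mix(600))   # (600, 323)
# With far more spam than needed, the 1000-message cap applies:
print(training_mix(5000))  # (650, 350)
```

So with the roughly 600 Junk messages mentioned above, about 320 good messages would give a mix close to the suggested ratio.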


However, I do wonder if there is an easy way to reset the whitelist and blacklist rules back to their defaults? I looked in the documentation and didn’t find anything obvious, but I may not have been looking in the right places.

Also, is there a good list of user-supplied blacklist rules anywhere? I saw the thread with Feste entitled RegEx for Blocklisting From header with ill-formed email addresses, and that seems to be a good one to me. But maybe there are others for matching on non-ASCII charsets, or messages that include unmarked 8-bit binary characters?


The Reset Corpus command is only for resetting the corpus, which is the primary way that SpamSieve filters mail. The whitelist and blocklist can easily be reset by opening the window, selecting all the rules, and pressing Delete. However, I don’t think it’s necessary for you to do this.

It is generally not necessary or recommended to create lots of user-supplied rules. SpamSieve will automatically learn to filter the messages as you train it. The capability for user-created rules is there for expert users with non-typical needs.