I just installed SpamSieve, and I must have missed the part in the instructions where it said that you should train the program with only about 1000 messages, of which about 65% should be spam.
I have gotten many gigabytes of spam over the years, and I used to keep a copy of every spam message I ever got. But a few years back, that became too much hassle, and I stopped keeping track. At the moment, I have only about 600 Junk messages in my trash in Mail.app – all a month old or less. So, I figured they might be a good starting point for the training process. Mind you, these are only the messages that make it through the spam-filtering appliances at my employer and at the ISP I use for hosting my vanity domain. When I helped run the fleet of spam-filtering appliances at UT Austin, we had a spam block rate of 99% or better, with virtually no false positives. I don’t think that these appliances are as effective as the ones we had at UT Austin, but I have to believe that they’re still catching more than 90-95% of the incoming spam.
I think my big mistake was in selecting the messages for training as “Good”. I think either Mail.app or SpamSieve or maybe both died somewhere around the 24,000 mark, and I’ve now got almost 7,000 whitelist rules, which I think may be way more than I want or need.
So, is it already time for me to dump my current corpus and retrain, so that my set of whitelist rules is not excessively large?