Self-updating the corpus

Aaron · March 25, 2010, 4:12pm

I don’t like having to reset my corpus and retrain. Can’t this be done automatically from a global corpus that everyone contributes to?

Michael_Tsai · March 25, 2010, 5:02pm

Why are you resetting the corpus? That’s not normally necessary.

As for a global corpus: no, because the accuracy is much better if you train SpamSieve with your own mail.

Aaron · March 25, 2010, 5:32pm

SS was prompting me to do so.

Isn’t there something that can be done globally to improve accuracy? E.g. catching new kinds of spam that I haven’t yet trained it on locally.

Michael_Tsai · March 25, 2010, 5:55pm

What did it say, exactly? I can’t think of any situations in which a recent version of SpamSieve would tell you to reset the corpus.

The changes in spam are usually not so dramatic that SpamSieve would suddenly go from catching them to letting lots of them through. Instead, it would start to see them as spam but closer to the borderline, or perhaps a couple would get through. So SpamSieve’s focus is on quick learning and staying up-to-date on new types of spam that you receive (even when you aren’t training it, because it isn’t making mistakes). I think this makes more sense than having a global system feeding in data that may not have any relation to the spam that you’re receiving (or will receive). The current system seems to be working very well for people, although if you’re not seeing good results, please send in a report.