Can I simply delete old corpus entries?


Longtime, satisfied SpamSieve user. About every 12-18 months, I find that my corpus has (surprise!) grown large and my accuracy has dropped off sharply.

Today's stats:

Filtered Mail
8,759 Good Messages
19,105 Spam Messages (69%)
37 Spam Messages Per Day

SpamSieve Accuracy
82 False Positives
467 False Negatives (85%)
98.0% Correct

Corpus
3,150 Good Messages
5,101 Spam Messages (62%)
278,600 Total Words

12,949 Blocklist Rules
10,072 Whitelist Rules

Showing Statistics Since
1/1/08 9:40 PM

The specter of resetting the corpus and retraining from scratch loomed before me. I shuddered.

So today I tried something different. I went into the two rule lists and deleted all entries with a zero in the hits field. I then went into the corpus and deleted everything from 2008. This trimmed all three considerably.

Will this work or do I really have to suck it up, reset the corpus and start fresh?

That will speed it up, but it won’t affect the accuracy. Resetting the corpus and re-training shouldn’t be a big deal because you only need to use a few hundred messages these days. Of course, if the accuracy has dropped off suddenly, you should first check that the problem is actually with the corpus, rather than in the settings for your mail program or SpamSieve.

Okay, I’ll start saving up my spam until I reach 400 and then retrain with those and 240 good messages I have already saved and filed.

But what numbers do I want to see in the Stats window after retraining?

Right now it shows 98% correct and 85% false negatives. Shouldn't the correct number be higher, like 99% or more? And what about the false negative percentage? I've seen higher than 85%, but that stat is usually much lower in the stats posted by others on this forum.

Yes, it should be above 99%. Right now, you have it set to show the average since January 2008. You would need to change the date in order to track the more recent statistics.

You want most of the mistakes to be false negatives rather than false positives, but in most cases there’s probably not much you can do to affect this number, as it depends on the kind of mail you receive. If the overall accuracy is good, this number should be good, too.
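For anyone who wants to check the numbers, the percentages in the stats posted above can be reproduced from the raw counts. This is only a rough sketch in Python. It assumes that "Correct" is the share of all filtered messages that were not counted as a mistake, and that the percentage next to False Negatives is their share of all mistakes. Those assumptions match the posted figures, but they are a reading of the numbers, not a documented formula, and the function name is made up for illustration.

```python
def filter_stats(good, spam, false_pos, false_neg):
    """Recompute Stats-window style percentages from raw counts.

    Assumed definitions (inferred from the posted figures, not documented):
      spam_share = spam / all filtered messages
      correct    = (all filtered messages - mistakes) / all filtered messages
      fn_share   = false negatives / all mistakes
    """
    total = good + spam
    mistakes = false_pos + false_neg
    return {
        "spam_share": spam / total,
        "correct": (total - mistakes) / total,
        "fn_share": false_neg / mistakes if mistakes else 0.0,
    }


# Counts taken from the stats posted earlier in this thread.
stats = filter_stats(good=8759, spam=19105, false_pos=82, false_neg=467)

print(f"{stats['spam_share']:.0%} spam")
print(f"{stats['correct']:.1%} correct")
print(f"{stats['fn_share']:.0%} of mistakes are false negatives")
```

Under those assumptions, the 85% figure says that most of the mistakes were missed spam rather than good mail caught as spam. Since the counts run from the "Showing Statistics Since" date, the same calculation on counts collected after a more recent date would show the current accuracy rather than the long-run average.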