corpus 'larger than necessary' (again)

ophiochos · February 19, 2007, 9:09am

I get the tip ‘SS’s corpus is larger than necessary…’. I got this about a year ago, ditched the corpus and started again. I faithfully train it with mistakes. It seems frustrating to go back to square one (last time it took quite a bit of training to get things right again) and I don’t want to lose whitelists/blacklists and so on. Accuracy is not that high though: most spam scores above 20 and most legit email scores below 10, but that’s not a huge margin of error.

So what can I do? Keep training and ignore this tip? Or go through it all again? (And is there a way to make sure I don’t get this again in another year or so?) I turned off ‘allow duplicates’ (both settings) some time ago and also ‘auto-train’.

The corpus contains 4,434 messages and 166,325 words, and I’m using Mailsmith.

cheers

Michael_Tsai · February 19, 2007, 10:37am

How many messages did you start with? It’s recommended to use 1,000 or fewer.

Good. How much mail do you get? If you started with 1,000 messages and now have 4,434 messages, that seems like an unusually high number of mistakes, unless you get a ton of mail.

If you save up some good and spam messages before resetting and re-training, it should start out with a high level of accuracy. However, I do realize that re-training is a bit of a pain, and I’m working to improve this.

Resetting the corpus does not affect the whitelist or blocklist.

What does the Statistics window say that the accuracy is? (Try setting the date to about a month ago to see the recent accuracy.) Good messages should have scores below 50, and spam messages should have scores above 50. So it’s good if the good messages have scores below 10—that means SpamSieve was very sure that they were good. I don’t understand what you mean about the 20, though.

ophiochos · February 20, 2007, 1:07am

stats
sorry, I should have posted the stats yesterday.

Filtered Mail
15,472 Good Messages
9,239 Spam Messages (37%)
75 Spam Messages Per Day

SpamSieve Accuracy
17 False Positives
632 False Negatives (97%)
97.4% Correct

Corpus
1,684 Good Messages
2,754 Spam Messages (62%)
166,431 Total Words

Rules
2,890 Blocklist Rules
7,465 Whitelist Rules

Showing Statistics Since
20/10/2006 12:00

the last month’s are similar (96.7% correct, 79 messages per day). I guess this answers it really, that it’s accurate and there’s no urgency about pruning. My concern is that spam seems to change its content these days, so I can’t archive a set of messages for re-use when I DO later prune it. What exactly can go wrong with a large corpus? So many words that the specificity of my usual emails and the normal spam gets lost?
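As a rough cross-check of the figures above, the percentages can be recomputed from the raw counts. This is only a sketch: the function name `summarize` is made up for illustration, and the reading of the percentage next to False Negatives as "share of all mistakes" is an assumption inferred from the numbers, not something stated in the thread.

```python
def summarize(good, spam, false_pos, false_neg):
    """Recompute Statistics-style percentages from raw message counts."""
    total = good + spam                # all filtered messages
    errors = false_pos + false_neg     # all mistakes, either direction
    return {
        # share of filtered mail that was spam
        "spam_share": round(100 * spam / total),
        # ASSUMPTION: the "(97%)" beside False Negatives is their share of all errors
        "fn_share_of_errors": round(100 * false_neg / errors),
        # percentage of messages classified correctly, to one decimal place
        "accuracy": round(100 * (total - errors) / total, 1),
    }

stats = summarize(good=15472, spam=9239, false_pos=17, false_neg=632)
print(stats)
```

Under that assumption the results come out as 37, 97 and 97.4, which agrees with the 37%, 97% and 97.4% shown in the posted statistics.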

ophiochos · February 20, 2007, 1:20am

point by point
I’m having trouble posting a set of responses in Camino (it claims there’s no content) so just in sequence to your questions:

How much mail? a Lot:-)

I did start with about 900 last time but didn’t keep the messages I used.

good that resetting does not affect whitelists etc. thanks for that info.

you don’t understand what I mean about 20: typically, a spam message that makes it through to my inbox has a score of about 25. The majority of good messages have only 2 or less. The ones that get caught are higher, 50 and above. But some list messages get over 25, especially ones from a LaTeX list (all those funny words?), so I have to leave it at that level.

thanks for the clarifications Michael. keep up the good work!

Jason Davies

Michael_Tsai · February 20, 2007, 6:27am

Actually, I think 97% is kind of low. It should be over 99%.

Pruning was a new feature in SpamSieve 1.1 that would semi-automatically remove words from the corpus that hadn’t been used in a while. This was purely to reduce SpamSieve’s memory usage; it had no positive effect on accuracy, and often a negative one if people pruned too much, which they tended to do. In more recent versions, SpamSieve is more selective about which messages it trains itself with. This prevents the corpus from growing as fast as it once did, and thus reduces the need for pruning. Since there was no longer much need to prune, and there were many potential downsides if it wasn’t used properly, I removed the feature in SpamSieve 2.3.1. Now the options are to either keep growing a corpus or to reset it (i.e. start from scratch).

Exactly. Since spam is changing, it’s always best to have SpamSieve trained with recent messages. Rather than archiving a set of messages to re-train with, you should simply start saving the spam that you receive a few days ahead of time, so that when you reset the corpus you already have what you need.

It’s mostly a kind of inertia. When there are a lot of messages, SpamSieve will be slower to adapt to changes in spam because each new message that’s added is a smaller fraction of the whole. So in order to counteract previous information that SpamSieve has learned, which was once correct but which no longer holds for current spam, you’ll have to train it with more messages. And then the corpus will be even larger, making it even slower to turn.
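The inertia effect can be illustrated with a toy model. The sketch below is not SpamSieve’s actual algorithm: it assumes a word’s "spamminess" is simply spam_hits / (spam_hits + good_hits), and the function name `trainings_needed` is hypothetical. It counts how many new spam messages containing a word must be trained before that ratio tips past one half.

```python
def trainings_needed(old_good_hits, target=0.5):
    """Count new spam trainings needed before a word's spam ratio exceeds `target`.

    Toy model (an assumption, not SpamSieve's method):
        ratio = spam_hits / (spam_hits + good_hits)
    `old_good_hits` must be at least 1.
    """
    spam_hits = 0
    while spam_hits / (spam_hits + old_good_hits) <= target:
        spam_hits += 1  # train one more spam message containing the word
    return spam_hits

# A word seen in 1% of the good messages, at two corpus sizes
small = trainings_needed(old_good_hits=10)   # 1,000 good messages in the corpus
large = trainings_needed(old_good_hits=40)   # 4,000 good messages in the corpus
print(small, large)
```

In this model the smaller corpus needs 11 corrections and the larger one 41, so the same change in spam content takes about four times as much training to register.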

Another consideration is that newer versions of SpamSieve can extract more information from the messages, but it can’t do this retroactively, so to get the full benefit you need to re-train it.

This sounds pretty normal to me. By definition, the spam that’s caught is given a score of 50 or higher. And it’s normal for false negatives (uncaught spam) to have scores below 50 but higher than most of the good messages, because while SpamSieve might not have thought that they were spam, it can tell that they are “less good.”
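To make the score ranges concrete, here is a small sketch of how a fixed cutoff of 50 sorts messages into caught spam, uncaught spam, and good mail. The names `SPAM_THRESHOLD` and `outcome` are illustrative only and are not part of SpamSieve.

```python
SPAM_THRESHOLD = 50  # scores of 50 or higher are treated as spam

def outcome(score, is_spam):
    """Label one filtering decision given the score and what the message really was."""
    predicted_spam = score >= SPAM_THRESHOLD
    if predicted_spam and not is_spam:
        return "false positive"   # good message caught as spam
    if not predicted_spam and is_spam:
        return "false negative"   # spam that reached the inbox
    return "correct"

print(outcome(2, is_spam=False))   # typical good message
print(outcome(25, is_spam=True))   # spam that slipped through
print(outcome(25, is_spam=False))  # mailing-list message with an elevated score
print(outcome(60, is_spam=True))   # caught spam
```

A spam message scoring 25 is a false negative in this picture, while a list message scoring 25 is still classified correctly, since both fall below the cutoff.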

SpamSieve’s spam threshold is fixed at 50; it is not possible to adjust it. You can, however, make SpamSieve more or less aggressive.

However, since your corpus is about a year old, it’s a bit large, and accuracy has dropped, I would instead recommend resetting/re-training. Of course, if you’re happy with 97% you can just leave it as-is.

ophiochos · February 20, 2007, 6:33am

fair enough
thanks for the full reply. I’ll start saving the spam to retrain it…

cortig · April 11, 2007, 8:08am

Yeah that’s always a tough question.
I also have a (much too) large corpus:
Filtered Mail
21,691 Good Messages
8,097 Spam Messages (27%)
24 Spam Messages Per Day

SpamSieve Accuracy
5 False Positives
344 False Negatives (99%)
98.8% Correct

Corpus
1,947 Good Messages
3,134 Spam Messages (62%)
206,318 Total Words

Rules
6,771 Blocklist Rules
11,540 Whitelist Rules

Showing Statistics Since
5/12/06 11:25

but I am hesitating to reset it and re-train since I get a rather high accuracy :-
I’m not sure it’d be worth asking SpamSieve to start training all over again.

Corentin

Michael_Tsai · April 11, 2007, 8:55am

I don’t think it will take long for it to surpass 98.8% when you re-train.

cortig · April 11, 2007, 10:16am

Yeah. I’ll give it a try. Thanks for the advice.

Corentin

bgalbs · June 14, 2007, 6:20am

Massive corpus
Wow, I’m very glad I discovered this thread; check out my stats:

Filtered Mail
26,730 Good Messages
315,621 Spam Messages (92%)
1,210 Spam Messages Per Day

SpamSieve Accuracy
119 False Positives
7,538 False Negatives (98%)
97.8% Correct

Corpus
4,992 Good Messages
13,096 Spam Messages (72%)
349,755 Total Words

Rules
7,743 Blocklist Rules
10,702 Whitelist Rules

Showing Statistics Since
9/26/06 1:36 PM

I’ve been suspecting that SS was slowing my system down considerably, and since removing the Apple Mail bundle, I’ve confirmed that it was – and now I understand why. If some of the previous corpuses (corpis?) in this thread were considered big, I guess mine is simply massive.

Falling back to Apple Mail’s built-in junk mail filtering resulted in fast Mail again, but it’s just too frickin’ inaccurate. I’ve thought about just using GMail as a spam filter, but I prefer the flexibility of client-side filtering.

I’m looking forward to how SS performs with a re-train…

Michael_Tsai · June 14, 2007, 6:26am

You might also want to go into the Blocklist and Whitelist windows and delete the rules with 0 hits. Please note, however, that a large corpus doesn’t generally slow SpamSieve down much, except when launching. Retraining should improve the accuracy, but if it doesn’t improve the speed please contact spamsieve@c-command.com and I’ll look into what might be causing the slowdown.

bgalbs · June 14, 2007, 7:03am

Speed, and issue with retraining
Michael,

Thanks for the tip – I’ve deleted the 0-hit list rules. As I retrained SpamSieve, trying to keep it under 1000 and 65% spam, I noticed a few things:

My Spam folder contains 140,000 pieces of mail. WOW. And my Envelope file is 105 MB. When I trained spam in the Spam folder, it took forever (I had the stats window in SS open and watched the stats update). But when I moved 650 spams to a temporary folder and trained it in that folder, it was very quick.
At some point, the spam stats stopped updating. I trained ~70 emails but the Spam Messages stat didn’t ever display them (though in Mail they were marked as spam). Not sure why this happened.

My complaint with SS speed was on start-up, so I’m glad it sounds like this is going to result in a performance gain there, and it sounds like by nuking the spam folder, reconfiguring Mail to move SS to Junk and have Mail delete Junk monthly, and rebuilding the Envelope DB I can resolve the rest of my Apple Mail speed issues.

Sweet!
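The retraining targets mentioned above (no more than 1,000 messages, about 65% spam, using recent mail) can be sketched as a small selection helper. This is an illustration only: `pick_training_set` is a hypothetical name, and it assumes each list of message identifiers is ordered oldest first.

```python
def pick_training_set(spam, good, total=1000, spam_share=0.65):
    """Pick the most recent messages for an initial training set.

    ASSUMPTION: `spam` and `good` are lists ordered oldest first.
    Returns (spam_subset, good_subset) with at most `total` messages combined.
    """
    n_spam = min(len(spam), round(total * spam_share))
    n_good = min(len(good), total - n_spam)
    # Guard against a count of zero, because lst[-0:] would return the whole list
    recent_spam = spam[-n_spam:] if n_spam else []
    recent_good = good[-n_good:] if n_good else []
    return recent_spam, recent_good

# Stand-in identifiers for a 140,000-message Spam folder and 5,000 good messages
spam_ids = list(range(140000))
good_ids = list(range(5000))
train_spam, train_good = pick_training_set(spam_ids, good_ids)
print(len(train_spam), len(train_good))
```

With the default settings this selects 650 spam and 350 good messages. Copying such a subset to a temporary folder before training also avoids working inside a very large mailbox.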

Michael_Tsai · June 14, 2007, 8:40am

Are the numbers in SpamSieve’s Statistics window correct if you quit and re-launch it? If not, SpamSieve’s History.db file may be damaged.

Yes, Mail will be much faster if you delete (or archive) the messages that you don’t need and if you delete the Envelope Index so that Mail can build a fresh, optimized database the next time you launch it.