Not catching all spam that it used to --

Randy_Shulman · January 5, 2008, 2:09pm

I use SpamSieve with Mail and Leopard. In the past few weeks it’s been catching most of my spam but still leaving a large portion of it in my inbox. I have tried resetting the corpus and retraining, and even adjusting the sensitivity in preferences, but am still ending up with mounds of spam in my inbox. It used to work flawlessly, but now I’m not sure how to get it back to near-perfect spam catching! Help! (I should note that spam sent to our email address at metroweekly.com seems to have increased by 1,000 or so a day over the past week or so.) Thanks.

Michael_Tsai · January 5, 2008, 2:11pm

Before trying such drastic measures, you should see whether SpamSieve is predicting these spam messages to be good. If so many are getting into the inbox, there may simply be a setup problem. Probably 95% of the reports I receive about “SpamSieve not catching spam” have nothing to do with SpamSieve’s accuracy; rather, the problem ended up being that the mail program was not asking SpamSieve to look at those messages.
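As a rough illustration of that check, here is a minimal Python sketch. It assumes the log has been exported as plain text with entries shaped like the "Predicted: Good (45)" / "Subject: ..." pair quoted later in this thread; the function name `diagnose` and the sample data are hypothetical and not part of SpamSieve.

```python
import re

# Assumed entry shape: a "Predicted:" line immediately followed by a "Subject:" line.
ENTRY = re.compile(r"^Predicted: (\w+) \((\d+)\)\nSubject: (.*)$", re.MULTILINE)

def diagnose(log_text, missed_subjects):
    """Say, for each spam that reached the inbox, what the log recorded about it."""
    predictions = {}
    for label, _score, subject in ENTRY.findall(log_text):
        predictions[subject.strip()] = label.lower()

    result = {}
    for subject in missed_subjects:
        label = predictions.get(subject)
        if label is None:
            # No log entry: the mail program probably never asked for a prediction.
            result[subject] = "not seen"
        else:
            # Logged: "predicted good" here points to an accuracy problem.
            result[subject] = "predicted " + label
    return result

sample_log = (
    "Predicted: Good (45)\n"
    "Subject: JANUARY 80 % 0FF!\n"
    "From: Rachel.Rogers@unicycling.org\n"
)
report = diagnose(sample_log, ["JANUARY 80 % 0FF!", "A spam that was never logged"])
print(report)
```

A subject reported as "not seen" suggests a setup problem, while "predicted good" suggests the filter itself misjudged the message.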

mrtoner · January 5, 2008, 2:38pm

Garbled Logs
I’m going back in my logs to see why SpamSieve is miscategorizing so much spam, but every log except the current log looks like:

===== Use File->Reload (Cmd-R) to display more. ====
_Ã<¸∏ÁgÍdÚ+“m?Kx3Ø√„ªPìGÇÇ–dΩA

(I’d show a little more than this, but the forum software is cutting off the post.)

mrtoner · January 5, 2008, 2:50pm

An example of a message that is clearly spam:

=====================================================================
Predicted: Good (45)
Subject: JANUARY 80 % 0FF!
From: Rachel.Rogers@unicycling.org
Identifier: thz6w70HRgOZ1CBkZaPcaw==
Reason: P(spam)=0.876[0.910], bias=0.000, R:^84^135(0.998), S:80(0.855), R:^84(0.764), 0100(0.261), to:@donmorris.com(0.739), to:don@(0.692), ^i-semi(0.314), jan(0.326), S:JANUARY(0.658), S:0FF!(0.658), R:^partners3^100mwh^com(0.647), R:^100mwh^com(0.643), 2008(0.383), sat(0.387)
Date: 2008-01-05 14:36:22 -0800

Trained: Good (Auto)
Subject: JANUARY 80 % 0FF!
Identifier: thz6w70HRgOZ1CBkZaPcaw==
Actions: added rule <From (address) Is Equal to "Rachel.Rogers@unicycling.org"> to SpamSieve whitelist, added rule <From (name) Is Equal to “Rogers, Rachel”> to SpamSieve whitelist, added to Good corpus (10770)
Date: 2008-01-05 14:36:22 -0800

Does that offer any clues as to why SS is marking it Good?

Michael_Tsai · January 5, 2008, 7:08pm

The old log files are compressed and cannot be directly viewed in Console.

This indicates that SpamSieve’s Bayesian classifier thought the message was good (barely), based on the occurrences of the words it contained. Since your corpus has 10,770 good messages (more than 10 times as many as recommended), you’d probably see better accuracy by resetting it.
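To see which words pulled the message toward good, the "Reason" line can be split into its tokens. The sketch below is only an illustration: it assumes each parenthesized number is that token's spam probability (above 0.5 leaning spam, below 0.5 leaning good), which is a reading of the log format rather than something confirmed here, and `parse_reason` is a hypothetical helper.

```python
import re

# "Reason" line copied from the log entry quoted above.
REASON = (
    "P(spam)=0.876[0.910], bias=0.000, R:^84^135(0.998), S:80(0.855), "
    "R:^84(0.764), 0100(0.261), to:@donmorris.com(0.739), to:don@(0.692), "
    "^i-semi(0.314), jan(0.326), S:JANUARY(0.658), S:0FF!(0.658), "
    "R:^partners3^100mwh^com(0.647), R:^100mwh^com(0.643), 2008(0.383), sat(0.387)"
)

def parse_reason(reason):
    """Return (token, number) pairs for every 'token(0.123)' item in the line."""
    tokens = []
    for part in reason.split(", "):
        match = re.fullmatch(r"(.+)\((\d\.\d+)\)", part)
        if match:  # skips the leading "P(spam)=..." and "bias=..." items
            tokens.append((match.group(1), float(match.group(2))))
    return tokens

tokens = parse_reason(REASON)
# Assumption: values above 0.5 lean spam, values below 0.5 lean good.
spam_leaning = [name for name, p in tokens if p > 0.5]
good_leaning = [name for name, p in tokens if p < 0.5]
print("leaning spam:", spam_leaning)
print("leaning good:", good_leaning)
```

Under that assumption, the good-leaning tokens in this entry are generic ones such as the year and the day of the week, which is consistent with a corpus holding a very large number of good messages.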

MarcDo · January 6, 2008, 5:37am

Not catching all it used to —
Hi Michael,

I’ve tried to figure out what isn’t working because I am having a similar problem. I reviewed the help files, but they and my brain don’t seem to speak the same language.

I thought my set-up was pretty good, but I’m getting more and more false negatives every day. The corpus seems pretty good and the percentages are pretty good as well. The program worked very well until a few weeks ago.

I’m uploading a comparison of my SpamSieve statistics. They show the overall statistics since I began and the statistics for the last month (five weeks). The trend is getting worse.

I originally thought spammers were getting more sophisticated but when I saw the title of this thread I couldn’t help but add my comments. I have recommended your software a number of times because it surpassed my expectations. Until 6 weeks ago. Help.

marc

Michael_Tsai · January 6, 2008, 8:27am

I can’t tell what’s happening from the screenshots, just that the accuracy has gone down. Whenever you’re having a problem with SpamSieve’s filtering, please send a full report so that I can see exactly what it’s doing.

Randy_Shulman · January 6, 2008, 6:26pm

Still not working right –
I love the program and it used to work flawlessly, so I’m not sure how to fix it. What do you need to see to help determine what’s going wrong?

Michael_Tsai · January 6, 2008, 6:46pm

This page tells how to send in the log file and screenshots.

patchelect · January 8, 2008, 12:29pm

I’ve noticed two things happening with spam in the last few days. One is that a large amount of it was getting past SpamSieve, and the other is that there’s a large amount of it, period.

I found that training the offending emails as spam resolved the first problem, but the amount of spam getting past my Verizon spam filter is probably two or three times what it used to be. (The Verizon filter isn’t the greatest, but between it and SpamSieve I get almost zero spam in my inbox.)

I’ve also made a point of retraining at the Verizon level and it’s already made a difference.

Perhaps the stars were in the right (wrong) alignment for spammers lately?

F451 · January 8, 2008, 2:38pm

SpamSieve Lagging Upon Opening
I am having the same issues. Also, I have noticed that when I first start Apple Mail, SpamSieve sometimes lags significantly instead of opening immediately, which allows spam to bypass it; this has become the norm rather than the exception. I still love SpamSieve, but hope that these minor issues get resolved. Thanks.

G5 PowerPC Dual 2GHz
Mac OS X 10.5.1 Build 9B18
Apple Mail Version 3.1 (914/915)

Michael_Tsai · January 8, 2008, 7:54pm

Then you, too, should send in a report if you can’t figure it out. So far, no one here’s done that, so I’m assuming they fixed their setups.

Sounds like you’re running into this Leopard problem. It will be addressed in the next release.

F451 · January 8, 2008, 7:57pm

No crash log as it is a lag on first start. Thanks for the link as I look forward to the next release. Until then, I’ll just have to wait.

mrtoner · January 12, 2008, 6:12pm

I’m pretty sure you meant “thought the message was Good,” since SpamSieve reported “Predicted: Good (45).” In any case, why is 10,000+ Good messages a bad thing, if they only account for 65% of the entire corpus? And if it is a bad thing, why does SpamSieve put that many messages in it (I started with 1,000 total, spam and good)?

Michael_Tsai · January 12, 2008, 6:41pm

Correct.

Because spam (and perhaps your good mail) changes. By the time you’ve accumulated 10,000 messages in the corpus, they’re probably not that similar to the current spams. Also, when there are so many messages in the corpus, SpamSieve will be slower to learn because any message that you train it with will be a much smaller fraction of the total.
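The "smaller fraction" point is simple arithmetic. The sketch below compares a corpus of the recommended 1,000 messages with one of 10,000; `training_weight` is a hypothetical helper, and treating a message's influence as its plain share of the corpus is a simplification, not a description of how SpamSieve actually weights training.

```python
def training_weight(corpus_size):
    """Share of the corpus that one newly trained message represents."""
    return 1 / (corpus_size + 1)

for size in (1_000, 10_000):
    share = training_weight(size)
    print(f"{size:>6} messages in corpus: one new message is {share:.4%} of the total")
```

On this simplified view, each correction counts for roughly ten times less in the larger corpus, so it takes many more corrections to shift the statistics toward current spam.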

I don’t know precisely what happened in your case. Messages get added to the corpus in two ways: when you train SpamSieve (e.g. the initial training or to correct a mistake) and auto-training. In older versions of SpamSieve, the auto-training was aggressive and it could lead to very large corpora. More recently, it’s smarter about only auto-training SpamSieve with messages that are particularly interesting or when the good:spam ratio is off.

mrtoner · January 12, 2008, 10:30pm

Thanks, Michael – makes sense. I’ve reset the corpus and expect that SpamSieve will soon be back to filtering with the accuracy it once did.

User_Name · January 20, 2008, 2:55pm

Doesn’t resetting the corpus cause you to lose a lot of hard-earned work, in that it has to start over again?

My corpus has 3691 messages 103,452 words…

I’ve noticed that in the last 10 days or so, 2-3 will get by SS, whereas nothing did until recently.

Michael_Tsai · January 20, 2008, 5:44pm

Yes, although that’s not necessarily bad because it will be more accurate when trained with newer messages. Still, I recommend resetting only when the accuracy is poor and after you’ve checked the log to make sure it’s set up properly.

Michael_Tsai · January 20, 2008, 8:04pm

After looking at Randy’s log file, the main problem was that after resetting the corpus he re-trained it with about 14,000 messages, instead of the recommended 1,000. Thus, he ended up pretty much back where he started, instead of with a fresh corpus tuned to recognize the latest spams.

The accuracy problems due to the huge corpus were made worse by long delays between when the messages were received and when he trained the spam messages in the inbox as spam. Re-training should make this much less of a problem (since it will reduce the overall error rate), but if you’re in a situation where you cannot correct SpamSieve’s mistakes promptly, you should turn off auto-training to minimize the effect of the mistakes.