Type of Spam That Almost Always Gets Through

Austin_Baze · August 20, 2024, 1:59pm

After 10 years of use and careful training, SpamSieve still has a 98.6% accuracy rate. But I think I have found its “spam kryptonite”. It appears that emails in grammatical English, with standard phrasing and conventional formatting, offering my business “credit, lines of credit, financing, or loans” all get through with a score of 27 (or 5, like this one) and are almost always “Predicted Good”.
I have trained a few dozen of these, but one or two still get through every week. Not a big problem, just an observation. I did think that repeated training would somehow reveal a pattern in these solicitation emails that annoy me, though.

Michael_Tsai · August 20, 2024, 3:54pm

That’s a low accuracy rate for SpamSieve in general, and if you’ve been training for 10 years there’s likely a lot of information that’s now out-of-date. It would probably help to choose Reset Corpus from the File menu, then re-train SpamSieve with a smaller number of recent messages.

Austin_Baze · August 20, 2024, 6:21pm

It’s been right about 98.6% for years. No doubt old stuff in corpus, but I guess I can retrain. I will need to save up some spam for a few days to get to the right training level percentage of 65/35 as I tend to trash spam fast, and had been set to auto-delete any spam I train manually.

Austin_Baze · August 22, 2024, 2:41pm

Well, I took the plunge. I had treated my 10-year-old corpus like delicate sourdough starter, and the thought of resetting/discarding it was scary. But I did it. Retrained with 65 spam and 35 good messages and I am on my way again! New stats show 99.1% accuracy.

countryman · August 27, 2024, 4:25am

Ah, good old Bayesian. A perfect example of the impact of prior odds (as I now realise).

Austin_Baze · September 9, 2024, 11:00pm

I spoke too soon. Accuracy is falling steadily, now down to 96%. Obvious spam is being labeled Good, and False Negatives are way up since I started with a new corpus and did an initial training with 100 current messages (65/35% good vs. spam). Maybe this will get better, but it is currently worse. Many that are obviously spam are scoring “27” Predicted Good, and several are scoring “0” as well. My slider is 5 marks from the right, slightly toward “Aggressive”.

Michael_Tsai · September 10, 2024, 12:21am

Zero means that either you trained that exact message as good or that the sender is in your Contacts. If you use the Save Diagnostic Report command in the Help menu and send me the report file, as described here, I can investigate further.

Austin_Baze · September 10, 2024, 3:26am

Sent, thanks (along with a couple of log screenshots showing what seem to be errors).

Michael_Tsai · September 10, 2024, 12:44pm

Thanks. It looks like part of the problem is that when you retrained you used 65% good messages instead of 65% spams. That’s OK but will make the filtering less aggressive. Over time, SpamSieve will correct the ratio automatically.

Some of the ones that have score zero did not actually get that score. Rather, the log is showing that because SpamSieve processed the same message twice—an artifact of SpamSieve’s workaround for a bug in Apple Mail.
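
To illustrate why the training ratio matters, here is a minimal sketch of Bayes' rule in odds form. It is a toy model, not SpamSieve's actual algorithm: the function name `spam_posterior`, the likelihood ratio of 1.5, and the 0.5 cutoff are all assumptions made up for the example. It only shows how the same evidence can land on either side of a cutoff depending on the spam/good mix the filter was trained on.

```python
def spam_posterior(likelihood_ratio: float, prior_spam: float) -> float:
    """Return P(spam | message) using Bayes' rule in odds form.

    likelihood_ratio: P(words | spam) / P(words | good), hypothetical evidence.
    prior_spam: fraction of the training corpus that is spam.
    """
    prior_odds = prior_spam / (1.0 - prior_spam)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Assumed evidence: the message's words are 1.5x as likely in spam as in good mail.
LIKELIHOOD_RATIO = 1.5
CUTOFF = 0.5  # hypothetical decision threshold

for label, prior in (("65% spam corpus", 0.65), ("35% spam corpus", 0.35)):
    p = spam_posterior(LIKELIHOOD_RATIO, prior)
    verdict = "spam" if p > CUTOFF else "good"
    print(f"{label}: P(spam) = {p:.2f} -> predicted {verdict}")
```

With these made-up numbers, the 65% spam corpus gives a posterior of about 0.74 (predicted spam), while the swapped 35% spam corpus gives about 0.45 (predicted good) for the identical message. That is consistent with a corpus weighted toward good messages making the filtering less aggressive.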

Austin_Baze · September 10, 2024, 2:21pm

Looking for dummy emoji.
I very carefully let spam messages and good arrive and be stored up before I began the training and literally used 65 messages and 35 messages to do a new initial train to make sure I got the percentage right. And COMPLETELY swapped the two ratios. User error. Can I reset and do a retrain again?
Would turning SpamSieve off for a couple of days then turning it on once I have 65% spam messages and doing a new training work?

Michael_Tsai · September 10, 2024, 3:15pm

Yes.

I don’t see any benefit to turning it off while you’re saving up spams.