Is it ever worth resetting the corpus (due to size of previous corpus)?

sparker · August 31, 2016, 7:49pm

I’ve been a LONG time user (since 2008) and my corpus appears to be getting fairly large (see attached PNG), and I’m ending up with lots of “Mail thinks this message is Junk” messages in my inbox (not moved to the Spam folder). I read somewhere not to fix them with the Apple buttons, but to use the “Train as Good” button for SpamSieve. I’ve been doing this for a couple of months, and it does fix the brown “Mail thinks this is junk” bar, but I keep seeing similar messages appear as “junk”.

I was wondering if I should reset the corpus and start retraining SpamSieve from scratch.

Thoughts or Opinions are welcome.

Steve

Screen Shot 2016-08-31 at 8.37.38 PM.png

Michael_Tsai · September 1, 2016, 7:52am

Yes, you should basically ignore everything having to do with “Junk” in Mail: the icon, brown bar, and the marking commands. What matters is whether the message is in your inbox or Spam mailbox, and then you can correct using the SpamSieve training commands if necessary.

I don’t have enough information at the moment to know whether resetting the corpus is necessary or would help at all in this situation. If you can send me your SpamSieve log file, I’ll take a closer look.

sparker · September 3, 2016, 7:12pm

Sent… And thanks for looking into this!

Steve

Michael_Tsai · September 5, 2016, 7:19am

Thanks.

The first thing I noticed is that you have a bunch of rules in Mail (besides the SpamSieve one) to catch spam messages. I normally recommend not doing this; however, I think in your case it’s probably OK because they are above the SpamSieve rule. That said, it looks like those rules change the message background colors, and you are using the setup described in the SpamSieve Manual under “Spam Message Colors in Apple Mail”, so there could potentially be a conflict. Since you actually have the SpamSieve rules set up to put all the spam in the same place, I recommend returning to the standard setup of only having one SpamSieve rule. The normal rule is not affected by other coloring rules.

Of the recent messages that you trained as spam:

1. Some were not processed by SpamSieve at all. So this has nothing to do with the corpus. It’s likely due to a user error (e.g. marking the message as read on your iPhone) or setup problem.
2. There were a bunch of marketing messages (from Slickdeals, Home Depot, TigerDirect, StackCommerce) that look to me like they were not actually spam. And you had previously treated messages from those senders as good. Because of this contradictory training, I think you would eventually benefit from resetting the corpus. However, I think it’s important to first figure out the issue from #1 as well as the good messages issue (below).
3. There were a bunch of spams that SpamSieve had missed.

I also saw lots of messages (for example, one from United) that you trained as good even though SpamSieve had already predicted them to be good. Did you do this on purpose? Or did you train them as good because they were in your Spam mailbox (e.g. because of another rule)?