C-Command Software Forum

Corpus too large

Been using this fine product since Jan 04. For grins I looked at a Training Tip and I’m told the corpus is too large (I’m told that’s OK if accuracy is OK). Since I began using the product, the stats have been pretty good at 99.1%, but in the last few months a lot of spam email from .Mac has been coming in and needing to be selected and taught to SpamSieve. I’m wondering if I should take the advice and pare it down or not. Stats are below.

Filtered Mail
64956 Good Messages
140997 Spam Messages (68%)
150 Spam Messages Per Day

SpamSieve Accuracy
458 False Positives
1382 False Negatives (75%)
99.1% Correct

19788 Good Messages
85270 Spam Messages (81%)
897400 Total Words

72105 Blocklist Rules
5585 Whitelist Rules

Showing Statistics Since
1/29/04 11:37 AM
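For anyone wanting to check these figures, the percentages can be recomputed from the raw counts. This is only a sketch: it assumes the 75% is false negatives as a share of all errors and that the 99.1% is measured over all filtered messages, which is my reading of the numbers and not something the statistics window states. The `percent` helper and the variable names are made up for illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

# Counts copied from the statistics posted above.
good, spam = 64956, 140997
false_pos, false_neg = 458, 1382
corpus_good, corpus_spam = 19788, 85270

filtered = good + spam          # all filtered messages
errors = false_pos + false_neg  # all mistakes, in either direction

# Assumed definitions; they reproduce the posted 68%, 75%, 99.1%, and 81%.
stats = {
    "spam_share": percent(spam, filtered),
    "false_negative_share": percent(false_neg, errors),
    "accuracy": percent(filtered - errors, filtered),
    "corpus_spam_share": percent(corpus_spam, corpus_good + corpus_spam),
}

for name, value in stats.items():
    print(f"{name}: {value:.1f}%")
```

Because these totals run from 1/29/04, a recent dip barely moves the overall percentage, so a lifetime figure of 99.1% says little about how the last few months have gone.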

Yes, if you’ve been using it for a long time and the recent accuracy is not as good as in the past, then it’s time to reset the corpus. Spam has changed since 2004, so all that old data is probably holding it back and making it slower to adapt. After resetting the corpus, re-train it with a smaller number of recent messages.

What exactly is ‘corpus’?

The corpus is a collection of messages, both spam and good, with which you have trained SpamSieve. SpamSieve uses the corpus to evaluate the contents of incoming messages to determine whether they’re spam. Please see: this page about the Show Corpus command and this page about training SpamSieve.
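To make the idea concrete, here is a toy sketch of a corpus as word counts from trained messages, scored with a generic naive Bayes rule. It is not SpamSieve's actual storage format or algorithm, neither of which is described in this thread; `ToyCorpus`, `train`, and `spam_score` are hypothetical names used only for illustration.

```python
import math
from collections import Counter

class ToyCorpus:
    """Hypothetical stand-in for a trained corpus: word counts per class."""

    def __init__(self):
        self.words = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        """Add one message to the corpus under 'spam' or 'good'."""
        self.words[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_score(self, text):
        """Log-odds that text is spam; positive means it looks like spam.

        Needs at least one trained message of each kind.
        """
        vocab = len(set(self.words["spam"]) | set(self.words["good"]))
        spam_total = sum(self.words["spam"].values())
        good_total = sum(self.words["good"].values())
        score = math.log(self.messages["spam"] / self.messages["good"])
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            p_spam = (self.words["spam"][word] + 1) / (spam_total + vocab)
            p_good = (self.words["good"][word] + 1) / (good_total + vocab)
            score += math.log(p_spam / p_good)
        return score

corpus = ToyCorpus()
corpus.train("cheap pills buy now", "spam")
corpus.train("buy cheap watches now", "spam")
corpus.train("meeting notes attached", "good")
corpus.train("lunch tomorrow at noon", "good")

print(corpus.spam_score("buy cheap pills") > 0)    # spam-like words dominate
print(corpus.spam_score("meeting tomorrow") < 0)   # good-mail words dominate
```

In a model like this, every trained message keeps contributing to the counts, which is one way to picture why years of old spam could slow adaptation to new spam.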

Just thought I’d point out that you really shouldn’t use an email address as your username; it’s bound to get harvested and then added to a spamming list.

Corpus too big

I’m receiving quite a lot of e-mails and my corpus gets too big very fast.
Is there no “automatic” option in SpamSieve to reduce it when needed?
It is a pity to lose everything each time you have to erase the corpus and restart the training and all the painful tasks…

How many e-mails do you receive per day, and how many do you have in the corpus right now? SpamSieve’s auto-training should prevent the corpus from growing too large very fast. So the corpus should grow mainly when you train SpamSieve with messages that it put in the wrong mailbox, which is probably not enough to make it grow too fast.

No, because there’s no simple way to reduce it that provides accuracy benefits. It’s better to start out with a reasonably sized corpus and then control the growth.