C-Command Software Forum

Out With The Old...

Once or twice a year I open my corpus, sort by Last Used date and delete the oldest entries. It seems to me that, if a word has not been encountered in over a year, it’s not really needed.
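The manual cleanup described above can be sketched in code. This is only an illustration on a made-up data structure: the `last_used` dictionary and the `prune_stale` helper are hypothetical and are not part of SpamSieve, whose corpus is edited through its own window.

```python
from datetime import date, timedelta


def prune_stale(last_used, today, max_age_days=365):
    """Return a copy of the mapping without words unused for over max_age_days.

    last_used is a hypothetical {word: date last encountered} mapping.
    """
    cutoff = today - timedelta(days=max_age_days)
    # Keep only the words seen on or after the cutoff date.
    return {word: seen for word, seen in last_used.items() if seen >= cutoff}


# Example with invented entries.
corpus = {
    "viagra": date(2009, 3, 1),
    "refinance": date(2008, 1, 1),
}
kept = prune_stale(corpus, today=date(2009, 5, 1))
```

With these invented dates, only the word last seen in March 2009 survives; the one from January 2008 is more than a year old and is dropped.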

Is there a better way to keep the corpus current?

Also, here are current stats:

Filtered Mail
167,106 Good Messages
6,607 Spam Messages (4%)
7 Spam Messages Per Day

SpamSieve Accuracy
185 False Positives
116 False Negatives (39%)
99.8% Correct

Corpus
8,221 Good Messages
7,622 Spam Messages (48%)
65,498 Total Words

Rules
5,046 Blocklist Rules
9,107 Whitelist Rules

Showing Statistics Since
10/15/06 10:34 AM
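For readers wondering how the percentages relate to the counts, here is a minimal sketch that reproduces them from the numbers above. The interpretation is an assumption inferred from the arithmetic, not taken from SpamSieve's documentation: the 39% appears to be false negatives as a share of all errors, and the 99.8% appears to be correctly classified messages as a share of all filtered mail.

```python
def pct(part, whole):
    """Percentage of whole represented by part, rounded to a whole number."""
    return round(100 * part / whole)


# Counts copied from the statistics above.
good, spam = 167_106, 6_607
false_pos, false_neg = 185, 116
corpus_good, corpus_spam = 8_221, 7_622

filtered = good + spam
errors = false_pos + false_neg

spam_share = pct(spam, filtered)                        # 4
fn_share = pct(false_neg, errors)                       # 39
accuracy = round(100 * (1 - errors / filtered), 1)      # 99.8
corpus_spam_share = pct(corpus_spam, corpus_good + corpus_spam)  # 48
```

Under that reading, all four figures match the ones shown in the statistics window.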

Given these statistics, is there any reason to adjust anything? The training instructions say to use a maximum of 1,000 messages and suggest 65% spam. Is there a recommended way to bring my corpus into compliance with those guidelines, or do they only apply to initial training?

Should I just leave things as they are?

On a better way to keep the corpus current: probably not, although there’s also probably little benefit in deleting the old entries.

On whether anything needs adjusting: I don’t think so. Maybe clean out some rules with zero hits if you feel like it.

Those guidelines apply only to the initial training.