moving and training are different?

paulingraham · November 26, 2007, 7:57am

Could someone please confirm a basic concept for me?

With auto-training off … just because SpamSieve’s rule is active and is moving messages into a spam folder doesn’t mean that it’s being trained. Right? SpamSieve will move suspected spam without actually adding anything to the corpus? To train, to add to the corpus, I have to select messages and “train as good/spam”?

Do I have this straight?

Thanks in advance.

Michael_Tsai · November 26, 2007, 8:23am

Correct. Generally you should only train it with messages that it didn’t automatically move to the correct mailbox.

paulingraham · November 26, 2007, 8:50am

Thanks for the rapid response, Michael. Obviously you make a point of being a presence on your forums. I wish all developers were so conscientious!

You surprised me with your reply, though. While I figured you (or someone) would confirm the basic concept I was asking about, I have been training it with all messages, because I assumed that if it wasn’t set to train automatically, then I needed to manually say, “Yes, indeedy, all these spammy messages in the Spam Folder are indeed spam.”

So let me do some more confirming …

I should only “Train as Spam” when I’ve got spam that SpamSieve failed to move to my spam folder?

I should only “Train as Good” when I’ve got good mail that SpamSieve incorrectly moved to the spam folder?

Shouldn’t I “Train as Good” to improve the ratio of good mail to spam in my corpus? Right now my corpus only has about 14% good mail to work with, but it will take quite a long time to increase that if I only “Train as Good” when I get a false positive!

Given that I was training one way or the other with every message … um, should I start over?

That’s a lot of questions, but hopefully easy to answer! Thanks again.

Michael_Tsai · November 26, 2007, 9:05am

In all cases you should start out by doing an initial training with the recommended numbers of spam and good messages. After that, I recommend training SpamSieve with any messages that it didn’t classify correctly, and only with those messages.

Additionally, I recommend that most people use auto-training (it’s on by default), although there are a few cases where you wouldn’t want to do that.

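Correct on both counts: train as spam only the spam that SpamSieve missed, and train as good only the good mail that it wrongly moved to the spam folder.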

If you follow the guidelines above, you probably won’t have to worry about the ratio. You’d start out with the proper ratio, and SpamSieve’s auto-training would maintain it. If you had auto-training off and were only training it with mistakes, there wouldn’t be many of them, so the ratio would stay pretty much the same.

Yes, since you trained it with every message, I’d start over with a new initial training.
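To make the point about the ratio concrete, here is a rough arithmetic sketch in Python. It is not how SpamSieve works internally; the corpus sizes, the incoming mail volumes, the 1% error rate, and the function names are all hypothetical, chosen only to show why training on mistakes alone barely moves the ratio while training on every message pulls it toward the mix of incoming mail.

```python
def good_fraction(good, spam):
    """Fraction of the corpus that is good mail."""
    return good / (good + spam)


def train_on_mistakes(good, spam, incoming_good, incoming_spam, error_rate):
    """Add only the misclassified messages to the corpus."""
    return (good + round(incoming_good * error_rate),
            spam + round(incoming_spam * error_rate))


def train_on_everything(good, spam, incoming_good, incoming_spam):
    """Add every incoming message to the corpus."""
    return good + incoming_good, spam + incoming_spam


# Hypothetical numbers, for illustration only.
start_good, start_spam = 350, 650   # corpus after initial training
new_good, new_spam = 200, 1800      # mail that arrives afterwards
error_rate = 0.01                   # assumed share of misclassified mail

mistakes_only = train_on_mistakes(start_good, start_spam,
                                  new_good, new_spam, error_rate)
everything = train_on_everything(start_good, start_spam, new_good, new_spam)

print(f"start:         {good_fraction(start_good, start_spam):.1%} good")
print(f"mistakes only: {good_fraction(*mistakes_only):.1%} good")
print(f"everything:    {good_fraction(*everything):.1%} good")
```

With these made-up numbers, the share of good mail goes from 35.0% to about 34.5% when only mistakes are trained, but falls to about 18.3% when every message is trained.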

paulingraham · November 26, 2007, 9:17am

I understand. Thank you very much for clarifying.

For the record (and possibly other confused users), this is how I got confused and onto the wrong track.

You see, I had a Mail.app kerblooie just a couple days ago and decided I needed to start fresh with a completely new mail folder (which did indeed help my many Mail problems).

But that also meant that I didn’t have a good range of messages to train SpamSieve with. I already had an adequate supply of weekend spam… but I don’t get all that much legit mail on the weekend. Getting the right ratio would have involved a very small sample of data. I decided I’d try to manually manage the ratio as new mail came in, and … well, got kinda confused.

However, good mail is starting to come in now that it’s Monday, so I should have an adequate sample to do initial training with soon.

Thanks again.