C-Command Software Forum

Format rules for corpus words?

I’ve noticed that the words in the corpus are formatted with special characters. I’ve figured out that “^” is a replacement for “.” (probably since the latter means “any character” in a regular expression), but I haven’t been able to figure out the rest (e.g. starting with “U:” or “S:”), nor find documentation on this.

What are the special formatting rules for the corpus words? (Adding this to the FAQ, or to the main documentation, would be great…)

Some of the tokens in the corpus are ordinary words. The others use special characters to represent various features of the message. The precise meanings are an internal implementation detail and subject to change at any time. “^” means that the word didn’t literally appear in the message; rather, SpamSieve calculated the property. If there’s a colon, it means the word occurred in a particular location, e.g. “S:” means in the subject and “U:” means in a URL.
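For anyone who wants to sort through copied corpus words by hand, here is a rough Python sketch of the two rules described above. It is not SpamSieve code: the function name `describe_token` and the `KNOWN_LOCATIONS` table are made up for illustration, only the “S:” and “U:” meanings come from this thread, and the assumption that a prefix is one or two uppercase letters before the first colon is a guess based on the examples people have posted. Since the format is internal and may change, treat the output as a hint at best.

```python
# Hypothetical helper, not part of SpamSieve. Only "S" and "U" are
# confirmed in this thread; every other prefix is reported as unknown.
KNOWN_LOCATIONS = {"S": "subject", "U": "URL"}


def describe_token(token: str) -> dict:
    """Split a corpus token into an optional location prefix and a word."""
    # Per the thread, "^" marks a property SpamSieve calculated rather
    # than a word that literally appeared. Where exactly the "^" sits in
    # the token is not documented, so we only test for its presence.
    computed = "^" in token

    prefix, sep, rest = token.partition(":")
    # Assumption: a location prefix is 1-2 uppercase ASCII letters.
    # This keeps ordinary words such as "http://..." from being misread.
    looks_like_prefix = (
        bool(sep)
        and prefix.isascii()
        and prefix.isalpha()
        and prefix.isupper()
        and len(prefix) <= 2
    )
    if looks_like_prefix:
        return {
            "prefix": prefix,
            "location": KNOWN_LOCATIONS.get(prefix, "unknown"),
            "word": rest,
            "computed": computed,
        }
    return {"prefix": None, "location": None, "word": token, "computed": computed}


if __name__ == "__main__":
    for sample in ["S:hello", "U:example", "XM:foo", "hello"]:
        print(sample, describe_token(sample))
```

Anything the sketch reports as “unknown” is simply a prefix nobody has explained here yet.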

… and I’m guessing a prefix of “A:” means in an HTML “<a…>” tag. However, what are “F:”, “H:”, “I:”, “L:”, “R:”, “T:” and “Y:”?

And, some two-letter prefixes that may not be special: “LS:”, “MI:”, “MT:”, “RP:”, “RT:”, “TL:”, “XM:”. (Are they special?)

On the larger issue, you make the corpus words visible, and even let people edit the spam/good values (although not the words themselves). I’ll suggest that documenting this, even if it is version-dependent, would help the user understand SpamSieve better, and thus use it better.

There also seem to be invisible control characters in front of some of the words (I saw them by copying into BBEdit). I’d suggest making these visible to the user.
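In the meantime, a small Python sketch like the following can show which invisible characters are in a word copied out of the corpus window. The function name `reveal_invisible` is my own, and I am assuming the hidden characters fall into the Unicode “control” (Cc) or “format” (Cf) categories; I don’t know which characters SpamSieve actually uses or what they mean.

```python
import unicodedata


def reveal_invisible(text: str) -> str:
    """Replace control/format characters with a visible <U+XXXX> marker."""
    shown = []
    for ch in text:
        # Cc = control characters, Cf = format characters (e.g. zero-width
        # space). Both normally render as nothing in a text view.
        if unicodedata.category(ch) in {"Cc", "Cf"}:
            shown.append(f"<U+{ord(ch):04X}>")
        else:
            shown.append(ch)
    return "".join(shown)


if __name__ == "__main__":
    # Paste a copied corpus word between the quotes to inspect it.
    print(reveal_invisible("\x01S:hello"))
```

This only makes the characters visible; it says nothing about why they are there.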