C-Command Software Forum

Spam in non-Latin characters

For some reason, I've started receiving a lot of Greek-language real estate advertisements, and these don't get filtered. Is there a way to tune SpamSieve for this purpose?

Also, some spammers still manage to get through repeatedly. One is Microword, although I think it has just about run out of ways to fool SS. Today it got filtered right away.


It should happen automatically, as you train SpamSieve, if your setup is correct.

Michael, here’s a typical case:

Trained: Spam (Manual)
Subject: Τα νέα της εβδομάδας
Identifier: 2vMgbun21lL1I/4205gxqA==
Actions: added to Spam corpus (9131)
Date: 2009-06-01 11:52:04 -0700

Trained: Spam (Manual)
Subject: ΑΝΑΚΟΙΝΩΣΗ ΤΗΣ ΦΙΛΕΛΕΥΘΕΡΗΣ ΣΥΜΜΑΧΙΑΣ - Τηλεοπτικό σποτ της Φιλελεύθερης Συμμαχίας.
Identifier: Mymw8RoGROMdWgBknlSj6w==
Actions: added to Spam corpus (9133)
Date: 2009-06-01 15:29:08 -0700

It showed up after other Greek-character spam got manually tagged.

These are the rare but consistent exceptions to the filtering rule. My setup has been working reliably for most (95%-plus) Latin-character spam.

Does it mean that the items have to be identical, or nearly so, word for word?


If you’re going to report a specific example, please do so after reading the FAQ. The example is only relevant if SpamSieve thought the message was not spam, and the FAQ explains the recommended way to report such examples.

I do see that your corpus has about 10 times as many spam messages as normal, so that would make it a bit slower to respond to training.

No, just as they don’t have to be identical in English.
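To illustrate why messages need not match word for word, here is a minimal sketch of a token-based naive Bayes scorer. It is not SpamSieve's actual implementation; the class `TinyBayes`, the smoothing constants, and the sample messages are all assumptions made for illustration. The point is only that each word contributes evidence on its own, so a new Greek message that shares a few words with previously trained spam can score as spam.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # In Python 3, \w is Unicode-aware, so Greek words are split like Latin ones.
    return re.findall(r"\w+", text.lower())


class TinyBayes:
    """Hypothetical toy classifier; not SpamSieve's algorithm."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        # Start from the (smoothed) ratio of spam to good messages seen so far.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["good"] + 1))
        for tok in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from producing zero probabilities.
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_good = (self.counts["good"][tok] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        return 1 / (1 + math.exp(-log_odds))


# Made-up training data for the sketch.
bayes = TinyBayes()
bayes.train("Τα νέα της εβδομάδας: ακίνητα προς πώληση", "spam")
bayes.train("Ακίνητα προς πώληση στην Αθήνα", "spam")
bayes.train("Lunch on Friday?", "good")
bayes.train("Notes from the weekly meeting", "good")

# Neither test message is identical to a trained one.
greek_score = bayes.spam_probability("Νέα ακίνητα προς πώληση")
english_score = bayes.spam_probability("Friday meeting notes")
```

In this toy setup the unseen Greek message scores above 0.5 and the unseen English one below it, because the scores come from shared words rather than from exact matches.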

Okay, thanks. I don't want to belabor the situation, as overall I'm pleased with SpamSieve's performance, but FYI, I'm using Entourage (Office 2004). I tested my Rules and they work fine. I had no idea I had 10X the normal spam messages, but chalk that up to my making my living on the net.

One change I did make was to treat as Junk Mail anything sent to me from an email address ending in ".gr", since Entourage can't process the Greek characters in a Subject line. I get no legitimate email from Greece, so perhaps this will suffice. I'll report back. People getting Cyrillic or Mandarin spam might find it useful to set up a similar rule for messages from addresses with national TLDs like ".ru" and ".cn" (being careful to exempt actual correspondents).
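The logic of such a rule can be sketched as follows. This is not Entourage rule syntax or a SpamSieve feature, just a plain illustration; the names `BLOCKED_TLDS`, `KNOWN_CORRESPONDENTS`, and `is_junk_by_tld`, and the example addresses, are hypothetical.

```python
from email.utils import parseaddr

# Hypothetical settings: TLDs to treat as junk, and senders to always let through.
BLOCKED_TLDS = {"gr", "ru", "cn"}
KNOWN_CORRESPONDENTS = {"friend@example.gr"}


def is_junk_by_tld(from_header):
    """Return True if the sender's address ends in a blocked national TLD."""
    _, address = parseaddr(from_header)
    address = address.lower()
    # Exempt actual correspondents before looking at the TLD.
    if address in KNOWN_CORRESPONDENTS:
        return False
    domain = address.rpartition("@")[2]
    tld = domain.rpartition(".")[2]
    return tld in BLOCKED_TLDS


junk = is_junk_by_tld("Real Estate <ads@mail.example.gr>")
exempt = is_junk_by_tld("A Friend <friend@example.gr>")
other = is_junk_by_tld("someone@example.com")
```

Checking the exemption list first is the important design choice here: a blunt TLD match would otherwise catch legitimate mail from the same country.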

BTW, the only place I saw the requirement for how to report anomalies is in your initial comments to this forum topic. Is that what you meant by the FAQ?

Thanks. I’m a happy camper, Michael. SS has gotten better with time. I appreciate your prompt personal response, too.

Entourage's subject limitation should be irrelevant, because hopefully you won't be making these sorts of Entourage rules. If you want messages with certain characters in the subject to be classified as spam, you should do so using SpamSieve's blocklist. Likewise for e-mail address extensions.

SpamSieve will automatically exempt anyone in your address book or anyone you've received good messages from.

The page that I linked to twice in this thread explains the steps that I recommend people go through. When you get to the part about the log, it tells how to send in a report if necessary.