C-Command Software Forum

Russian spam

Is there any way for SpamSieve to spot the Cyrillic alphabet and exclude all the spam I get from Russia?

If the messages all use a certain character set (which you can see by looking at their headers), you could create a SpamSieve blocklist rule for that.

You could also create SpamSieve blocklist rules for particular Cyrillic characters.
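As a rough illustration of what matching on particular Cyrillic characters amounts to, here is a minimal Python sketch. It is not SpamSieve’s implementation: the function names are hypothetical, and it assumes that only the basic Unicode Cyrillic block (U+0400 to U+04FF) matters.

```python
def is_cyrillic(ch):
    """True if ch falls in the basic Unicode Cyrillic block (U+0400-U+04FF)."""
    return "\u0400" <= ch <= "\u04ff"


def cyrillic_fraction(text):
    """Fraction of the alphabetic characters in text that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_cyrillic(ch) for ch in letters) / len(letters)
```

A filter built this way would flag a message once cyrillic_fraction exceeds some threshold of your choosing; a single-character blocklist rule is the special case of checking for one specific letter.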

Of course, SpamSieve will automatically learn that Cyrillic words are spammy when you train it with those messages, so the above techniques shouldn’t be necessary.

Matching by character set doesn’t work
I, too, am having this problem, but matching by character set doesn’t work. Possibly that’s because the spam carries no separate character set header: instead, each header value (From, Subject, etc.) is prefixed with =?koi8-r?, and the actual contents are HTML-based.

I’ve even tried adding a rule in the corpus that reads “Body (any text part) Contains charset="koi8-r"”, but it hasn’t picked up any spam of this type, and the messages continue to come in. I would appreciate any pointers here.

The first couple of dozen lines of the raw spam look like this:

From yitzhak366yates@moebelheinrich.de Tue Feb 26 09:10:09 2008
Lines: 351
Return-Path: <yitzhak366yates@moebelheinrich.de>
X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on
<deleted for privacy>
X-Spam-Level: ****
X-Spam-Status: No, score=4.7 required=5.0 tests=EXTRA_MPART_TYPE,HTML_MESSAGE,
T_TVD_FW_GRAPHIC_ID1 autolearn=disabled version=3.2.3
X-Original-To: <deleted for privacy>
Delivered-To: <deleted for privacy>
Received: <deleted for privacy>
Received: from ppp-30.net-123.static.magiconline.fr (ppp-30.net-123.static.magiconline.fr [])
by <deleted for privacy>; Tue, 26 Feb 2008 09:09:58 -0500 (EST)
Message-ID: <000d01c87880$026e1b4a$33fd20af@cfvjb>
From: =?koi8-r?B?7eHr8+nt?= <yitzhak366yates@moebelheinrich.de>
To: <deleted for privacy>
Subject: =?koi8-r?B?y8/U1MXE1s7ZyiDQz9PFzM/LIMLJ2s7F0y3LzMHT08Eg?=
Date: Tue, 26 Feb 2008 12:20:07 +0000
MIME-Version: 1.0
Content-Type: multipart/related;
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.3138
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198
X-Virus-Status: No

This is a multi-part message in MIME format.

Content-Type: multipart/alternative;

Content-Type: text/plain;
Content-Transfer-Encoding: quoted-printable


=F0=D2=CF=C4=C1=D6=C1 =D4=C1=D5=CE=C8=C1=D5=D3=CF=D7 =C9 =
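The =?koi8-r?B?...?= values in the From and Subject headers above are RFC 2047 encoded words, which name their character set inline. A minimal sketch of reading that charset with Python’s standard email.header module follows. The helper names are hypothetical, and this only shows where the charset lives in the header, not how SpamSieve extracts it.

```python
from email.header import decode_header

# Subject header value copied from the raw message above.
SUBJECT = "=?koi8-r?B?y8/U1MXE1s7ZyiDQz9PFzM/LIMLJ2s7F0y3LzMHT08Eg?="


def header_charsets(value):
    """Return the charsets named in a header's RFC 2047 encoded words."""
    return [charset for _, charset in decode_header(value) if charset]


def decode_value(value):
    """Decode a header value to text, treating unlabeled bytes as ASCII."""
    pieces = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            data = data.decode(charset or "ascii", errors="replace")
        pieces.append(data)
    return "".join(pieces)


print(header_charsets(SUBJECT))  # ['koi8-r']
```

Calling decode_value(SUBJECT) returns the subject as Cyrillic text, which is what a rule matching particular Cyrillic characters would need to see.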

Please send me your log and false negative files.

Okay, I just sent you a copy of the spam (false negative) in my previous, edited post.

I noticed that when I used the “Train as Spam” plug-in in Mail on Tiger, all it did was add a rule for the “From” header, which is worthless because spammers rarely use the same “From” address twice.

Looking at this paste, I would expect an “Any Character Set” rule to work, because of:

Content-Type: text/plain; charset="koi8-r"

I would not expect a “Body (any text part)” rule to work, since the charset does not appear in the body.
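To illustrate the distinction: the charset is a parameter of the part’s Content-Type header, so it never shows up in the decoded body text. Here is a minimal sketch using Python’s standard email package. The sample message and its boundary string XYZ are made up for illustration (the paste omits the real boundaries), the function names are hypothetical, and none of this is SpamSieve’s own code.

```python
import email

# Hypothetical minimal message; only the charset and the quoted-printable
# bytes are taken from the paste in this thread.
RAW = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="XYZ"',
    "",
    "--XYZ",
    'Content-Type: text/plain; charset="koi8-r"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "=F0=D2=CF=C4=C1=D6=C1",
    "--XYZ--",
    "",
])


def declared_charsets(raw):
    """Collect the charset parameter of every MIME part that declares one."""
    msg = email.message_from_string(raw)
    found = {part.get_content_charset() for part in msg.walk()}
    return sorted(c for c in found if c)


def text_bodies(raw):
    """Decode each text/* part's body using its declared charset."""
    msg = email.message_from_string(raw)
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            data = part.get_payload(decode=True)
            charset = part.get_content_charset() or "ascii"
            bodies.append(data.decode(charset, errors="replace"))
    return bodies
```

Here declared_charsets(RAW) finds koi8-r in the part headers, while the strings returned by text_bodies(RAW) contain only the decoded Cyrillic text, so a search of the body for the charset name has nothing to match.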

To tell any more, I really do need you to send me the false negative file(s), as mentioned in my previous post.

Sometimes blocking by From is useful, other times not. It should also be adding the message to SpamSieve’s corpus.

Interestingly, with the rule of “Any Character Set is Equal to koi8-r” in the blocklist, the number of hits was 0 despite a couple of spams of this type getting through. However, when I changed it to “Any Character Set Contains koi8-r”, it spiked to 155 and the spams began to get filtered as expected!

Unfortunately, I didn’t save the false negative file. I’ve only just activated this in the preferences. If the filtering that seems to be working drops off, I’ll send you the contents of the false negative file(s). I have the “Train as Spam” scripts installed in my Mail.app on Tiger.

Cool. When a blocklist rule does match, the log will show the text that matched. So this should let you see what SpamSieve thinks the character set is, and why “contains” worked but “is equal to” didn’t.
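Purely as an illustration of how “contains” can match where “is equal to” does not, here is a small Python sketch. The candidate values are hypothetical guesses (stray quotes or whitespace around the name); the actual text SpamSieve matched is whatever the log shows, and the two helper functions are stand-ins, not SpamSieve’s matching code.

```python
def is_equal_to(value, pattern):
    """Exact comparison: any extra character in value makes it fail."""
    return value == pattern


def contains(value, pattern):
    """Substring comparison: extra characters around pattern are tolerated."""
    return pattern in value


# Hypothetical extracted values, NOT taken from a real SpamSieve log.
candidates = ["koi8-r", '"koi8-r"', "koi8-r "]

for value in candidates:
    print(repr(value), is_equal_to(value, "koi8-r"), contains(value, "koi8-r"))
```

Only the first candidate satisfies both tests; the other two pass “contains” alone, which is the pattern of behavior described above.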