PDF Spam Filtering

droog · July 13, 2007, 1:27pm

I (like most everyone else) have been hit with a lot of PDF spam recently. SpamSieve doesn’t seem to have the ability to filter this type of spam yet. Any plans to issue an update with this feature? Soon, please.

Michael_Tsai · July 13, 2007, 1:39pm

I don’t think this requires an update. SpamSieve is catching all the PDF spams for me. If it isn’t for you, there may be something else going on. Please check your setup and/or report the ones that are getting through.

sfraw · July 17, 2007, 4:10pm

No Luck Here
PDF Spam is filling my mailbox and training does not seem to help.
What specific settings might help?

Michael_Tsai · July 17, 2007, 4:28pm

No special settings are necessary, but you need to verify that the messages are actually getting through SpamSieve, and it may be necessary to reset the corpus. But before you reset the corpus, please send me a report.

sharkez · July 18, 2007, 7:09am

pdf spam and unusual “trained as good”
Michael,
I am getting lots of pdf spam too, both with the pdf opened and displayed and also as pdf attachments.

I (almost) always hit ctrl-apple-s and the message disappears, but it seems many are still being trained as “good” and added to my whitelist; at least that is what the log.log seems to say. I am enclosing it.

In fact I just had an attachment pdf hit my mailbox while typing this and saw it automatically added to my whitelist (the whitelist was open.) I’m going to remove many of these from the whitelist and see if I can add them to the black list…

Well, it looks like I can only remove them and not move them.

earlchr · July 18, 2007, 7:19am

Same here. I’ve tried training these types of files as SPAM, but for some reason, they seem to be adding themselves to my whitelist, such as the ones below.

=====================================================================
Predicted: Good (25)
Subject: Emailing: alert.pdf
From: cemil@b1b2.com
Identifier: QIiyNZ9bIdzKrMpYJdjKxQ==
Reason: P(spam)=0.000[0.457], bias=0.000, x-mimeole:ProducedByMicrosoftMimeOLEV6.00.2900.3138(0.001), MT:MSHTML6.00.2900.3132(0.001), XM:MicrosoftOutlookExpress6.00.2900.3138(0.001), S:Emailing(0.001), R:^32(0.998), R:^112(0.995), X:HELO-DYNAMIC-DIALIN(0.909), X:INVALID-TZ-GMT(0.889), H:X-MSMail-Priority(0.842), x-msmail-priority:Normal(0.842), handled(0.160), handled(0.160), attachments(0.170), attachments(0.170), viruses(0.191)
Date: 2007-07-18 09:37:15 -0400

Trained: Good (Auto)
Subject: Emailing: alert.pdf
Identifier: QIiyNZ9bIdzKrMpYJdjKxQ==
Actions: added rule <From (address) Is Equal to “cemil@b1b2.com”> to SpamSieve whitelist, added rule <From (name) Is Equal to “cemil novac”> to SpamSieve whitelist, added to Good corpus (1104)
Date: 2007-07-18 09:37:15 -0400

Predicted: Good (25)
Subject: Emailing: Invoice.pdf
From: LEBEDEVxwx@leadbetterteam.com
Identifier: 8UdhNEWDMBUpCOXrhuH2Jg==
Reason: P(spam)=0.000[0.454], bias=0.000, x-mimeole:ProducedByMicrosoftMimeOLEV6.00.2900.3138(0.001), MT:MSHTML6.00.2900.3132(0.001), S:Emailing(0.001), XM:MicrosoftOutlookExpress6.00.2900.3138(0.001), invoice.pdf(0.005), ^ih-944(0.005), ^iw-896(0.005), invoice.pdf(0.005), R:^ono^com(0.995), R:^dyn^user^ono^com(0.995), R:^user^ono^com(0.995), X:RCVD-NUMERIC-HELO(0.895), handled(0.143), handled(0.143), H:X-MSMail-Priority(0.841)
Date: 2007-07-18 09:47:15 -0400

Trained: Good (Auto)
Subject: Emailing: Report-41366.pdf
Identifier: 6AhTovHAh389njylTVuoNg==
Actions: added rule <From (address) Is Equal to “Litwiller@hurryupoffense.com”> to SpamSieve whitelist, added rule <From (name) Is Equal to “bev Litwiller”> to SpamSieve whitelist
Date: 2007-07-18 09:57:15 -0400

Michael_Tsai · July 18, 2007, 9:14am

As I said above, I expect SpamSieve to catch the PDF spams. If this isn’t happening for you, please send me a report so that I can see what’s happening on your Mac.

I did not receive an enclosure. In any case, it is normal for SpamSieve to auto-train the whitelist when it predicts that an incoming message is good. If it turns out that the message wasn’t good, when you use the “Train as Spam” command, SpamSieve will disable (uncheck) the rule that was created on the whitelist and add a rule to the blocklist.

After using “Train as Spam” the whitelist rule should be disabled, and a rule should have been added to the blocklist. This is actually better than if you had removed the rule from the whitelist, because the presence of the disabled rule prevents SpamSieve from ever auto-adding that rule to the whitelist again. In short, it should never be necessary to manually add or remove rules from the whitelist or blocklist. You only need to do that if you are creating your own custom rules.
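The whitelist behavior described above can be illustrated with a small sketch. This is not SpamSieve’s code; the `RuleList` class and the `train_as_spam` function are hypothetical names, and rules are simplified to bare sender addresses. It only shows why a disabled rule is more useful than a deleted one: the disabled entry stays in the list and blocks any later auto-add.

```python
from dataclasses import dataclass, field


@dataclass
class RuleList:
    """Hypothetical stand-in for a whitelist or blocklist keyed by sender address."""
    # address -> True (rule checked/enabled) or False (rule unchecked/disabled)
    rules: dict = field(default_factory=dict)

    def auto_add(self, address: str) -> bool:
        """Add an enabled rule unless one already exists, enabled or disabled."""
        if address in self.rules:
            return False  # an existing rule, even a disabled one, blocks re-adding
        self.rules[address] = True
        return True

    def disable(self, address: str) -> None:
        """Uncheck the rule but keep it in the list."""
        if address in self.rules:
            self.rules[address] = False

    def matches(self, address: str) -> bool:
        """Only enabled rules match."""
        return self.rules.get(address, False)


def train_as_spam(address: str, whitelist: RuleList, blocklist: RuleList) -> None:
    """Assumed effect of correcting a message that was auto-trained as good."""
    whitelist.disable(address)
    blocklist.auto_add(address)


whitelist, blocklist = RuleList(), RuleList()
whitelist.auto_add("sender@example.com")      # message predicted good, auto-trained
train_as_spam("sender@example.com", whitelist, blocklist)
print(whitelist.matches("sender@example.com"))   # the whitelist rule no longer applies
print(whitelist.auto_add("sender@example.com"))  # and it cannot be auto-added again
```

Had the rule been deleted instead of disabled, the final `auto_add` call would have succeeded and the address would be back on the whitelist.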

Michael_Tsai · July 18, 2007, 9:16am

Again, please e-mail me a report with your log file and false negative files. I need to see what’s going on in order to help you.

That’s probably normal.

sharkez · July 20, 2007, 6:42am

train as spam not changing rule I think
Michael,
Sorry, I’ve not been back for a bit but I don’t think the train as spam is changing the previously created white rule.

I think you didn’t get the attachment because I’m noticing now that the attachment types allowed here don’t include .log (or text, I guess), even though it looked like it was attached. I’ll zip up the files. I’m sending a zip with four false-negative PDF attachments and the log file to spamsieve-fn@c-command.com. Is there any way to send the blocklist and whitelist?

Also, it appears that part of what is going on is just a filter on the address, at least that is what is added to the whitelist. What type of filtering are you doing on the contents?

sharkez · July 20, 2007, 6:50am

actually is unchecking…
Actually, I think it is doing the “removal” since the check mark goes away when I train as spam. So if that is working, then why am I still getting lots of PDF spam? Are you only filtering on the address? That would be only marginally useful, since spammers use thousands of addresses (even mine!).

So how are you filtering on the content? Does train as spam also make more likely the catching of a pdf with one of the few typical names (advertisement.pdf, report.pdf, message.pdf, etc.)?

How about actually looking at the pdf content itself (could be tough–slow…)?

Michael_Tsai · July 20, 2007, 7:00am

The forum doesn’t allow text or Zip attachments (except from me), and you probably don’t want your log file on the Web, anyway.

Sounds good.

Those are stored in:

~/Library/Application Support/SpamSieve/Rules

but there’s probably no need to send them.

That’s not correct. SpamSieve uses a variety of filters, the most important being the Bayesian classifier.

I don’t know—that’s why I need you to send the log.

Yes.

SpamSieve 2.6.2 does that, and the next version will do so more.
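For readers unfamiliar with Bayesian classifiers, here is a minimal sketch of how a generic Bayesian spam filter might combine per-token spam probabilities into a single score. It is a textbook naive Bayes combination, not SpamSieve’s actual algorithm; the function name `combine` and the clamping constant are assumptions. The sample values are the numbers in parentheses from the first “Reason:” line in the log excerpt earlier in the thread, on the assumption that they are per-token probabilities.

```python
import math


def combine(token_probs):
    """Combine per-token spam probabilities, naive Bayes style.

    Sums log-odds rather than multiplying probabilities, to avoid underflow.
    """
    log_odds = 0.0
    for p in token_probs:
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp so log() stays finite
        log_odds += math.log(p) - math.log(1.0 - p)
    # Numerically stable logistic function
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Values copied from the first "Reason:" line quoted earlier in the thread
reason = [0.001, 0.001, 0.001, 0.001, 0.998, 0.995, 0.909, 0.889,
          0.842, 0.842, 0.160, 0.160, 0.170, 0.170, 0.191]
print(f"{combine(reason):.3f}")
```

In this kind of combination, a handful of tokens with probabilities near 0.001 (here, headers typical of Outlook Express) can outweigh several strongly spammy tokens. That is one plausible reason a message like this would score as good until the filter has been trained on similar spam.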

sharkez · July 20, 2007, 7:13am

Just looked at the corpus and I know my examples sent to you this morning included one or two message.pdf attachments. The corpus has S:message.pdf listed with only one Spam count from 7/10/07; there is no straight message.pdf listed.

Michael_Tsai · July 20, 2007, 7:49am

None of the four false negative files that you sent me included “message.pdf” attachments, though they did have PDFs with other names.

The log shows two messages (one from 6/27 and one from 7/11) that had “message.pdf” in the subject (that’s what “S:message.pdf” means), both of which SpamSieve figured out were spam on its own. On 7/15 SpamSieve classified a message with subject “Re:” as spam, partially because it contained an attachment named “message.pdf”.
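The difference between “S:message.pdf” and a plain “message.pdf” entry can be pictured with a small sketch. This is a guess at the general idea, not SpamSieve’s tokenizer: the `tokens` function is hypothetical, and real tokenization (case handling, punctuation, other prefixes) is certainly more involved.

```python
def tokens(subject, attachment_names):
    """Hypothetical tokenizer: subject words get an "S:" prefix,
    attachment names are kept as bare tokens."""
    result = ["S:" + word for word in subject.split()]
    result.extend(attachment_names)
    return result


# "message.pdf" in the subject vs. as the name of an attachment
print(tokens("message.pdf", []))
print(tokens("Re:", ["message.pdf"]))
```

Under this scheme the same filename yields two distinct tokens, each with its own spam and good counts in the corpus, depending on where it appeared in the message.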

The log file included some other information that I found useful and that I will use to improve the next version of SpamSieve.

sharkez · July 20, 2007, 8:05am

My mistake, it looks like you are correct: SpamSieve found the two with message.pdf and marked them as spam before I saw them. The two I sent had “complaint.pdf” and “magazine.pdf”.

It makes me wonder, is it possible to have a rule that filters it out if all that is in the message is a single pdf (no text) not from someone in your address book?

Michael_Tsai · July 20, 2007, 8:30am

You can’t do exactly that, but you could create a blocklist rule that says “Any Attachment Name Ends With .pdf”, and if you have Use Mac OS X Address Book checked it will automatically not affect messages from people in your address book. However, I don’t expect that to be necessary. You can improve SpamSieve’s accuracy by resetting the corpus and re-training it with some more recent messages or simply wait for the next version.
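The logic of that blocklist rule, combined with the address book exemption, can be sketched as follows. This illustrates the decision only, not how SpamSieve evaluates rules; the function name `blocked_by_pdf_rule` is hypothetical, and the case-insensitive comparisons are an assumption.

```python
def blocked_by_pdf_rule(sender, attachment_names, address_book):
    """Return True if a rule like "Any Attachment Name Ends With .pdf"
    would block the message.

    address_book is a set of lowercase e-mail addresses; senders found
    in it are exempt from the rule.
    """
    if sender.lower() in address_book:
        return False  # known correspondents are never blocked by the rule
    return any(name.lower().endswith(".pdf") for name in attachment_names)


address_book = {"friend@example.com"}
print(blocked_by_pdf_rule("stranger@example.net", ["Invoice.pdf"], address_book))
print(blocked_by_pdf_rule("friend@example.com", ["Invoice.pdf"], address_book))
```

Note that such a rule would also block legitimate PDFs from senders who are not in the address book, which is a good reason to prefer retraining over a blanket rule.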

trb · July 27, 2007, 4:04am

distinguish good and bad pdfs?

I’ve been getting quite a lot of PDF spam these days as well. I have not elected to mark them as junk, because I do receive lots of legitimate PDFs.
Will SpamSieve differentiate between them?

best wishes
T

Michael_Tsai · July 27, 2007, 6:01am

If you expect to get good accuracy it’s essential that you tell SpamSieve the truth. If you get a PDF spam and let SpamSieve think that it’s good, not only will more PDF spams get through, but SpamSieve will also start letting other spam messages through, and possibly even think that some good messages are spam.

Yes, SpamSieve can learn the difference between spammy PDFs and good ones.

Michael_Tsai · August 3, 2007, 11:52am

SpamSieve 2.6.3 is much better at analyzing messages containing PDF attachments. Some improvements will take effect automatically, but to get the full benefit you’ll need to reset SpamSieve’s corpus and re-train it. I would only bother doing that if you find, several days after updating to 2.6.3, that PDF spams are still getting through. That won’t be the case for most users.