Teach SS the concept of a disguised URL

mattn · February 21, 2025, 10:59pm

Call me simple-minded, but it seems to me that the most obvious characteristic of a phishing spam is a link that poses as an apparent URL that hides a totally different URL — for example, it says www.intuit.com but in fact it goes to some t.com URL. I’ve noticed that SpamSieve quite often misses these. Isn’t there some way to just jump past all the Bayesian probability theory and teach SpamSieve this simple rule? I mean, gosh, even my 95-year-old mother understands this one. (Well, she sort of does.)

Michael_Tsai · February 22, 2025, 3:38am

If SpamSieve is often missing these, that probably points to another problem. So I would be interested to see the log entries for these false negatives.

My gut says that the simple rule would probably lead to lots of false positives, but I agree that it’s worth trying it out and running it against a large corpus of messages to see.
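For anyone who wants to try the simple rule against their own corpus, here is a minimal sketch in Python. It is not how SpamSieve works internally. The names `disguised_links`, `LinkCollector`, and `registrable` are made up for illustration, and comparing only the last two labels of a host name is a crude assumption that mishandles suffixes like `.co.uk`.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re

# Text that looks like a host name, e.g. "www.intuit.com".
# What counts as "URL-like" visible text is an assumption of this sketch.
HOST_RE = re.compile(r"(?:https?://)?((?:[a-z0-9-]+\.)+[a-z]{2,})", re.IGNORECASE)


def registrable(host):
    """Crude stand-in for a registrable domain: the last two labels."""
    return ".".join(host.lower().split(".")[-2:])


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def disguised_links(html):
    """Return (visible text, href) pairs whose shown and actual domains differ."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    found = []
    for href, text in collector.links:
        shown = HOST_RE.search(text)
        actual = urlparse(href).hostname
        if shown and actual and registrable(shown.group(1)) != registrable(actual):
            found.append((text, href))
    return found


sample = (
    '<a href="http://login.t.com/x">www.intuit.com</a> '
    '<a href="https://www.intuit.com/help">intuit.com</a> '
    '<a href="https://example.org">Click here</a>'
)
print(disguised_links(sample))
```

Only the first link in `sample` is reported. A rule this blunt would also flag any legitimate message whose visible link text names one domain while the href goes through a different click-tracking domain, which is the kind of false positive that would need measuring on a large corpus.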

mattn · February 22, 2025, 2:36pm

I dragged four recent false negatives out of the log and sent them; they seemed to me particularly obvious spam.

Michael_Tsai · February 22, 2025, 3:55pm

I don’t think we received anything from you. Where did you send them? Mostly I’m interested in the log entries or diagnostic report, though sometimes the messages themselves have useful info.

mattn · February 22, 2025, 5:41pm

Hmm. Sent to spamsieve@c-command.com. I just included the messages from the log. I don’t know what the difference is between that and “log entries”; the log entries are messages.

Michael_Tsai · February 23, 2025, 12:01am

No, I haven’t received anything from you. Maybe an outgoing mail server thought they were spam? Could you send a diagnostic report or PM them in the forum?

The log entries are what you get when you Edit ‣ Copy from the table in the Log window. They show how SpamSieve analyzed the messages.

mattn · February 23, 2025, 2:12am

I would recommend adding a contextual menu to the Log window. It would never in a million years have occurred to me to go to the Edit menu and say Copy! I find that both opaque and clumsy. If something is copyable, I expect a contextual menu to tell me (and to do it). A contextual menu is how you ask, “what can I do here?”

I don’t mind including the messages and logs right here.

false negatives.zip (39.1 KB)

You’ll notice that two of them got whitelisted. But wrongly! For example, yes, I have trained some Best Buy emails as good. But this one isn’t really from Best Buy! And the one that is whitelisted because it comes from “Mail Delivery System” is the same: it’s just a spoof and contains evil links. So maybe part of the problem is that SpamSieve is not being suspicious enough about the From of an email message! That is one of the most easily forged things about an email. A “known” From is no reason for SpamSieve not to examine the message itself and see if anything is amiss.

==============================[Flagged]==============================
🟠Predicted: Good (27) [Mistake: False Negative]
Subject: You have a pending payment on your account.
From: "Intuit QuickBooks" <work@secure.net>
To: tidbits@apeth.net
Identifier: PG8/DHn0OUi9aprfLPirkQ==
Mailbox: Tidbits ‣ INBOX
Source: com.c-command.spamsieve.apple-mail.script.filter-inboxes
Reason: P(spam)=0.000[0.493], bias=0.429, R:^140(1.000), S:pending(1.000), F:Books(0.001), F:Quick(0.999), V:protect(0.002), V:hold(0.004), V:We'd(0.009), RP:VM(0.017), V:clarify(0.017), V:pending(0.017), V:placed(0.017), V:like(0.160), MT:MSHTML11.00.10570.1001(0.828), V:some(0.186), V:account(0.210)
Date: 2025-02-21 14:53:05 -0800 (PST)

==============================[Flagged]==============================
🟠Predicted: Good (1) [Mistake: False Negative]
Subject: Thanks for your order.
From: "Best Buy Notifications" <bestbuyinfo@emailinfo.bestbuy.com>
To: tidbits@apeth.net
Identifier: b980kfQI4L1rdq0450oY9g==
Mailbox: Tidbits ‣ INBOX
Source: com.c-command.spamsieve.apple-mail.script.filter-inboxes
Reason: “Best Buy Notifications” matched rule <fromName exact "Best Buy Notifications"> in allowlist
Date: 2025-02-20 09:09:17 -0800 (PST)

==============================[Flagged]==============================
🟠Predicted: Good (27) [Mistake: False Negative]
Subject: Ԝе'vе kոoԝո еасh οthеr fоr а ԝhіlе, аt lеаѕt Ӏ kոοԝ you.
From: "Security Alert." <pzojknjc@stnet.es>
To: tidbits@apeth.net
Identifier: ty/RXYWbHSoxothmow92+g==
Mailbox: Tidbits ‣ INBOX
Source: com.c-command.spamsieve.apple-mail.script.filter-inboxes
Reason: P(spam)=0.004[0.498], bias=0.429, CT:58c(0.999), to:@outlook.com(0.999), CT:058(0.999), CT:62c(0.017), CT:6df(0.017), CT:74c(0.017), CT:812b(0.017), H:X-Vade-Verdict(0.017), RP:VK(0.017), hok(0.017), hok(0.017), to:me@(0.017), F:Alert(0.898), bitcoin(0.868), bitcoin(0.868)
Date: 2025-02-19 01:03:52 -0800 (PST)

==============================[Flagged]==============================
🟠Predicted: Good (1) [Mistake: False Negative]
Subject: Undeliverable: Outgoing mail failed
From: "Mail Delivery system" <mailer-daemon@host2.i4dots.com>
To: tidbits@apeth.net
Identifier: pNnj5ljobvwl/Nxs27k1jw==
Mailbox: Tidbits ‣ INBOX
Source: com.c-command.spamsieve.apple-mail.script.filter-inboxes
Reason: “Mail Delivery system” matched rule <fromName exact "Mail Delivery System"> in allowlist
Date: 2025-02-18 18:10:01 -0800 (PST)
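For readers trying to make sense of the two Bayesian `Reason:` lines above, here is a small Python sketch that splits one into its items. It assumes each item has the shape `token(number)` and that the number is that token's individual spam probability, with values near 0 leaning good and values near 1 leaning spam. The thread itself does not spell this out, and `parse_reason` is a hypothetical helper, not part of SpamSieve.

```python
import re

# One "token(probability)" item, e.g. "F:Quick(0.999)" or "hok(0.017)".
ITEM_RE = re.compile(r"^(?P<token>.+)\((?P<prob>[01]\.\d+)\)$")


def parse_reason(reason):
    """Return (token, probability) pairs from a Bayesian Reason line."""
    pairs = []
    for item in reason.split(", "):
        match = ITEM_RE.match(item.strip())
        if match:  # skips the leading "P(spam)=..." and "bias=..." items
            pairs.append((match.group("token"), float(match.group("prob"))))
    return pairs


# The Reason line from the Intuit QuickBooks entry above.
reason = (
    "P(spam)=0.000[0.493], bias=0.429, R:^140(1.000), S:pending(1.000), "
    "F:Books(0.001), F:Quick(0.999), V:protect(0.002), V:hold(0.004), "
    "V:We'd(0.009), RP:VM(0.017), V:clarify(0.017), V:pending(0.017), "
    "V:placed(0.017), V:like(0.160), MT:MSHTML11.00.10570.1001(0.828), "
    "V:some(0.186), V:account(0.210)"
)

pairs = parse_reason(reason)
good_leaning = [token for token, prob in pairs if prob < 0.5]
spam_leaning = [token for token, prob in pairs if prob >= 0.5]
print(len(good_leaning), "items lean good;", len(spam_leaning), "lean spam")
```

Under that reading, most of the listed items in this entry sit below 0.5, which fits the low overall `P(spam)` even though a few items score close to 1.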

Michael_Tsai · February 23, 2025, 5:09pm

Great suggestion, thanks.

Well, that’s what the allowlist does. It trusts the message’s From and guarantees that any message (claiming to be) from a name or address on your allowlist will get through. It errs on the side of safety. If a name or address turns out to be spoofed, when you train the message as spam, SpamSieve will disable the allowlist rule (unless you locked it) so it doesn’t cause problems in the future. Or, if you don’t want to trust message senders at all, maybe you would prefer to turn off Use SpamSieve allowlist in the settings. Then it will examine the whole message.
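The behavior described above can be sketched roughly as follows in Python. This is an illustration of the described logic, not SpamSieve's actual code: `Rule`, `classify`, and `train_as_spam` are hypothetical names, and the case-insensitive name comparison is an assumption based on the log entry where “Mail Delivery system” matched a rule for "Mail Delivery System".

```python
from dataclasses import dataclass


@dataclass
class Rule:
    from_name: str
    enabled: bool = True
    locked: bool = False

    def applies_to(self, sender_name):
        # Assumption: names are compared without regard to case.
        return self.from_name.casefold() == sender_name.casefold()


def classify(sender_name, allowlist, content_says_spam):
    """An enabled allowlist match wins outright; otherwise content decides."""
    for rule in allowlist:
        if rule.enabled and rule.applies_to(sender_name):
            return "good"
    return "spam" if content_says_spam else "good"


def train_as_spam(sender_name, allowlist):
    """Disable matching rules that turned out to be wrong, unless locked."""
    for rule in allowlist:
        if rule.applies_to(sender_name) and not rule.locked:
            rule.enabled = False


allowlist = [Rule("Best Buy Notifications"), Rule("Mail Delivery System", locked=True)]

# A spoofed message gets through once because the name is on the allowlist...
print(classify("Best Buy Notifications", allowlist, content_says_spam=True))
# ...but after it is trained as spam, the rule no longer applies.
train_as_spam("Best Buy Notifications", allowlist)
print(classify("Best Buy Notifications", allowlist, content_says_spam=True))
```

The locked rule in the example stays enabled after training, so a forged “Mail Delivery system” message would keep getting through until the rule is unlocked or removed by hand.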

Two of the false negatives got through because of the allowlist. One seems to point to an issue with SpamSieve’s detection of invisible text; I’ll look into that. The other was confused by your server junk filter’s thinking the message was good (it should be ignoring that) and by some text in the MIME structure (already slated for revision in one of the next updates).

mattn · February 24, 2025, 2:09am

All of that sounds great. I might even turn off the allow list, though I’m not sure I want to go that far.

Michael_Tsai wrote: “Well, that’s what the allowlist does. It trusts the message’s From and guarantees that any message (claiming to be) from a name or address on your allowlist will get through. It errs on the side of safety.”

But it doesn’t err “on the side” of safety; it just falls completely into the spammer’s trap. That seems a bit extreme.

Maybe SpamSieve might evolve to be more sophisticated about what the allowlist does? Like maybe, instead of just trusting the message’s From, it could examine the whole message anyway and then if it thinks it might be spam, it could then use the fact that this is on the allowlist to weight us further towards declaring the message good — but if it seriously looks like spam, then that would outweigh the fact that it’s on the allowlist. Does this sort of combinatory idea make any sense?
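To make the combinatory idea concrete, here is one possible sketch in Python. It describes the proposal only, not anything SpamSieve does. The threshold of 50 is an assumption about how the scores in the log (such as "Good (27)") map to a verdict, and `ALLOWLIST_DISCOUNT` is an arbitrary illustrative weight.

```python
SPAM_THRESHOLD = 50      # assumption: scores of 50 and up count as spam
ALLOWLIST_DISCOUNT = 25  # hypothetical weight given to an allowlist match


def is_spam(content_score, on_allowlist):
    """Treat an allowlist match as evidence for good, not as an override."""
    adjusted = content_score - ALLOWLIST_DISCOUNT if on_allowlist else content_score
    return adjusted >= SPAM_THRESHOLD


# A borderline message from an allowlisted sender is let through...
print(is_spam(60, on_allowlist=True))
# ...but one that seriously looks like spam is caught anyway.
print(is_spam(95, on_allowlist=True))
```

The cost of this design is that an allowlist entry no longer guarantees delivery: a trusted sender whose message scores high enough on content would still land in Junk.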

Michael_Tsai · February 24, 2025, 3:15am

There’s also a middle ground where you can turn off Train SpamSieve allowlist. Then it will only use the allowlist rules that you make yourself (or that are already there). So you’d be less likely to have rules that match spoofed messages.

The idea is to provide certainty, to override the fuzzy logic when you know what you want. If you put something on the allowlist you will get the matching messages in your inbox. Period. If you can’t rely on it, what would be the point?

That’s pretty much what it already does when you turn off the allowlist.

mattn · February 24, 2025, 2:37pm

Yeah, I see that I can be much more proactive about the allowlist and what it contains, and that I haven’t been doing that. I’m going to study the manual on this point. Thanks!

Michael_Tsai · March 3, 2025, 6:32pm

I moved this post back to the original topic because the other topic is about spam messages from the user’s own server and scripts to create Reply-To rules.

What makes you say that they’re messed up? As soon as a rule is falsified, SpamSieve will disable it. So the only way the rule lists should get messed up and cause future problems is if you aren’t correcting the mistakes.

And this is only relevant if you happen to have previously received a good message from the same (purported) sender as one that a spammer tried to forge.

Maybe I should make a statistic for this. But one way to investigate it is to look at the Allowlist window sorted by Incorrect. The rules with incorrect hits should be disabled because you would have trained a message as spam. And you can look at these and see how many of them are names/addresses that ever actually sent you good mail vs. ones that SpamSieve auto-trained because the Bayesian classifier thought the message was good. I find that there are very few in the first category.

It’s worked this way for more than 20 years, and I have seen a ton of evidence that the default behavior works well. Amongst other reasons, it addresses two extremely common cases:

The user has a history of receiving good messages from a certain person and expects them to never go to Junk no matter what the person writes or forwards. It’s extremely unlikely that a spammer will happen to forge the name or address of your friend.
A spammer with a distinct name or address sends a series of messages, and after training the first, no matter the content of the messages, you want to be sure that messages 2…n go to the Trash.

The first thing to be aware of is that most of the rules come from auto-training (you receive a message, SpamSieve classifies it as good, so it tentatively adds the sender to the allowlist), not from the user explicitly training a message. I think you would probably agree that users don’t want prompts coming up throughout the day that are not in response to manual action.

In the case of manual training, I would like to hear some examples of when this would happen and how you would decide. I really don’t understand what you are trying to protect against. For example, in the case of a false positive:

SpamSieve already thought that the good message was spam because of its content.
You’re training the message as good, but you want to specifically tell SpamSieve not to trust that future messages come from that sender? It should classify them based on content alone, even though that didn’t work the first time?
Because you are determining that this particular sender is likely to be forged in the future?

I wonder whether I’m missing something here or you have an entirely different story in mind. I can tell you that before the rules feature existed, one of the most frequent pieces of feedback I got was: “SpamSieve made a mistake, I trained the message, and then it made the same mistake again. How come it didn’t immediately learn that that sender was good/bad after I specifically told it?”

mattn · March 4, 2025, 2:04am

That’s okay, I’ve deleted my comment. It’s not worth worrying about.

Michael_Tsai · March 12, 2025, 8:08pm

I think this is fixed in SpamSieve 3.1.2b1.