There are times when I think SpamSieve is being smarter than I want it to be.
For example, I don’t want email names added to my whitelist. First, because I get massive named blasts from my office, I want to just create one rule (ends with @roadscholar.org) and have it go through. I don’t want a separate rule for every email and every name in the company who may have been on an email once.
Similarly on personal correspondence, a rule for yournamehere@gmail.com makes sense to me, but I don’t want to have a separate rule for Jobob McEmailer as a name rule.
Is there a way to get the automated system to only build rules for addresses and not names?
Also, is there a way to not create a new rule if one already applies? (I know that makes weighting more difficult)
Lastly, what is the encoding of the database? I’ve seen a few cases where unicode names are stored incorrectly. Is that a case of an issue inside the app or the MUA?
What problem are you trying to solve here? SpamSieve does this because it’s safer. It can help protect you in the event that someone changes their address or e-mails you from a different address. It also protects you if you get a spam message forged from one of these people; training it as spam will still leave the other people whitelisted. Lastly, multiple rules are faster than using an “ends with” rule.
Why not? Are you likely to get spams forged from that name? Or is this an aesthetic preference when looking at the Whitelist window?
I don’t recommend this, but you can click here to use a hidden setting to turn that off.
No. As described above, I’ve found that having more granular rules is often beneficial.
SpamSieve uses Unicode internally. It could be that the names in the messages did not specify an encoding or did not actually use the encoding that they specified. If you can find the messages in question, I could investigate further to make sure it’s a problem with the messages rather than SpamSieve’s parser.
SpamSieve uses Unicode internally. It could be that the names in the messages did not specify an encoding or did not actually use the encoding that they specified. If you can find the messages in question, I could investigate further to make sure it’s a problem with the messages rather than SpamSieve’s parser.
See ‘should be apostrophe.png’. I associate ‘a-with-accent’ with Unicode (specifically UTF-8) being interpreted as ISO Latin 1 (ISO-8859-1). I’m fairly certain you’ll say this is a problem with the email.
What problem are you trying to solve here?
and
Or is this an aesthetic preference when looking at the Whitelist window?
Certainly it makes the whitelist easier for a human to read and curate as a single rule. If these were regular expressions, I’d certainly understand a possible performance implication, but intuitively, one substring comparison versus more than a hundred string comparisons seems like it would come out in favor of the substring. Currently, there are 6605 matches to the domain rule I created. That’s the strongest argument, clearly.
Also, see the ‘Names as addresses.png’ attachment for one part of my problem. I can’t do much about email addresses being sent as names, but having names as email fields is just… frustrating.
If you would like to send in the raw .eml file for this message, I can take a look. My guess is that this message contains unencoded 8-bit characters (which is not legal) and that they are in the UTF-8 encoding (whereas SpamSieve defaults to ISO Latin 1 if no encoding is specified). It’s possible that a future version of SpamSieve could let you change the default. But that is not straightforward to do because it would invalidate all the training that had been done using the old default. I understand that you want to see the text displayed as intended, but I place that at a lower priority than filtering accuracy…
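To illustrate the kind of mismatch guessed at above, here is a minimal Python sketch. It is not SpamSieve’s code, and the sample word is made up; it only shows what happens in general when the UTF-8 bytes for a curly apostrophe are decoded as ISO Latin 1.

```python
# Illustration only: a made-up sample, not taken from the message in question.
original = "don\u2019t"  # contains U+2019, the right single quotation mark

# The sender's software writes the text as UTF-8 without declaring it.
raw_bytes = original.encode("utf-8")  # b'don\xe2\x80\x99t'

# A reader that falls back to ISO Latin 1 turns each byte into one character.
# 0xE2 becomes "a with circumflex"; 0x80 and 0x99 become invisible C1 controls.
misread = raw_bytes.decode("latin-1")

# Because Latin 1 maps every byte to exactly one code point, the damage is
# reversible as long as the misread string was stored unchanged.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(misread))
print(repaired == original)
```

If the same bytes are shown by software that assumes Windows-1252 rather than ISO Latin 1, the two trailing bytes usually appear as visible symbols instead of control characters, which is one reason the same message can look different in different places.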
I would rather provide better curation tools (e.g. let you easily see the rules you manually created vs. the auto-created ones) than throw away potentially useful data (by collapsing information about all the addresses into a single rule).
Yes, but that’s not what it’s doing. The rule text can be preprocessed so that for Is Equal To matches it only has to do a hash or binary search lookup against all the rules simultaneously. Whereas, if you have lots of Ends With rules they would (at least with the current code) have to all be evaluated separately.
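As a rough sketch of the difference described above (an assumption-laden illustration in Python, not SpamSieve’s actual implementation; the addresses, rule lists, and function names are all hypothetical): exact-match rules can be loaded into a hash set and checked in a single lookup, while suffix rules are tested one at a time.

```python
# Hypothetical rule data for illustration only.
exact_rules = {"alice@example.org", "bob@example.org"}  # "Is Equal To" rules
suffix_rules = ["@example.org", "@example.net"]         # "Ends With" rules


def matches_exact(sender: str) -> bool:
    """One hash lookup covers every exact rule at once."""
    return sender.lower() in exact_rules


def matches_suffix(sender: str) -> bool:
    """Each suffix rule is evaluated separately until one matches."""
    lowered = sender.lower()
    return any(lowered.endswith(suffix) for suffix in suffix_rules)


print(matches_exact("Alice@Example.org"))
print(matches_suffix("carol@example.org"))
```

With this layout the cost of the exact check stays roughly constant as exact rules are added, whereas the suffix check grows with the number of suffix rules. A single suffix rule is of course cheap on its own; the point is only about how each kind of rule scales.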
Could you explain what problem this is part of? To me, it doesn’t seem to relate to what you mentioned above aside from the aesthetics of browsing the whitelist. What are you suggesting should happen here?