Share training data between multiple Macs

NikolausDemmel · January 13, 2025, 6:26pm

In a sense this is a follow-up from my previous question, where I describe my SpamSieve setup with two macs running SpamSieve at the same time:

SpamSieve "Filter Messages" Shortcut not working

The reason is that I’m running SpamSieve on two machines. For one, it’s running on my daily driver laptop. But since this is not always running, and also doesn’t always have an internet connection, I’m also running SpamSieve on my old MacBook, which sits at home. So some messages in the Junk mailbox have been moved there by one SpamSieve instance and some by the other. But if there is a misclassification, I usually only “Train as Spam” for the SpamSieve on my main laptop, which means the SpamSieve database on the old laptop gets outdated. So from time to time I log into the old machine and go through the spam folder. I then select the spam messages that were not filtered on that machine, run “Filter Messages” and train the ones not classified as spam. (Oh, and if I notice a wrongly classified message in the Junk mailbox, I try to “Train as Good” from the machine that put it there.)

The support page on running SpamSieve on multiple macs suggests 4 options, none of which are very satisfying for my use case:

(1) Run SpamSieve on a single Mac and let it clean your inbox for all the Macs. All the training is done from that Mac. This is the simplest solution. It works well when the Mac with SpamSieve will be running most of the time, and when you can easily access that Mac to do the training.

Doesn’t work for me. My old Mac is slow and unreliable, sometimes shutting down for no reason such that I have to reboot it when I’m home. So I also want SpamSieve to run on my (new) daily driver Mac, so I get the best / quickest Spam filtering experience when working. But I also don’t want to only run on the new Mac, since when I’m on the go and the new Mac isn’t running (or doesn’t have internet connection) I don’t want to see Spam on my phone.

(2) Run SpamSieve on a single Mac using the drone setup. This setup works well when the Mac with SpamSieve will be running most of the time. When you’re away from that Mac, you can remotely train SpamSieve from any Mac, iOS device, or even via Web mail.

Doesn’t work for me for the same reason as above. Also, it would lose the convenience of interacting with SpamSieve, e.g. the shortcuts to train as Good or Bad.

(3) Run SpamSieve on all the Macs and uncheck the Auto-train as needed preference (on all the Macs). You can train whichever Mac you happen to be using at the moment. This will have lower filtering accuracy than (1) or (2) but is useful in situations when you do not have a single Mac that is always available for mail filtering. With auto-training off, you may find it especially helpful to enable allowlisting of previous recipients.

This is somewhat my setup, but it isn’t ideal, as it’s somewhat cumbersome to train the correct Mac when filtering isn’t working. I keep having to connect to the older Mac to train it when it makes a mistake. Having to turn off auto-training is also a clear downside.

(4) Run SpamSieve on all the Macs, being careful to only let one copy of SpamSieve run at a time, and to always correct all the mistakes before switching to another Mac. This will give better filtering accuracy than (3) but is a lot more work.

This would be even more effort to coordinate in my case.

What I would really want is for the multiple SpamSieve instances to share the same training data. Then I could just use my new laptop to train SpamSieve, and when that is turned off, the old one would still filter out messages based on the same training data. I could correct mistakes only on the new laptop and would never have to log into the old one to keep its training data current.

In my particular setup, I would be perfectly happy with a one-way sync, where only one machine acts as a “primary” instance, and all other sync’d “secondary” instances only receive updates to the training data via sync’ing. The secondary instances would not update the training data themselves, i.e. you’d have to correct mistakes on the primary instance only.

For sync’ing data, iCloud would make sense, or else any other file-based sync’ing software such as Dropbox.

Do you think adding such a feature in the future would be possible?

Michael_Tsai · January 13, 2025, 6:59pm

It sounds like setup (3) would be the best for you. The point of turning off auto-training is so that you don’t have to train the “correct” Mac.

You can do this if you want. That’s what this part of the help is talking about:

When upgrading to a new Mac or using setup (3) or (4) above, you can copy SpamSieve’s training data from one Mac to another. This is only recommended if the two Macs will be filtering the same mail account. Macs filtering different people’s mail should be trained separately for the best filtering accuracy.

Do not copy SpamSieve’s files using a file synchronization program or cloud syncing utility such as Dropbox while the SpamSieve application is running. Doing so can corrupt the files.

I don’t plan to make something like that a built-in feature. I see it as too complicated and inefficient for most users. So I’m working on true two-way syncing, which I think is the way most people would expect it to work. Of course, that is easier said than done given the amount of data that SpamSieve works with.

NikolausDemmel · January 13, 2025, 7:32pm

Thanks for your reply!

Yeah, but the old Mac in particular does need to be trained relatively frequently, as it misclassifies messages from time to time. I would expect auto-training to improve this and make manual training necessary less often.

Yes, but I can’t sync this continuously while SpamSieve is running. However, maybe it would be a good idea to copy the training data from the new mac to the old one from time to time, to manually sync them. This way I could mostly focus on correcting errors on the new mac and the old one would get updated from time to time. Not ideal, but maybe better than what I’m doing now. I do feel the old Mac’s training data is considerably worse than the new one. I’ll start with a one-time copy.
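For anyone wanting to script that occasional manual copy, here is a minimal sketch of the idea, not an official SpamSieve tool. It assumes the training data from the new Mac has already been transferred to the old Mac as a folder, and it refuses to touch anything while a process named "SpamSieve" is running locally, in line with the warning in the help. The function names, the process name, and the folder paths are all assumptions for illustration; check the SpamSieve manual for where its files actually live. The check only covers the Mac the script runs on, so SpamSieve should also be quit on the Mac the folder was copied from.

```python
import shutil
import subprocess
import time
from pathlib import Path
from typing import Callable, Optional


def app_is_running(process_name: str) -> bool:
    """Return True if a process with exactly this name is running (via pgrep)."""
    result = subprocess.run(["pgrep", "-x", process_name], capture_output=True)
    return result.returncode == 0


def replace_training_data(
    source: Path,
    destination: Path,
    # Assumption: the app's process is called "SpamSieve".
    is_running: Callable[[], bool] = lambda: app_is_running("SpamSieve"),
) -> Optional[Path]:
    """One-way copy of a training-data folder, keeping a backup of the old one.

    Returns the path of the backup folder, or None if there was nothing to back up.
    """
    if is_running():
        raise RuntimeError("Quit SpamSieve before copying its files.")
    if not source.is_dir():
        raise FileNotFoundError(source)

    backup = None
    if destination.exists():
        # Move the existing data aside instead of deleting it.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = destination.with_name(f"{destination.name}.backup-{stamp}")
        destination.rename(backup)

    shutil.copytree(source, destination)
    return backup
```

On the old Mac this would be called with the transferred folder as `source` and the local SpamSieve data folder as `destination` (both paths to be filled in by hand), after which SpamSieve can be launched again.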

The main reason I suggested automatic one-way syncing was that I thought it might be considerably less complicated to implement. I’ll be looking forward to two-way syncing. This would work perfectly for my use case as well, of course. But I understand it’s not a simple feature.

Michael_Tsai · January 13, 2025, 8:22pm

Yes, I think if the “secondary” Mac starts out with a copy of the good data, it will do well for quite a while without needing a lot of training, as long as you have auto-training off.

In theory it would be, but there’s a long history of apps trying to build sync on top of file-based syncing services, and it doesn’t seem to end well. So basically this is a judgement call that this sort of half-solution would not be as simple as it seems and, even when done, would leave a lot of people unhappy and/or needing lots of technical support. So I think it’s better to dedicate the time to a “proper” solution.

NikolausDemmel · January 13, 2025, 10:04pm

This sounds very reasonable.

As always, thanks a lot for the helpful tips.

NikolausDemmel · January 13, 2025, 10:57pm

Actually, I have one more question: What makes sense to me is that I should turn off auto-training on the old Mac in my setup, since I’m not there all the time correcting mistakes. However, I would like to turn on auto-training on the new Mac, which I use daily and where I can correct most of the mistakes.

However, I found one behaviour that stops me from turning it on on the new Mac: I have a “Rescue Good Messages” rule enabled, since my provider’s server-side spam filtering sometimes puts good mail in Spam. I’d like to keep the server-side filtering on, since overall it does a reasonable job on the worst Spam, which keeps it off my phone when I’m on the go. The rescue rule is only enabled on the new Mac.

Now what can happen with the two-Mac setup is that a Spam message is classified correctly by the old Mac, but wrongly by the new Mac. So the old Mac puts it in Junk and the new Mac’s “rescue” rule puts it back in the inbox. This keeps happening over and over for the same message until I notice and step in by e.g. marking the message as read. With auto-training enabled, you can see in the Log that the score on the new Mac, which might have started out as “just barely not spam”, keeps getting lower and lower until it’s eventually 0. I can’t imagine that this is ideal for the overall health of the corpus.

Is there a setting to stop SpamSieve from auto-processing the same message twice (e.g. in the “Rescue good” rule)?

(Maybe now that I’ve manually synchronized the training data on both instances, this is less likely to happen.)

Michael_Tsai · January 13, 2025, 11:18pm

Are you sure it doesn’t go directly to zero because it auto-trained the message and then finds it in the corpus? SpamSieve will see that it’s the same message and only train it once, so there’s only one corpus entry that needs to be eventually corrected.

Not exactly, but you could use the TrainSpam mailbox to train the new Mac from the old one.

NikolausDemmel · January 14, 2025, 12:04am

You are right. I could have sworn I’ve seen the gradual decrease in score. But I looked at several examples in the Log, and it jumps to 0 after the auto-train as you say, and also only auto-trains once.

So provided that I correct the mistakes eventually, having auto-train enabled on the new mac with this occasional flip-flopping should not be harmful. I’ll try enabling again and see how it goes.

NikolausDemmel · January 31, 2025, 10:17am

I can report that overall this has been working quite well for me. The misclassifications on the old mac have been reduced after I did a one-time sync with the corpus of the new mac. I occasionally have to correct mistakes on the new mac, but it’s mostly missed spam messages. Since on the old mac “rescue good messages” is disabled, there is no flip-flopping in that case. I think with occasional manual sync of the corpus this might work very well for me going forward.