Spamassassin Bayesian filter issues

postfix
spam

(Matthieu Gaillet) #1

NethServer Version: 7.4

Hi,

I’m currently trying to get the Bayesian filter included in NethServer’s mail server to work. Here are some notes and questions:

  • First of all, the documentation should state somewhere that SpamAssassin’s Bayes database needs to learn at least 200 spams AND 200 hams before it starts filtering.

  • To check the status of the Bayes database, switch to the amavis user (su -- amavis), then run spamassassin -D --lint and look for the bayes-related entries:

Dec  9 14:07:37.639 [30801] dbg: bayes: found bayes db version 3
Dec  9 14:07:37.640 [30801] dbg: bayes: DB journal sync: last sync: 0
Dec  9 14:07:37.640 [30801] dbg: bayes: not available for scanning, only 4 ham(s) in bayes DB < 200
Dec  9 14:07:37.640 [30801] dbg: bayes: untie-ing 
  • While training for spam is not too difficult (just mark 200 spams using the “mark as spam” function of your mail client; copying a mail into the junk folder is not possible), making the spam filter learn 200 hams looks more difficult, since (according to the documentation) the only way to mark a mail as ham is to move it out of the junk folder. I seriously doubt that any end user will accept 200 hams landing in their spam box without yelling at the sysadmin.

  • One could try to copy 200 hams from the inbox to the spam folder and then move them back to the inbox, but that’s counterintuitive, and I’m not sure it would work anyway (the mail would be learned as spam, then as ham).
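
For the status check described above, the debug output can be cut down to the Bayes lines only. This is just a sketch: bayes_lines is a helper name made up for this example, and it assumes the debug lines keep the "dbg: bayes:" prefix shown in the excerpt.

```shell
# bayes_lines: hypothetical helper that keeps only the Bayes debug
# lines from whatever is piped into it.
bayes_lines() {
    grep 'dbg: bayes:'
}

# Usage, as the amavis user. The debug output is written to stderr,
# hence the 2>&1 redirection:
#   spamassassin -D --lint 2>&1 | bayes_lines
```

If the database is not ready yet, the "not available for scanning" line should be among the few lines left.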

My proposal: configure the INBOX folder as ham, plain and simple. That’s how Fastmail does it. They even provide a way for the user to decide which folders should be learned as spam or not spam.

Is it possible?

Thanks for helping and sharing your thoughts.

Matthieu


(Matthieu Gaillet) #2

Updating.

I ended up making spamassassin manually learn some ham in order to initialise the database.

sa-learn --progress --no-sync --ham /var/lib/nethserver/vmail/_mailbox@domain.tld_/Maildir/_somehamfolder_/cur/ --dbpath /var/spool/amavisd/.spamassassin/

First I did that on the INBOX folder. It took a long time, and then I noticed timeouts in maillog when mail was received by postfix.

Therefore I simply took the last 200 mails from my INBOX, copied them into a HAM folder and let spamassassin do its thing on this folder. It works perfectly.
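
Picking the most recent mails can be scripted. The following is a rough sketch, not what I actually ran: newest_mails is a made-up helper name, /tmp/ham-sample and the Maildir path are placeholders, and it relies on GNU find and xargs.

```shell
# newest_mails DIR N: print the paths of the N most recently modified
# files in DIR, newest first, one per line (hypothetical helper).
newest_mails() {
    find "$1" -maxdepth 1 -type f -printf '%T@ %p\n' \
        | LC_ALL=C sort -rn \
        | head -n "$2" \
        | cut -d' ' -f2-
}

# Usage sketch (placeholder paths, adjust to your own mailbox):
#   mkdir -p /tmp/ham-sample
#   newest_mails /path/to/Maildir/cur 200 | xargs -d '\n' cp -t /tmp/ham-sample
#   sa-learn --progress --no-sync --ham /tmp/ham-sample --dbpath /var/spool/amavisd/.spamassassin/
```

Working on a small copy keeps the learning run short, which is what avoided the timeouts for me.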

The results are immediately noticeable: the filtering looks much more effective.

I believe there should be a “Bayesian filter” section in the mail filter tab in the GUI that helps a user do this without too much hassle.


(Stéphane de Labrusse) #3

maybe you would like to add something there, it could help others
http://docs.nethserver.org/en/latest/mail.html


(Matthieu Gaillet) #4

Oh, I didn’t notice that the doc is editable! I’ll surely do that, thanks Steph.

Matthieu


(Stéphane de Labrusse) #5

Yep, you can amend the documentation with a pull request on GitHub.


(Filippo Carletti) #6

Rspamd will have the interface.


(Markus Neuberger) #7

Have a look at rspamd here:


(Matthieu Gaillet) #8

:open_mouth: ! GREAT I’ll give it a try.


(Michael Kicks) #9

Please forgive my laziness… Are the Bayesian filter and how to use it clearly described in the documentation?
Maybe some procedures for “enable spam classification” from Webtop could be useful…


(Matthieu Gaillet) #10

As a matter of fact, the prerequisites (200 spams AND 200 hams) aren’t written anywhere.
Marking 200 hams must be done from the command line. I’ll update it anyway.


(Stéphane de Labrusse) #11

rspamd also requires 200 hams before it trusts the Bayesian filter, but it learns on its own :heart_eyes:

It is really nice software


(Filippo Carletti) #12

SpamAssassin learns automatically too.


(Stéphane de Labrusse) #13

Yep, I saw the configuration in postfix… I’m looking into it.


(Matthieu Gaillet) #14

Only in the junk folder. Users have to manually mark 200 messages as ham. That means that in practice they have to mark 200 mails as spam and then mark them back as not spam. It is counterintuitive. But I updated the documentation :wink:


(Filippo Carletti) #15

No, auto learn works even for ham.
https://wiki.apache.org/spamassassin/AutolearningNotWorking


(Matthieu Gaillet) #16

I don’t think so, Filippo, at least not with NethServer’s implementation.

First, the doc explicitly says that a mail is learned as ham when it is moved out of the Junk folder.

Secondly, my system shows that the number of hams doesn’t increase on a live system:

sa-learn --dump magic 
0.000          0          3          0  non-token data: bayes db version
0.000          0        475          0  non-token data: nspam
0.000          0        211          0  non-token data: nham

211 was the exact number of hams I made it learn manually some days ago.
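
To follow these counters over a few days and see whether auto-learning ever raises nham, the dump can be reduced to one line. A small sketch only: bayes_counts is a made-up helper name, and it assumes the column layout of the dump shown above.

```shell
# bayes_counts: read `sa-learn --dump magic` output on stdin and print
# the learned spam and ham counts on a single line (hypothetical helper).
bayes_counts() {
    awk '
        $NF == "nspam" { spam = $3 }   # third column holds the counter
        $NF == "nham"  { ham  = $3 }
        END { printf "nspam=%d nham=%d\n", spam, ham }
    '
}

# Usage:
#   sa-learn --dump magic | bayes_counts
```

With the dump above it prints nspam=475 nham=211. If nham stays the same while mail keeps arriving, ham is not being auto-learned.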