Spamassassin Bayesian filter issues

pagaille · December 9, 2017, 1:05pm

NethServer Version: 7.4

Hi,

I’m currently trying to get the Bayesian filter included in NethServer’s mail server to work. Here are some notes and questions:

First of all, the documentation should state somewhere that SpamAssassin’s Bayes database needs to learn at least 200 spams AND 200 hams before it starts filtering.
To check the status of the Bayes database, log in as amavis (su -- amavis), then run spamassassin -D --lint and look for the bayes-related entries:

Dec  9 14:07:37.639 [30801] dbg: bayes: found bayes db version 3
Dec  9 14:07:37.640 [30801] dbg: bayes: DB journal sync: last sync: 0
Dec  9 14:07:37.640 [30801] dbg: bayes: not available for scanning, only 4 ham(s) in bayes DB < 200
Dec  9 14:07:37.640 [30801] dbg: bayes: untie-ing
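The debug output is long, so it can help to keep only the Bayes entries. A minimal sketch, assuming the debug lines are written to stderr and carry the "dbg: bayes:" prefix shown above; bayes_lines is only a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: keep only the Bayes-related debug lines read from stdin.
bayes_lines() {
    grep 'dbg: bayes:'
}

# Usage (as the amavis user). 2>&1 is there on the assumption that
# the debug output goes to stderr:
#   spamassassin -D --lint 2>&1 | bayes_lines
```

If the database is not ready yet, the "not available for scanning" line should show up in the filtered output.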

While training for spam is not too difficult ~~(just copy 200 of your own spams into the spam folder through the IMAP server)~~ (just mark 200 spams using the “mark as spam” function of your mail client; copying a mail into the junk folder is not possible), making the spam filter learn 200 hams looks harder, since (according to the documentation) the only way to mark a mail as ham is to move it out of the junk folder. I seriously doubt that any end user will accept 200 hams landing in their spam box without yelling at the sysadmin.
One could try copying 200 hams from the inbox to the spam folder and then moving them back to the inbox, but that’s counter-intuitive, and I’m not sure it would work anyway (the mail would be learned as spam, then as ham).

My proposal: configure the INBOX folder as ham, plain and simple. That’s how Fastmail does it. They even provide a way for the user to decide which folders should be learned as spam or not spam.

Is it possible?

Thanks for helping and sharing your thoughts.

Matthieu

pagaille · December 10, 2017, 11:33am

Updating.

I ended up making spamassassin manually learn some ham in order to initialise the database.

sa-learn --progress --no-sync --ham /var/lib/nethserver/vmail/_mailbox@domain.tld_/Maildir/_somehamfolder_/cur/ --dbpath /var/spool/amavisd/.spamassassin/

First I did that on the INBOX folder. It took a long time, and I then noticed timeouts in maillog whenever Postfix received mail.

So I simply took the last 200 mails from my INBOX, copied them into a HAM folder and let SpamAssassin do its thing on that folder. It works perfectly.
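For anyone who would rather do the copy from the shell than from a mail client, here is a rough sketch of "take the last 200 mails". It assumes GNU find, uses file modification time as a stand-in for "last", and the copy_newest name and the /tmp/ham-sample staging directory are purely hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: copy the COUNT most recently modified files
# from SRC into DEST. Assumes GNU find (-printf) and file names
# without newlines, which holds for ordinary Maildir message files.
copy_newest() {
    src=$1
    dest=$2
    count=$3
    mkdir -p "$dest"
    find "$src" -maxdepth 1 -type f -printf '%T@ %p\n' \
        | LC_ALL=C sort -rn \
        | head -n "$count" \
        | cut -d' ' -f2- \
        | while IFS= read -r msg; do
              cp -p "$msg" "$dest/"
          done
}

# Usage (paths are placeholders, adapt them to your mailbox):
#   copy_newest /path/to/Maildir/cur /tmp/ham-sample 200
#   sa-learn --progress --no-sync --ham /tmp/ham-sample --dbpath /var/spool/amavisd/.spamassassin/
```

Training on a small copy rather than on the whole INBOX keeps the run short, which is what avoided the timeouts here.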

The results are immediately noticeable: the filtering looks much more effective.

I believe there should be a “Bayesian filter” section in the mail filter tab of the GUI that helps a user do this without too much hassle.

stephdl · December 10, 2017, 6:22pm

maybe you would like to add something there, it could help others
http://docs.nethserver.org/en/latest/mail.html

pagaille · December 11, 2017, 9:51am

Oh, I didn’t notice that the doc is editable! I’ll surely do that, thanks Steph.

Matthieu

stephdl · December 11, 2017, 9:53am

Yep, you can amend the documentation with a pull request on GitHub.

filippo_carletti · December 11, 2017, 10:48am

Rspamd will have the interface.

mrmarkuz · December 11, 2017, 11:02am

Have a look at rspamd here:

pagaille · December 11, 2017, 11:11am

Great! I’ll give it a try.

pike · December 11, 2017, 12:43pm

Please forgive my laziness… Is the Bayesian filter, and how to use it, clearly described in the documentation?
Maybe some procedures for “enable spam classification” from WebTop could be useful…

pagaille · December 12, 2017, 7:05pm

As a matter of fact, the prerequisites (200 spams AND 200 hams) aren’t written anywhere.
Marking 200 hams has to be done from the command line. I’ll update the documentation anyway.

stephdl · December 13, 2017, 7:21am

rspamd also requires 200 hams before it trusts the Bayesian filter, but it learns on its own.

It is really a nice software

filippo_carletti · December 13, 2017, 10:31am

SpamAssassin learns automatically too.

stephdl · December 13, 2017, 10:33am

Yep, I saw the configuration in Postfix… I’m looking into it.

pagaille · December 13, 2017, 10:35am

Only in the junk folder. Users have to manually mark 200 messages as ham. In practice, that means they have to mark 200 mails as spam and then mark them back as not spam, which is counter-intuitive. But I updated the documentation.

filippo_carletti · December 13, 2017, 10:58am

No, auto learn works even for ham.
https://wiki.apache.org/spamassassin/AutolearningNotWorking

pagaille · December 13, 2017, 11:26am

I don’t think so, Filippo, at least not with NethServer’s implementation.

First, the doc explicitly says that a mail is learned as ham when it is moved out of the Junk folder.

Second, my system shows that the number of hams doesn’t increase on a live system:

sa-learn --dump magic 
0.000          0          3          0  non-token data: bayes db version
0.000          0        475          0  non-token data: nspam
0.000          0        211          0  non-token data: nham

211 was the exact number of hams I made it learn manually some days ago.
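To watch these counters over time without reading the dump by eye, the two values can be pulled out with awk. This is only a sketch: bayes_ready is a hypothetical helper name, it assumes the column layout shown above (the count in the third field), and the 200 minimum is the threshold discussed earlier in this thread:

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn --dump magic` output on stdin,
# print the spam/ham counters, and return success only when both
# have reached the 200-message minimum.
bayes_ready() {
    awk -v min=200 '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}

# Usage:
#   sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/ | bayes_ready
```

Run from cron or by hand a few days apart, an unchanged nham value would confirm that no ham is being learned automatically.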

pagaille · March 7, 2019, 4:53pm

Update for anybody looking for the same thing using rspamd (the new mail module):

The command rspamc learn_ham /var/lib/nethserver/maildir/user@domain/Maildir/cur

does the trick (learning INBOX as ham).

I still believe that this suggestion is relevant:

pike · March 7, 2019, 8:45pm

Well, the command
rspamc learn_ham /var/lib/nethserver/maildir/user@domain/Maildir/cur
should be added to the documentation as soon as possible to help with mail domain migration…

mrmarkuz · March 7, 2019, 8:55pm

I found that it’s already mentioned in the docs but it may be hard to find:

http://docs.nethserver.org/en/v7/rspamd.html#frequently-asked-questions

https://rspamd.com/doc/faq.html#how-can-i-learn-messages

What do you think?

pike · March 7, 2019, 9:00pm

Maybe a “Best Practice” page or a “checklist for a nice migration” would be the right way to summarize how to migrate or create a good mail server.
Unfortunately my English is not sleek and catchy enough to be concise, clear and effective.

This should include (IMVHO):

Data collection, to gather all the information needed
Steps to install, add modules, and configure data, users and aliases
Steps to migrate mailboxes and preferences (import from files, POP3/IMAP collection, IMAP transfer via client)
Steps to protect the server (antivirus check, fail2ban setup, password settings)
Steps to open the server to the real world (DNS, MX record, port forwarding)
Steps to improve reputation so as not to be marked as spam (TLS certificate, SPF, DKIM, DMARC)
Steps to improve spam detection

This should be a “scenario” detailed enough to give conscientious and sufficiently skilled sysadmins (even junior ones) the full toolbox to adapt it to their environment.
It should also be concise enough to fit on four printed A4 sheets (plus one for data collection) to kickstart any installation.
(The document should be independent of any groupware or webmail, but should also suggest looking at both before going in one direction or another.)