The PostConf GUI makes it easy to train SpamAssassin's Bayes DB with
front-ends to postcat, postsuper and sa-learn. The trouble is that it is also
easy to corrupt a Bayesian DB and actually cause the server to receive more
spam. How can you best take advantage of this powerful anti-spam tool?
Here are a few hard-learned Bayes tips and tricks.
* Never feed an entire folder to sa-learn. Though this contradicts
common practice, Bayes pattern recognition actually works better with
fewer messages of higher quality.
* Manually review the entire header AND body of every message submitted
to sa-learn. This is the ONLY way to avoid a poisoned Bayes DB.
Note that seemingly harmless messages can carry Bayes-poisoning text,
often in hidden HTML formatting tags, many pages into the body. If a
spam email seems too long, truncate it and sa-learn only the first
80 to 100 lines.
* Don't feed attachments to a Bayes DB. Even with good decoders,
SpamAssassin is too easily tricked into learning Bayes poison from
base64, image, PDF and other encoded content.
* If an email isn't obviously spam (or obviously non-spam), don't feed
it to sa-learn. Just as giving a puppy mixed messages doesn't help it
become a well-behaved dog, Bayesian pattern recognition doesn't work
well with ambiguous input.
* We don't recommend allowing end users to train either their own or
a site-wide Bayes database. Even email administrators who know a lot
about filtering spam often have difficulty training Bayes DBs. End
users with no training, no time and no discipline can only poison a
Bayes DB and in doing so lower the effectiveness of other spam
filters.
* Either disable AWL (auto-whitelisting) or set auto_whitelist_factor
to 0.1 or 0.2, and disable auto-learning entirely, at least until
these promising technologies become more mature.
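
The truncation and attachment tips above could be applied before a
message ever reaches sa-learn. The following is a minimal Python sketch,
assuming each message is available as raw text with LF line endings. The
function names truncate_message and has_attachment and the 100-line
default are illustrative assumptions, not part of PostConf or
SpamAssassin.

```python
from email import message_from_string


def truncate_message(raw_message: str, max_body_lines: int = 100) -> str:
    """Keep all headers and at most max_body_lines lines of the body.

    Assumes LF line endings; the first blank line separates headers
    from body.
    """
    headers, separator, body = raw_message.partition("\n\n")
    kept = body.splitlines()[:max_body_lines]
    return headers + separator + "\n".join(kept) + ("\n" if kept else "")


def has_attachment(raw_message: str) -> bool:
    """Return True if any MIME part is non-text or marked as an attachment."""
    message = message_from_string(raw_message)
    for part in message.walk():
        if part.is_multipart():
            continue  # containers hold parts; they carry no content themselves
        if part.get_content_maintype() != "text":
            return True
        if part.get_content_disposition() == "attachment":
            return True
    return False
```

A message flagged by has_attachment would simply be left out of
training. The truncated text still needs a manual read of header and
body before it is passed to sa-learn; the sketch shortens that review
but does not replace it.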
Post-Queue Actions:
* Never bounce an email that was received and tagged as spam. Such a
bounce is known as backscatter and is itself spam, even if the message
body is truncated. Since pattern recognition takes time and can rarely
be completed before the delivery process times out, you must accept and
DISCARD messages that pass RBLs and simple content checks but
subsequently fail Bayes and other content analyses.
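
The rule above can be written down as a small decision function. This is
only a sketch in Python. The names Action and choose_action are
hypothetical, and rejecting during the SMTP session when RBLs or simple
checks fail is an assumption about how the earlier stage behaves. The
point it illustrates is that no branch ever produces a bounce.

```python
from enum import Enum


class Action(Enum):
    REJECT_AT_SMTP = "reject during the SMTP session"
    DISCARD = "accept, then silently discard"
    DELIVER = "accept and deliver"
    # Deliberately no BOUNCE member: a bounce after acceptance is backscatter.


def choose_action(passed_smtp_checks: bool, tagged_as_spam: bool) -> Action:
    """Pick what to do with a message without ever bouncing it."""
    if not passed_smtp_checks:
        # Assumed: RBL and simple content failures are refused before acceptance.
        return Action.REJECT_AT_SMTP
    if tagged_as_spam:
        # Accepted already, then failed Bayes or other content analysis.
        return Action.DISCARD
    return Action.DELIVER
```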
Administrative:
Since it is always possible to accidentally poison a Bayesian
database, systems administrators should keep daily backups of all
Bayes files maintained by SpamAssassin. If (or rather when)
message headers start to indicate poison (BAYES_40 and below
in messages that are clearly spam), roll back to an earlier DB.
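
Both the daily backup and the poison check can be scripted. The Python
sketch below rests on several assumptions: that the Bayes store is
file-based with files named bayes_*, that the spam status header lists
rule names such as BAYES_40 in plain text, and that the helper names
(bayes_bucket, indicates_poison, backup_bayes_files) and directory
layout are free to choose. None of these come from PostConf itself.

```python
import re
import shutil
from datetime import date
from pathlib import Path
from typing import Optional

# Matches SpamAssassin Bayes rule names such as BAYES_00, BAYES_40 or BAYES_99.
BAYES_RULE = re.compile(r"\bBAYES_(\d{2,3})\b")


def bayes_bucket(spam_status: str) -> Optional[int]:
    """Return the numeric part of the BAYES_NN rule in a header value, if any."""
    match = BAYES_RULE.search(spam_status)
    return int(match.group(1)) if match else None


def indicates_poison(spam_status: str, threshold: int = 40) -> bool:
    """True when a message already known to be spam scored BAYES_40 or below."""
    bucket = bayes_bucket(spam_status)
    return bucket is not None and bucket <= threshold


def backup_bayes_files(bayes_dir: Path, backup_root: Path,
                       today: Optional[date] = None) -> Path:
    """Copy every bayes_* file into backup_root/YYYY-MM-DD and return that path."""
    target = backup_root / (today or date.today()).isoformat()
    target.mkdir(parents=True, exist_ok=True)
    for path in sorted(bayes_dir.glob("bayes_*")):
        if path.is_file():
            shutil.copy2(path, target / path.name)
    return target
```

Run backup_bayes_files once a day (from cron, for example) and call
indicates_poison only on headers of messages a human has already judged
to be spam. Copying files that are being written can yield an
inconsistent snapshot, so sa-learn's own --backup and --restore options
may be the safer way to take and restore the copy.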