PostConf | Docs | Bayes Tips and Tricks


The PostConf GUI makes it easy to train SpamAssassin's Bayes DB with
front-ends to postcat, postsuper and sa-learn.  The trouble is that it is
just as easy to corrupt a Bayesian DB and actually cause the server to
accept more spam.  How can you best take advantage of this powerful
anti-spam tool?  Here are a few hard-learned Bayes tips and tricks.

    * Never feed an entire folder to sa-learn.  Though this contradicts
    common practice, Bayes pattern recognition actually works better with
    fewer messages of higher quality.

    * Manually review the entire header AND body of every message submitted
    to sa-learn.  This is the ONLY way to avoid a poisoned Bayes DB.

    Note that seemingly harmless messages can carry Bayes-poisoning text,
    often in hidden HTML formatting tags, many pages into the body.  If a
    spam email seems too long, truncate it and sa-learn only the first
    80 to 100 lines.

    * Don't feed attachments to a Bayes DB.  Even with good decoders,
    SpamAssassin is too easily tricked into learning Bayes poison from
    base64, image, PDF and other encoded content.

    * If an email isn't obviously spam (or obviously non-spam), don't feed
    it to sa-learn.  Just as giving a puppy mixed messages doesn't help it
    become a well-behaved dog, Bayesian pattern recognition doesn't work
    well with ambiguous input.

    * We don't recommend allowing end users to train either their own or
    a site-wide Bayes database.  Even email administrators who know a lot
    about filtering spam often have difficulty training Bayes DBs.  End
    users with no training, no time and no discipline can only poison a
    Bayes DB and, in doing so, lower the effectiveness of other spam
    filters.

    * Disable AWL (auto-whitelisting, use_auto_whitelist 0) or set
    auto_whitelist_factor to 0.1 or 0.2, and disable auto-learning
    entirely (bayes_auto_learn 0), at least until these promising
    technologies become more mature.
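
    The truncation advice in the note above (learning only the first 80
    to 100 lines of an overly long spam message) could be sketched roughly
    as follows.  This is a minimal illustration, not part of PostConf:
    the function name truncate_message and the 100-line default are
    assumptions chosen for the example, and the limit counts header lines
    as well as body lines.

```python
def truncate_message(raw: str, max_lines: int = 100) -> str:
    """Return only the first max_lines lines of a raw email message.

    Hypothetical helper: the limit applies to the whole message
    (headers plus body), and line endings are preserved as-is.
    """
    lines = raw.splitlines(keepends=True)
    return "".join(lines[:max_lines])


if __name__ == "__main__":
    # Build a fake 250-line message and keep only the first 100 lines.
    sample = "".join(f"line {i}\n" for i in range(250))
    shortened = truncate_message(sample)
    print(len(shortened.splitlines()))
```

    The shortened text would still need the manual header and body review
    described above before being saved to a file and passed to
    sa-learn --spam.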

Post-Queue Actions:

    * Never bounce an email that was received and tagged as spam.  Such
    bounces are known as backscatter and are themselves spam, even if the
    message body is truncated.  Since pattern recognition takes time and
    can rarely be completed before the delivery process times out, you
    must accept and DISCARD messages that pass RBLs and simple content
    checks but subsequently fail Bayes and other content analyses.

Administrative:

    Since it is always possible to accidentally poison a Bayesian
    database, systems administrators should keep daily backups of all
    Bayes files maintained by SpamAssassin.  If (or rather when)
    message headers start to indicate poison (BAYES_40 and below
    in messages that are clearly spam), roll back to an earlier DB.
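
    The poison check described above could be automated along these
    lines.  This is only a sketch under stated assumptions: it expects
    the text of an X-Spam-Status style header that lists rule names such
    as BAYES_40 after "tests=", and the helper names bayes_bucket and
    looks_poisoned are hypothetical.  It should only be applied to
    messages a human has already confirmed as spam.

```python
import re
from typing import Optional

# Matches SpamAssassin Bayes rule names such as BAYES_00 or BAYES_99.
BAYES_RE = re.compile(r"\bBAYES_(\d+)\b")


def bayes_bucket(spam_status: str) -> Optional[int]:
    """Return the numeric part of the first BAYES_nn rule, or None."""
    match = BAYES_RE.search(spam_status)
    return int(match.group(1)) if match else None


def looks_poisoned(spam_status: str, threshold: int = 40) -> bool:
    """True if a known-spam message hit BAYES_40 or a lower bucket.

    A message with no BAYES_nn rule at all is not counted as poisoned.
    """
    bucket = bayes_bucket(spam_status)
    return bucket is not None and bucket <= threshold


if __name__ == "__main__":
    # Illustrative header value, not taken from a real message.
    header = "Yes, score=6.1 required=5.0 tests=BAYES_40,HTML_MESSAGE"
    print(looks_poisoned(header))
```

    If several confirmed spam messages in a row are flagged this way,
    that is the signal to restore one of the daily backups.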

