Bayesian Spam Filtering

The SpamAssassin software includes a "Bayesian" probability classifier: a database that "learns" from SpamAssassin's decisions about what is spam and what is legitimate email (often referred to as "ham").
This database of word patterns is then used to improve the accuracy of subsequent decisions.

If the spam score is very high or very low, SpamAssassin will automatically run the message through sa-learn, and this will be reported in the X-Spam-Status: email header as autolearn=spam or autolearn=ham. If SpamAssassin gets it wrong, you can correct it by updating the status manually.

Sometimes, however, SpamAssassin does not have enough information to decide which category a specific email belongs to. Such messages are flagged as autolearn=no in the X-Spam-Status: header, and in those cases you might want to teach it what to do.
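
To see which of these cases applies to a saved message, you can read the autolearn value out of its X-Spam-Status: header. The following is a minimal Python sketch, not part of SpamAssassin itself. The function name autolearn_status and the sample message (including its score, test name and version number) are made up for illustration.

```python
import re
from email import message_from_string


def autolearn_status(raw_message):
    """Return the autolearn value from X-Spam-Status (e.g. 'spam', 'ham', 'no').

    Returns None if the header or the autolearn field is missing.
    """
    msg = message_from_string(raw_message)
    header = msg.get("X-Spam-Status")
    if header is None:
        return None
    # The header may be folded over several lines; the search spans all of them.
    match = re.search(r"autolearn=([\w-]+)", header)
    return match.group(1) if match else None


# Hypothetical sample message, for illustration only.
sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00\n"
    "\tautolearn=ham version=3.1.0\n"
    "\n"
    "Body text.\n"
)

print(autolearn_status(sample))
```

A result of "no" would mark the message as a candidate for the manual teaching procedure.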

To feed the Bayesian learner, put a copy of the message(s) you want it to learn from, including all headers, into either the /usr/spool/mail/spam or the /usr/spool/mail/nonspam directory, using a filename equal to your own username.

The next time you receive email (whether ham or spam), the learner will pick up the message(s) you copied there, and delete them after processing.
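
The copy step described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name submit_for_learning is hypothetical, and appending to the file (so that several messages can be queued before the learner runs) is an assumption, since the text above does not say how multiple messages should be stored in one file. The directory names come from the procedure above.

```python
import getpass
from pathlib import Path

# Spool location taken from the procedure described above.
SPOOL_ROOT = Path("/usr/spool/mail")


def submit_for_learning(message_file, category, spool_root=SPOOL_ROOT, username=None):
    """Copy a saved message (with all headers) into the spam or nonspam spool.

    The target file is named after the user, as the procedure requires.
    ASSUMPTION: the message is appended, so earlier queued messages are kept.
    """
    if category not in ("spam", "nonspam"):
        raise ValueError("category must be 'spam' or 'nonspam'")
    username = username or getpass.getuser()
    target = Path(spool_root) / category / username
    data = Path(message_file).read_bytes()
    with open(target, "ab") as out:
        out.write(data)
        # Make sure the message ends with a newline before any later append.
        if not data.endswith(b"\n"):
            out.write(b"\n")
    return target
```

For example, submit_for_learning("missed-spam.eml", "spam") would queue the hypothetical file missed-spam.eml as spam under your own username.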

References:

Spam Control at Nyx


Casper Maarbjerg, 2006-04-06