
How the Bayesian spam filter works

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event. More information about the mathematical basis of Bayesian filtering is available in Bayesian Parameter Estimation (http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html) and An Introduction to Bayesian Networks and their Contemporary Applications (http://www.niedermayer.ca/papers/bayesian/bayes.html).

This same technique can be used to classify spam. If some piece of text occurs often in spam but not in legitimate mail, then it is reasonable to assume that an email containing that text is probably spam.

Creating a tailor-made Bayesian word database

Before mail can be filtered using this method, the user needs to generate a database with words and tokens (such as the $ sign, IP addresses and domains, and so on), collected from a sample of spam mail and valid mail (referred to as "ham").

Figure 1 - Creating a word database for the filter

A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam as opposed to legitimate mail (ham). This is done by analyzing the users' outbound mail and known spam: all the words and tokens in both pools of mail are analyzed to generate the probability that a particular word points to the mail being spam.
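The counting step described above can be sketched as follows. This is a minimal illustration rather than the product's actual implementation: the tokenizer pattern, the function names (tokenize, build_counts) and the sample messages are all hypothetical, and each token is assumed to be counted once per message.

```python
import re
from collections import Counter

# Hypothetical tokenizer: matches the "$" sign, plain words, and dotted
# tokens such as IP addresses or domain names.
TOKEN_RE = re.compile(r"\$|[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")

def tokenize(text):
    """Return the set of distinct lowercase tokens found in one message."""
    return {token.lower() for token in TOKEN_RE.findall(text)}

def build_counts(messages):
    """Count, for each token, how many messages in the pool contain it."""
    counts = Counter()
    for message in messages:
        # tokenize() returns a set, so each token adds 1 per message.
        counts.update(tokenize(message))
    return counts

# Illustrative sample pools (made up for this sketch).
spam_pool = [
    "Cheap mortgage rates, save $ now",
    "Lowest mortgage offer from 10.0.0.1",
]
ham_pool = [
    "Meeting notes attached",
    "Quarterly report from example.com",
]

spam_counts = build_counts(spam_pool)
ham_counts = build_counts(ham_pool)
```

The two counters, together with the size of each pool, are all that is needed to derive a per-word probability.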

This word probability is calculated as follows: If the word "mortgage" occurs in 400 of 3,000 spam mails and in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000] divided by [5/300 + 400/3000]).
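The arithmetic in this example can be checked with a short sketch. The function name spam_probability and the neutral value of 0.5 for a word that appears in neither pool are assumptions made for illustration; the formula itself is the one given in the text.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Probability that a word points to spam, from its rate in each pool."""
    spam_rate = spam_hits / spam_total   # e.g. 400 / 3000
    ham_rate = ham_hits / ham_total      # e.g. 5 / 300
    if spam_rate + ham_rate == 0:
        # Assumption: a word seen in neither pool is treated as neutral.
        return 0.5
    return spam_rate / (spam_rate + ham_rate)

# The "mortgage" example from the text.
p = spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```

Because the calculation uses rates rather than raw counts, the two pools do not need to be the same size.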

Creating the ham database (tailored to your company)

It's important to note that the analysis of ham mail is performed on the company's mail, and is therefore tailored to that particular company. For example, a financial institution might use the word "mortgage" many times over and would get a lot of false positives if using a general anti-spam rule set. On the other hand, the Bayesian filter, if tailored to your company through an initial training period, takes note of the company's valid outbound mail (and recognizes "mortgage" as being frequently used in legitimate messages), and therefore has a much better spam detection rate and a far lower false positive rate.

Note that some anti-spam software with very basic Bayesian capabilities, such as the Outlook spam filter or the Internet Message Filter in Exchange Server, does not create a tailored ham data file for your company, but ships a standard ham data file with the installation. Although this method does not require an initial learning period, it has two major flaws:

  1. The ham data file is publicly available and can therefore be hacked by professional spammers and bypassed. If the ham data file is unique to your company, then hacking the ham data file is useless. For example, there are hacks available to bypass the Microsoft Outlook 2003 or Exchange Server spam filter. For more information about this, see Microsoft Outlook 2003 Spam Filter: Under the hood.
  2. The ham data file is a general one that is not tailored to your company, so it cannot be as effective and you will suffer from noticeably more false positives.

Creating the spam database

Besides ham mail, the Bayesian filter also relies on a spam data file. This spam data file must include a large sample of known spam and must be constantly updated with the latest spam by the anti-spam software. This will ensure that the Bayesian filter is aware of the latest spam tricks, resulting in a high spam detection rate (note: this is achieved once the required initial two-week learning period is over).

How the actual filtering is done

Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use.

When a new mail arrives, it is broken down into words and the most relevant words - i.e., those that are most significant in identifying whether the mail is spam or not - are singled out. From these words, the Bayesian filter calculates the probability of the new message being spam or not. If the probability is greater than a threshold, say 0.9, then the message is classified as spam.
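The filtering step can be sketched as follows. The text does not say how the individual word probabilities are combined, so this sketch assumes the naive Bayes combination popularized by Paul Graham's "A Plan for Spam": pick the words whose probabilities lie farthest from the neutral value 0.5, then combine them. The function name classify, the limit of 15 words, the value 0.4 for unknown words, the clamping bounds and the sample probabilities are all illustrative assumptions.

```python
import math

def classify(tokens, word_probs, threshold=0.9, max_tokens=15, unknown=0.4):
    """Return (score, is_spam) for one message.

    tokens     -- words and tokens extracted from the message
    word_probs -- mapping of word to its spam probability
    unknown    -- assumed probability for words not in the database
    """
    probs = [word_probs.get(token, unknown) for token in set(tokens)]
    # "Most relevant" is taken to mean farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    # Clamp so that a single word cannot force the score to exactly 0 or 1.
    top = [min(max(p, 0.01), 0.99) for p in probs[:max_tokens]]
    spam_product = math.prod(top)
    ham_product = math.prod(1 - p for p in top)
    score = spam_product / (spam_product + ham_product)
    return score, score > threshold

# Illustrative word probabilities (made up, apart from "mortgage").
word_probs = {"mortgage": 0.8889, "free": 0.95, "meeting": 0.05, "report": 0.10}

spam_score, spam_flag = classify(["free", "mortgage", "offer"], word_probs)
ham_score, ham_flag = classify(["meeting", "report"], word_probs)
```

With these sample values the first message scores above the 0.9 threshold and the second scores well below it. Raising the threshold trades a lower false positive rate for a lower detection rate.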

This Bayesian approach to spam is highly effective - a May 2003 BBC article reported that spam detection rates of over 99.7% can be achieved with a very low number of false positives.

