
GFI MailEssentials 'Bayesian' anti-spam filter

GFI MailEssentials uses Bayesian filtering technology to achieve a very high spam detection rate. Bayesian filtering is an adaptive, 'artificial intelligence' technique that is much harder for spammers to circumvent.

How the Bayesian spam filter works

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event. (More information about the mathematical basis of Bayesian filtering is available at http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html and http://www.niedermayer.ca/papers/bayesian/bayes.html.)

This same technique can be used to classify spam. If some piece of text occurs often in spam but not in legitimate mail, then the next time the same piece of text is encountered in a new email, it would be reasonable to assume that this email is probably spam.

Creating a tailor-made Bayesian word database

Before mail can be filtered using this method, one needs to generate a history for each word or token (such as the $ sign, IP addresses and domains, and so on). A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam as opposed to legitimate mail. This is done by analyzing the users' outbound mail and by analyzing known spam: all the words and tokens in both pools of mail are analyzed to generate the probability that a particular word points to a mail being spam.

For example, if the word "mortgage" occurs in 400 of 3,000 spam mails and 5 out of 300 legitimate emails, its spam probability is:

(400/3000) / (5/300 + 400/3000) = 0.8889.
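
The calculation above can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic shown here, not the product's actual implementation; the function name and the neutral value of 0.5 for a word that has never been seen are assumptions made for illustration.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Share of a word's relative frequency that comes from spam.

    spam_hits / spam_total: spam mails containing the word / all spam mails.
    ham_hits / ham_total:   legitimate mails containing the word / all legitimate mails.
    """
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption: a word seen in neither pool is treated as neutral.
        return 0.5
    return spam_freq / (spam_freq + ham_freq)


# "mortgage": 400 of 3,000 spam mails, 5 of 300 legitimate mails
p = spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```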

It's important to note that this analysis is performed on the company's mail, and is therefore tailored to that particular company. For example, a financial institution might use the word "mortgage" many times and would get a lot of false positives if using a general anti-spam rule set. The Bayesian filter, on the other hand, takes note of the company's valid outbound mail (and recognizes "mortgage" as being frequently used in legitimate messages), and therefore has a much better spam detection rate and a far lower false positive rate.

Once the word probabilities have been calculated, the filter is ready for use.

Note that the Bayesian filter is not static - it is constantly updated based on new spam and valid emails. The filter's performance will therefore improve over time and - more importantly - will adapt to changes in spam tactics and in the kind of emails written by users within the organization.
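
One way to picture this continuous updating is a word database that keeps simple counts and recomputes probabilities on demand. The sketch below is a hypothetical illustration under that assumption; the class name WordDatabase, its methods and the whitespace tokenizer are invented for this example and do not describe how GFI MailEssentials stores its data.

```python
from collections import Counter


class WordDatabase:
    """Hypothetical word history: per-word mail counts for spam and valid mail."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_total = 0
        self.ham_total = 0

    def learn(self, text, is_spam):
        # Assumption: split on whitespace and count each word once per mail.
        words = set(text.lower().split())
        if is_spam:
            self.spam_counts.update(words)
            self.spam_total += 1
        else:
            self.ham_counts.update(words)
            self.ham_total += 1

    def probability(self, word):
        # Same ratio as in the "mortgage" example; 0.5 when there is no evidence.
        if self.spam_total == 0 or self.ham_total == 0:
            return 0.5
        spam_freq = self.spam_counts[word] / self.spam_total
        ham_freq = self.ham_counts[word] / self.ham_total
        if spam_freq + ham_freq == 0:
            return 0.5
        return spam_freq / (spam_freq + ham_freq)


db = WordDatabase()
db.learn("free cash now", is_spam=True)
db.learn("mortgage rates meeting", is_spam=False)
before = db.probability("mortgage")

# A new spam mail arrives and is learned; the word's probability shifts.
db.learn("cheap mortgage offer", is_spam=True)
after = db.probability("mortgage")
print(before, after)
```

Because the probabilities are derived from the counts each time, every newly learned mail immediately influences later classifications.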

Finding spam based on the Bayesian filter

When a new mail arrives, it is broken down into words and the most relevant words - i.e., those that are most significant in identifying whether the mail is spam or not - are singled out. From these words the Bayesian filter calculates the probability of the new message being spam or not. If the probability is greater than a threshold, say 0.9, then the message is classified as spam.
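
The steps above can be sketched as follows. The text does not state how the individual word probabilities are combined, so this example assumes a common naive Bayes style combination; the function name, the limit of 15 words, the clamping range and the sample probabilities are likewise assumptions for illustration only.

```python
import math


def classify(word_probs, message, top_n=15, threshold=0.9, unknown=0.5):
    """Return (score, is_spam) for a message.

    word_probs maps a word to its spam probability. Words not in the
    database get the neutral value `unknown` (an assumption).
    """
    words = set(message.lower().split())
    # Clamp to avoid probabilities of exactly 0 or 1 dominating the product.
    probs = [min(0.99, max(0.01, word_probs.get(w, unknown))) for w in words]
    # "Most relevant" is taken here to mean: furthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    interesting = probs[:top_n]
    spam_product = math.prod(interesting)
    ham_product = math.prod(1 - p for p in interesting)
    score = spam_product / (spam_product + ham_product)
    return score, score > threshold


# Hypothetical word probabilities
word_probs = {"mortgage": 0.8889, "free": 0.95, "cash": 0.9, "meeting": 0.1}

spam_score, spam_flag = classify(word_probs, "free cash mortgage")
ham_score, ham_flag = classify(word_probs, "meeting agenda")
print(spam_flag, ham_flag)  # True False
```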

This Bayesian approach to spam is highly effective - a May 2003 BBC article reported that spam detection rates of over 99.7% can be achieved with a very low number of false positives.

Why Bayesian filtering is better than keyword checking in detecting spam

1. The Bayesian method takes the whole message into account - It recognizes keywords that identify spam, but it also recognizes words that denote valid mail. For example, not every email that contains the words "free" and "cash" is spam. The advantage of the Bayesian method is that it considers the most interesting words (as defined by their deviation from the mean) and comes up with a probability that a message is spam. The Bayesian method would find the words "cash" and "free" interesting, but it would also recognize, for instance, the name of the business contact who sent the message and thus classify the message as legitimate; it allows words to "balance" each other out. In other words, Bayesian filtering is a much more intelligent approach because it examines all aspects of a message, as opposed to keyword checking, which classifies a mail as spam on the basis of a single word.

2. A Bayesian filter is constantly self-adapting - By learning from new spam and new valid outbound mails, the Bayesian filter evolves and adapts to new spam techniques. For example, when spammers started using "f-r-e-e" instead of "free" they succeeded in evading keyword checking until "f-r-e-e" was also included in the keyword database. On the other hand, the Bayesian filter automatically notices such tactics; in fact if the word "f-r-e-e" is found, it is an even better spam indicator. Another example would be using the word "5ex" instead of "Sex".

3. The Bayesian technique is sensitive to the user - To be successful and have their messages delivered, spammers have to send emails that are not caught by the intended victims' personalized filters. Because the Bayesian method takes the company's email profile into account, it detects spam with greater ease: Spammers would need to know the company's email profile to be able to circumvent it. Since spam mails have their own vocabulary and character, the Bayesian filter can catch them; however, it is not easy for spammers to change their sales pitch to take an organization's email profile into account; after all, there are only so many ways to sell Viagra.

4. The Bayesian method is multi-lingual and international - A Bayesian anti-spam filter, being adaptive, can be used for any language required. Most keyword lists are available in English only and are therefore quite useless in non-English-speaking regions. The Bayesian filter also takes into account certain language deviations and the diverse usage of certain words in different areas, even if the same language is spoken. This intelligence enables such a filter to catch more spam.

5. A Bayesian filter is hard to trick as opposed to a keyword filter - An advanced spammer who wants to trick a Bayesian filter can either use fewer 'bad' words (i.e., words that usually indicate spam such as free, Viagra, etc.), or more words that generally indicate valid mail (such as a valid contact name, etc.). Doing the latter is impossible because the spammer would have to know the email profile of each recipient - and a spammer can never hope to gather this kind of information from every intended recipient. Using neutral words, for example the word "public", would not work since these are disregarded in the final analysis. Breaking up spam words (using "f-r-e-e" instead of "free") will just increase the chance of the message being spam, since a legitimate user will rarely write the word "free" as "f-r-e-e".
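
Two of the points above, words "balancing" each other out and broken-up words such as "f-r-e-e" being an even stronger indicator, can be illustrated with a small numeric sketch. All probabilities below are made-up values, and the combination formula is an assumed naive Bayes style rule rather than the product's documented method.

```python
import math


def combined_score(probs):
    """Combine per-word spam probabilities into one message score (assumed rule)."""
    spam = math.prod(probs)
    ham = math.prod(1 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical per-word probabilities
FREE, CASH = 0.95, 0.90
CONTACT_NAME = 0.01   # name of a known business contact, seen almost only in valid mail
FREE_BROKEN_UP = 0.99  # "f-r-e-e", seen almost only in spam

keyword_only = combined_score([FREE, CASH])
with_contact = combined_score([FREE, CASH, CONTACT_NAME])
obfuscated = combined_score([FREE_BROKEN_UP, CASH])

print(keyword_only > 0.9)         # True: "free" and "cash" alone look like spam
print(with_contact > 0.9)         # False: the contact name pulls the score down
print(obfuscated > keyword_only)  # True: "f-r-e-e" scores higher than "free"
```

A keyword filter would treat the first two messages identically, whereas the combined score separates them.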

IMPORTANT: Don't judge GFI MailEssentials' spam detection rate until you have allowed the Bayesian filter to run for at least 1 week! GFI MailEssentials can achieve the highest detection rate compared to other anti-spam solutions because it adapts specifically to your mail. Be patient and wait at least a week before you judge it!

