Ending Spam - Bayesian Approach : Part I
Recently, I stumbled onto an article that staged a debate between the Fisherian approach and the Bayesian approach to design inference. According to the author, both have drawbacks when dealing with events with small probabilities. I became curious about the Bayesian side of things. My work in the past few years has made me use the Fisherian approach in many forms. But what about the Bayesian world? Except for the famous Monty Hall problem and a few other general examples, I had not read or clearly understood how a Bayesian approach could be put to real use. Many years ago, reading “Hackers and Painters”, I did come across an entire chapter on applying Bayesian principles to fight spam, written elegantly by Paul Graham. However, at that point my motivation for the Bayesian approach was low. Maybe I never understood anything at all, beyond having a 10,000 ft view of this entire branch of statistics.
Why Bayesian now? My experiments with R, the open source software for statistics, were becoming rather boring. It was the same old wine in a new package, the same old commands with new syntax: the same old t-tests, hypothesis testing, regression, multivariate regression, logistic regression, multinomial models, GARCH, and so on. It was déjà vu for me. Yes, all these techniques are indeed interesting. In fact, I am tutoring a PhD student on using raw data for a thesis dissertation. Somehow, whenever I am dealing with the Fisherian world, I am inclined to believe that it is used as a sophisticated wrapper for our beliefs rather than as a meaningful way of validating them. I guess that is where Bayesian probability comes in. When I stumbled onto this book, I wanted to go through it in one sitting and understand, once and for all, why I should consider or ignore the Bayesian world. And so it happened that one afternoon I sat down with this book with only one thing on my mind: “What’s the practical, real-world application of Bayes?”
“Ending Spam” provided me with a solid answer. Let me try to summarize the key points of the book:
First, some history:
The world’s first spam message was sent in 1978 by a marketing manager at DEC, and it raised a furore on the then low-bandwidth ARPANET. It was followed by the College Fund spams, the Jesus spam :), the notorious couple Canter & Siegel, who became famous for writing a software program to spam, Jeff Slaton, the Spam King, and Floodgate (the first spamware). Anti-spammers fought a lot of individual, ineffective battles with blacklists, @abuse addresses and so on, but nothing worked. By 2002, spam had reached 40% of the total internet traffic. The solution was proving elusive.
Initial Tools:
Blacklists, and later centralized blacklists, were the first solutions to the spam problem. Email software gave the user the choice of blacklisting senders based on their source email address or on a set of specific words. This gave rise to a lot of false positives. However, the approach became popular because it was very easy to implement and customize. Maintenance of the blacklists was a big problem.
Heuristic filtering came next. Users connected to a centralized service which downloaded the mail from the user’s ISP, ran it through a set of heuristics and lookups, and acted as a filter. This was effective for some time, until spammers learnt to get around the rules. A heuristic filter also applied one universal scoring scheme to everyone: each message got a score, and a set threshold decided whether it was classified as spam or ham (a word generally used for a legitimate message). There were a lot of maintenance headaches, as the server lists had to be updated regularly.
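To make the score-and-threshold idea concrete, here is a minimal sketch in Python. The rules, weights and threshold are all made up by me for illustration; they are not taken from the book or from any real filter.

```python
import re

# Hypothetical rules and weights, invented purely for illustration.
RULES = [
    (re.compile(r"free money", re.IGNORECASE), 2.5),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
    (re.compile(r"[A-Z]{10,}"), 1.0),  # a run of ten or more capital letters
]

# Assumed cut-off: the same threshold is applied to every user's mail.
THRESHOLD = 3.0


def heuristic_score(message: str) -> float:
    """Add up the weight of every rule whose pattern appears in the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def classify(message: str) -> str:
    """Label the message 'spam' when its score reaches the threshold."""
    return "spam" if heuristic_score(message) >= THRESHOLD else "ham"
```

The weakness described above is visible even in this toy: a spammer who writes "fr3e m0ney" slips past the first rule, and someone then has to add a new rule by hand.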
Whitelisting was next. Only approved senders can send you an email. This cuts spam completely, but it is not a good solution because it also cuts off the rest of the world. Forgeries can plague this system as well.
Challenge-response systems made the senders do the job of spam filtering. This was news to me. Since I have never received this kind of mail, it was surprising to learn that such systems were used to fight spam. Look at the mail below; I bet you will be surprised too:
Greetings,
You just sent an email to my spam-free email service. Because this is the first time you have sent to this email account, please confirm yourself so you’ll be recognized when you send to me in the future. It’s easy. To prove your message comes from a human and not a computer, click on the link below: http://[Some Web Link] . Attached is your original message that is in my pending folder, waiting for your quick authentication.
Throttling was probably one of the sanest means of attacking spam. The philosophy behind throttling is that a legitimate mail distribution would never need to send more than a certain threshold of traffic to any particular network. For example, a legitimate mailing list may send out huge quantities of mail, but the messages go to different recipients on different networks. At most, only a handful of the outgoing messages would be directed to any one network. A spammer, on the other hand, may have scripts designed to bombard a network with spam by using a dictionary attack, in which every possible username is generated.
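Here is a rough sketch of how such a throttle could work, assuming a sliding time window and treating the domain part of the recipient address as "the network". The class name, the limits and that definition of a network are my own assumptions, not the book's.

```python
from collections import defaultdict, deque


class NetworkThrottle:
    """Limit how many messages one sender may direct at one network per window.

    Hypothetical sketch: a "network" is approximated by the recipient's domain.
    """

    def __init__(self, max_messages: int, window_seconds: float):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        # (sender, network) -> timestamps of recently accepted messages
        self._history = defaultdict(deque)

    def allow(self, sender: str, recipient: str, now: float) -> bool:
        network = recipient.rsplit("@", 1)[-1].lower()
        stamps = self._history[(sender, network)]
        # Forget messages that have fallen out of the sliding window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_messages:
            return False  # over the limit for this network: throttle
        stamps.append(now)
        return True
```

A mailing list that sends one message each to hundreds of different domains never trips the limit, while a dictionary attack (`aaa@victim.example`, `aab@victim.example`, and so on) hits it after the first few messages.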
Collaborative filtering, address obfuscation and litigation were a few other methods used to cut down spam. However, spam continued.
Language classification: use the spammer’s weapon against the spammer. One of the things a spam filter has at its disposal is the content of the message. A language classification filter is a machine learning tool that works as follows: it is first trained on a corpus, then it tokenizes each incoming mail and assigns a probability to each token in the message. Finally, it computes a spam score from the joint probabilities and classifies the entire message as spam or not spam. The most wonderful thing about this approach is that the user is in control: the user decides the corpus and provides feedback to the filter, so the filter works on a customized basis. This process, in which the initial a priori probabilities are revised and new posterior probabilities are computed, is what makes the approach Bayesian.
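The train, tokenize, score loop can be sketched in a few lines of Python. This is only my own minimal illustration, not the algorithm from the book: the tokenizer, the 0.01 to 0.99 clamping, the neutral 0.5 for unseen tokens and the naive (independence-assuming) way of combining token probabilities are all assumptions on my part.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into a set of lowercase tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TokenFilter:
    """Hypothetical per-user statistical filter trained on that user's own mail."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Add one labelled message to the corpus; user feedback goes through here too."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """Estimate P(spam | token) from how often the token shows up in each class."""
        s = self.spam_counts[token] / max(self.n_spam, 1)
        h = self.ham_counts[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # assumed neutral value for tokens never seen in training
        return min(0.99, max(0.01, s / (s + h)))  # clamp so no token is absolute

    def spam_score(self, text):
        """Combine the token probabilities, treating tokens as independent."""
        log_odds = 0.0
        for token in tokenize(text):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Calling `train()` again whenever the user corrects a wrong decision is the feedback step: the counts change, so the token probabilities, and with them the score of the next message, are revised.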
In my next post, I will review the basic ideas of statistical filtering mentioned in the book.