Research On Technologies Of Spam Filtering

Posted on: 2010-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: D N Ou
GTID: 2178360278473099
Subject: Computer system architecture
Abstract/Summary:
With the rapid development and popularization of the Internet, electronic mail (e-mail) has become one of the most important means of communication in daily life, owing to its convenience and low cost. The problem of junk mail (also referred to as "spam"), however, has become increasingly serious in recent years. People have adopted many techniques to fight it, and spam filtering is currently the most commonly used anti-spam method. This thesis focuses on several critical issues in spam filtering.

Because spam has distinctive vocabulary and style, and the topics found in spam seldom appear in legitimate e-mail, automatic text categorization is an effective way to filter spam, and it has become an active research topic in anti-spam filtering. We refer to this approach as "content-based spam filtering". After summarizing existing work on content-based spam filtering, we identify three critical issues: the classification algorithm, the feature selection method, and the Chinese token-cutting algorithm. Comparative experiments show that the SVM classification model and the information gain (IG) based feature selection method outperform the others. We also implement three Chinese token-cutting algorithms and apply them to spam filtering. The results show that simple 2-gram cutting performs as well as the Maximum Match word segmentation algorithm and that, to our surprise, the simplest 1-gram cutting outperforms both. In addition, owing to the particular nature of spam filtering, keeping stop words and punctuation helps classify e-mail correctly.

Content-based spam filters are prone to being fooled by anti-filtering tricks. Based on the observation that spam senders include a URL in their mail in most cases, we propose a novel spam filtering method based on analyzing the characteristics of in-body URLs. We extract URL-related features and use machine learning to train a model and classify incoming mail. Experiments show that this is a fast and effective spam filtering technique.

An important trend in this field is to combine multiple individual techniques to filter spam. Building on the previous work, we finally study and implement a combining strategy based on an improved AdaBoost algorithm. The combining strategy is the critical issue in integrating different techniques into a practical spam filtering system. Applying this strategy gives good results in our experiments: on the same corpus, our result outperforms the best result of the first-stage task of the SEWM2008 Spam Track.
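
The three Chinese token-cutting approaches compared in the abstract (1-gram cutting, 2-gram cutting, and Maximum Match segmentation) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the sample sentence, the toy dictionary, and the `max_len` limit are all assumptions made for illustration, and forward (left-to-right) matching is assumed because the abstract does not say which Maximum Match variant was used. In line with the abstract's remark about punctuation, no characters other than whitespace are discarded.

```python
def ngram_cut(text, n=1):
    """Cut text into overlapping character n-grams, keeping punctuation."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Text shorter than n: return it as a single token (or nothing).
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a word list (assumed variant)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical sample and toy dictionary, for illustration only.
sample = "免费发票,欢迎来电!"
toy_dict = {"免费", "发票", "欢迎", "来电"}

unigrams = ngram_cut(sample, 1)
bigrams = ngram_cut(sample, 2)
words = forward_max_match(sample, toy_dict)
```

The n-gram cutters need no dictionary, which is one practical reason they are attractive for spam filtering: the token stream for obfuscated or unseen words is still well defined, whereas dictionary-based matching falls back to single characters whenever a word is missing from its list.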
Keywords/Search Tags: Spam filtering, Text Categorization, SVM, URL, AdaBoost