Research And Improvement Of Chinese Spam Emails Filtering Method Based On Bayesian Classification

Posted on: 2007-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: R Hu
Full Text: PDF
GTID: 2178360212985429
Subject: Control Science and Engineering
Abstract/Summary:
Email has become an increasingly important channel for exchanging and acquiring information, and with the growth of the Internet it is now one of the most important online applications. At the same time, junk email (spam) is becoming an increasingly serious security problem, attracting attention from both the research community and the general public.
This thesis systematically analyzes the characteristics of Chinese junk email and applies the Bayesian classification method to Chinese spam filtering. In particular, we focus on the word segmentation problem, which is the foundation of the Bayesian classification method. We compare several word segmentation methods for Bayesian filtering of Chinese spam email, including the dictionary-based method, the N-gram method, and manual segmentation, and we compare their performance quantitatively. We find that the N-gram based method is more accurate than the traditional dictionary-based segmentation method in identifying both spam and legitimate email. The optimal N-gram length is between 2 and 3, which reflects the characteristics of Chinese word segmentation. In addition, because it avoids the cost of dictionary lookups, the N-gram based method is several times faster in processing speed.
The CDSCE dataset is a large Chinese-language spam email corpus collected and maintained by CERNET (China Education and Research Network). This thesis also uses CDSCE to demonstrate the efficiency and effectiveness of the N-gram based Bayes classification method: more than 97% of Chinese spam is caught, and only about 1% of normal email is misclassified.
This work should provide theoretical and practical guidance for applying machine learning techniques to the Chinese spam filtering problem. The thesis also gives a preliminary discussion of applying the modified and improved Bayes classification method to filtering short message spam, an emerging problem today. We find that a method that is efficient for filtering spam email does not achieve satisfactory results when filtering short messages. This leads us to believe that other methods, with techniques appropriate to that specific problem, must be developed for short message spam. Based on the method proposed in the thesis, a software implementation has also been developed on the Microsoft Windows platform as part of the master's thesis.
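The combination of N-gram segmentation and Bayesian classification described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the multinomial model, the Laplace (add-one) smoothing, the names `char_ngrams` and `NgramNaiveBayes`, and the handful of training messages are all assumptions made for illustration. Only the N-gram length range of 2 to 3 comes from the abstract.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Split text into overlapping character N-grams, skipping whitespace.

    No dictionary is consulted, which is why this kind of segmentation
    avoids dictionary lookup cost.
    """
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


class NgramNaiveBayes:
    """Hypothetical multinomial naive Bayes over character N-grams."""

    LABELS = ("spam", "ham")

    def __init__(self, n_min=2, n_max=3):
        self.n_min, self.n_max = n_min, n_max
        self.counts = {label: Counter() for label in self.LABELS}
        self.docs = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        self.counts[label].update(char_ngrams(text, self.n_min, self.n_max))
        self.docs[label] += 1

    def log_score(self, text, label):
        # Assumes at least one training message per class.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.counts[label].values())
        # Log prior from the share of training messages with this label.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in char_ngrams(text, self.n_min, self.n_max):
            # Add-one smoothing (an assumption) keeps unseen N-grams
            # from driving the probability to zero.
            score += math.log((self.counts[label][gram] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Made-up training messages, not taken from the CDSCE corpus.
clf = NgramNaiveBayes()
clf.train("免费发票代开", "spam")
clf.train("优惠促销免费领取", "spam")
clf.train("明天下午开会讨论论文", "ham")
clf.train("请查收实验报告", "ham")

print(clf.classify("代开发票免费"))
print(clf.classify("下午讨论实验报告"))
```

Restricting the classifier to a single length (for example `n_min=2, n_max=2`) gives a simple way to compare N-gram lengths on a labeled corpus, which is the kind of comparison the abstract reports.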
Keywords/Search Tags: Email, Spam, Bayesian, segmentation, N-gram