Design and Implementation of a Bayesian-Based Chinese Spam Filter System

Posted on: 2008-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Huang
GTID: 2208360215950201
Subject: Computer application technology
Abstract/Summary:
Electronic mail (e-mail) is becoming one of the fastest and most economical means of communication available. At the same time, the growing problem of junk mail (also referred to as "spam") has created a need for e-mail filtering. Spam is a very serious problem in China, because e-mail filtering technology developed later there than in developed countries, and China now ranks second in the world in the severity of its spam problem. Common anti-spam measures include blacklists and whitelists, manually written rules, keyword-based content filtering, and text categorization. Text categorization algorithms such as Naïve Bayes, KNN, Decision Tree, and Boosting can all be applied to spam filtering. Bayesian classification algorithms are the most widely used because they are easy to design and accurate in their decisions.
Several problems must be solved before Bayesian classification can be applied to Chinese e-mail filtering. Chinese and English differ markedly in both writing and expression. In writing, English words are separated by spaces and punctuation, while Chinese words have no explicit separator. Chinese word segmentation must therefore be performed before Chinese mail is filtered with a Bayesian classification algorithm, and the quality of the segmentation significantly affects the classification results. Chinese and English also differ in expression, so features specific to Chinese mail must be extracted. Exploiting the characteristics of Chinese, spammers insert a large number of special constructions into junk mail, called bad information, which are easy for a reader to understand but difficult for a filter to identify. In this paper, we present a Chinese-oriented Bayesian spam filtering system, which applies Bayesian algorithms to filter Chinese mail.
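As a rough illustration of how Bayesian classification operates on segmented mail, the following minimal Python sketch trains a multinomial Naïve Bayes model on token lists and compares log scores for the two classes. It is not the system described in this abstract: the class name NaiveBayesSpamFilter, the Laplace smoothing constant alpha, the decision threshold, and the toy training messages are all assumptions made for the example, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes filter over pre-segmented tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, tokens, label):
        # Record token frequencies and the message count for the given class.
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label), with Laplace smoothing.
        # Assumes at least one message of each class has been trained.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_tokens = sum(self.token_counts[label].values())
        denom = total_tokens + self.alpha * len(self.vocab)
        for token in tokens:
            score += math.log((self.token_counts[label][token] + self.alpha) / denom)
        return score

    def is_spam(self, tokens, threshold=0.0):
        # Flag as spam when the log-odds of spam over ham exceed the threshold.
        log_odds = self.log_score(tokens, "spam") - self.log_score(tokens, "ham")
        return log_odds > threshold


# Toy training data, invented purely for illustration.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train(["免费", "发票", "优惠"], "spam")
spam_filter.train(["发票", "代开"], "spam")
spam_filter.train(["会议", "安排", "明天"], "ham")
spam_filter.train(["项目", "报告", "会议"], "ham")
```

Raising the threshold above zero makes the filter more conservative, which trades some missed spam for fewer legitimate messages being rejected.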
To improve system performance, we study the technologies adopted in Chinese spam filtering, including Chinese word segmentation, feature extraction, and bad information identification. By analyzing Chinese word segmentation algorithms and the advantages and disadvantages of Chinese dictionary structures, this paper proposes an improved data dictionary, together with a lexical algorithm that combines improved string matching with unknown-word identification to enhance the efficiency and accuracy of segmentation. After weighing the advantages and disadvantages of various feature extraction algorithms, we adopt an algorithm based on feature distribution and keyword weights, and determine its threshold experimentally. Finally, we present a Chinese-mail segmentation algorithm and design a phonetic dictionary, which are shown to identify bad information effectively.
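The dictionary-based string matching mentioned above can be illustrated with forward maximum matching, a common baseline for Chinese word segmentation. The sketch below shows only that baseline, not the improved algorithm or the unknown-word identification proposed in the thesis. The function name forward_max_match, the max_len parameter, and the tiny dictionary are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so that words missing
            # from the dictionary do not stall the scan.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Tiny dictionary invented for illustration only.
demo_dictionary = {"免费", "代开", "发票", "咨询"}
demo_tokens = forward_max_match("免费代开发票", demo_dictionary)
```

Because the scan is greedy, the result depends heavily on dictionary coverage and on max_len, which is one reason dictionary design and unknown-word handling matter for segmentation accuracy.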
Keywords/Search Tags: email filtering, Bayesian classification, Chinese segmentation, feature extraction, bad information identification