Research Of Chinese Spam Filtering Algorithm Based On Bayes Theory

Posted on: 2010-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: L Q Bao
GTID: 2178360278451569
Subject: Computer software and theory
Abstract/Summary:
With the rapid spread of the Internet, e-mail has become one of the primary means of communication. The flood of spam, however, has drawn much attention: spam wastes users' time and energy, consumes large amounts of network bandwidth and storage, and poses potential security risks to networks and information. Spam filtering is therefore a subject of considerable practical significance.
Content-based filtering is an important anti-spam technology; at present it relies mainly on keyword filtering, rule-based techniques, and statistical learning methods. The Naive Bayes algorithm, which is grounded in probability and statistics, has been widely used in spam filtering for its simplicity, efficiency, and accuracy. It also has shortcomings: it does not transfer well to Chinese e-mail, it does not take the risk of misclassification into account, and it does not support incremental learning.
This thesis analyzes the differences between classifying English and Chinese e-mails and discusses Chinese e-mail preprocessing, including e-mail parsing, Chinese word segmentation, and feature selection, and then applies the Naive Bayes algorithm to Chinese e-mail filtering. Misclassifying a legitimate e-mail as spam costs the user more than the reverse error, a difference the traditional Bayesian algorithm ignores. Drawing on the idea of minimizing loss, a minimum-risk Bayesian algorithm is proposed; it can be tuned to the user's needs by adjusting the value of the loss weight.
Because the information it has stored is limited, a Bayesian classifier can easily misclassify new e-mails; if these incorrectly labeled e-mails are added to the classifier prematurely, its performance degrades. In addition, a traditional Bayesian classifier spends a great deal of time relearning all e-mails.
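As an illustration of the minimum-risk idea described earlier in this abstract, the following Python sketch classifies an already segmented message as spam only when the evidence for spam outweighs the evidence for legitimate mail by more than a loss weight. The abstract does not give the thesis's actual formula, so the decision rule, the Laplace smoothing, and the names `train`, `log_joint`, `classify`, and `loss_weight` are assumptions made for illustration; the thesis itself was implemented in Java.

```python
import math
from collections import Counter

def train(docs):
    """Count words and documents per class.

    docs: list of (tokens, label) pairs, label is "spam" or "ham".
    Tokens are assumed to come from a Chinese word segmenter.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    return word_counts, doc_counts

def log_joint(tokens, label, word_counts, doc_counts):
    """log P(label) + sum of log P(token | label), Laplace-smoothed."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for t in tokens:
        score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
    return score

def classify(tokens, word_counts, doc_counts, loss_weight=1.0):
    """Assumed minimum-risk rule: label as spam only if
    P(spam | x) > loss_weight * P(ham | x).

    loss_weight is the assumed cost of blocking a legitimate e-mail
    relative to the cost of letting one spam through.
    """
    spam = log_joint(tokens, "spam", word_counts, doc_counts)
    ham = log_joint(tokens, "ham", word_counts, doc_counts)
    return "spam" if spam - ham > math.log(loss_weight) else "ham"

# Toy training set (invented for this example, not from the thesis corpus).
docs = [
    (["免费", "发票"], "spam"),
    (["免费", "优惠"], "spam"),
    (["会议", "通知"], "ham"),
    (["项目", "会议"], "ham"),
]
word_counts, doc_counts = train(docs)
message = ["免费", "发票", "会议"]
print(classify(message, word_counts, doc_counts, loss_weight=1.0))  # spam
print(classify(message, word_counts, doc_counts, loss_weight=5.0))  # ham
```

With a loss weight of 1 the rule reduces to ordinary Naive Bayes; raising the weight makes the filter more reluctant to block mail, which is one way to reflect the higher cost of losing a legitimate message.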
To address mislabeled samples and the cost of full retraining, an incremental learning algorithm based on user feedback is put forward. Built on the minimum-risk Bayesian classifier, it learns new samples to update the classifier, and the calculation formula for incremental learning is given. The proposed algorithm is implemented in Java. Experiments on the CDSCE corpus examine how the number of features and the loss factor relate to the filtering results and yield a set of preferred parameter values. The results also show that the incremental learning algorithm based on user feedback outperforms the traditional Bayesian algorithm.
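One way to picture feedback-driven incremental learning is a classifier that keeps only running counts, so a sample confirmed by the user can be added (or a wrongly learned one withdrawn) without retraining on the whole mailbox. The Python sketch below is a minimal illustration under that assumption; it is not the thesis's update formula, which the abstract does not state. The class name `IncrementalBayesFilter`, its methods, and the smoothing constants are hypothetical.

```python
import math
from collections import Counter

class IncrementalBayesFilter:
    """Count-based Bayesian filter updated one sample at a time (sketch)."""

    def __init__(self, loss_weight=1.0):
        # Assumed relative cost of blocking a legitimate e-mail.
        self.loss_weight = loss_weight
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def learn(self, tokens, label):
        """Add one user-confirmed sample by incrementing the counts."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)

    def unlearn(self, tokens, label):
        """Withdraw a sample that was previously learned under `label`."""
        self.doc_counts[label] -= 1
        self.word_counts[label].subtract(tokens)

    def _log_joint(self, tokens, label):
        # Vocabulary: words with a positive count in either class.
        vocab = {w for c in self.word_counts.values()
                 for w, n in c.items() if n > 0}
        total = sum(self.word_counts[label].values())
        # Smoothed prior so a class with no samples yet is still defined.
        score = math.log((self.doc_counts[label] + 1)
                         / (sum(self.doc_counts.values()) + 2))
        for t in tokens:
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((self.word_counts[label][t] + 1)
                              / (total + len(vocab) + 1))
        return score

    def classify(self, tokens):
        """Label as spam only if the log-odds exceed log(loss_weight)."""
        diff = self._log_joint(tokens, "spam") - self._log_joint(tokens, "ham")
        return "spam" if diff > math.log(self.loss_weight) else "ham"

# Toy usage with pre-segmented tokens (invented for this example).
f = IncrementalBayesFilter(loss_weight=1.0)
f.learn(["免费", "发票"], "spam")
f.learn(["会议", "通知"], "ham")
new_mail = ["优惠", "促销"]
before = f.classify(new_mail)   # both words unseen so far
f.learn(new_mail, "spam")       # user reports the message as spam
after = f.classify(new_mail)
print(before, after)
```

Each update touches only the counts for the words in one message, so its cost does not grow with the number of e-mails already learned. Learning only from labels the user has confirmed is one way to keep misclassified messages out of the model.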
Keywords/Search Tags: Bayesian Algorithm, Spam Filtering, Chinese Word Segmentation, Minimum Risk, Incremental Learning