Research Of Chinese Spam Filtering Algorithm Based On Bayes Theory

Posted on:2010-10-22

Degree:Master

Type:Thesis

Country:China

Candidate:L Q Bao

Full Text:PDF

GTID:2178360278451569

Subject:Computer software and theory

Abstract/Summary:

With the rapid spread of the Internet, e-mail has become one of the primary means of communication. The accompanying flood of spam has also drawn wide attention: spam wastes users' time and energy, consumes large amounts of network bandwidth and storage, and introduces potential network and information security risks. Spam filtering is therefore a subject of real practical significance.

Content-based spam filtering is an important anti-spam technology. At present it relies mainly on keyword filtering, rule-based techniques, and statistical learning methods. The Naive Bayes algorithm, which is grounded in probability and statistics, has been widely used for spam filtering because of its simplicity, efficiency, and accuracy. It also has shortcomings: it does not transfer well to Chinese e-mail filtering, it does not take the risk of misclassification into account, and it does not support incremental learning.

This thesis analyzes the differences between classifying English and Chinese e-mails and discusses Chinese e-mail preprocessing, including e-mail parsing, Chinese word segmentation, and feature selection. It then applies the Naive Bayes algorithm to Chinese e-mail filtering. Misclassifying legitimate mail as spam costs the user more than letting spam through, but the traditional Bayesian algorithm does not account for this asymmetry. By introducing the idea of loss minimization, a minimum-risk Bayesian algorithm is proposed; it lets the user tune the filtering behavior by adjusting the loss weight.

Because the information available to it is limited, a Bayesian classifier can easily misclassify new e-mails, and if these mislabeled e-mails are added to the classifier prematurely, its performance degrades. In addition, a traditional Bayesian classifier spends a great deal of time relearning all e-mails from scratch. To resolve these problems, an incremental learning algorithm based on user feedback is put forward. Built on the minimum-risk Bayesian classifier, it learns from new samples to update the classifier, and the formula for the incremental update is given.

The proposed algorithms are implemented in Java. Experiments on the CDSCE corpus examine how the number of features and the loss factor relate to the filtering results and yield a set of preferred parameter values. The results also show that the incremental learning algorithm based on user feedback outperforms the traditional Bayesian algorithm.
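
The abstract does not reproduce the thesis's formulas, so the following Java sketch is only an illustration of the two ideas it names, not the author's implementation. It assumes a multinomial Naive Bayes model with add-one smoothing, a single loss weight that expresses how much worse it is to block a legitimate mail than to let a spam through, and e-mails that have already been split into word tokens by some upstream Chinese word segmenter. The class name `MinRiskBayesSketch`, its methods, and the placeholder tokens are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only: names, smoothing, and decision rule are assumptions. */
final class MinRiskBayesSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamMails = 0, hamMails = 0;
    private long spamTokens = 0, hamTokens = 0;
    // Assumed loss weight: cost of blocking a legitimate mail relative to
    // the cost of missing a spam (1.0 reduces to plain Naive Bayes).
    private final double lossWeight;

    MinRiskBayesSketch(double lossWeight) {
        if (lossWeight <= 0) throw new IllegalArgumentException("lossWeight must be positive");
        this.lossWeight = lossWeight;
    }

    /** Incremental update: fold one labeled mail into the counts; no retraining pass. */
    void learn(List<String> tokens, boolean spam) {
        Map<String, Integer> counts = spam ? spamCounts : hamCounts;
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            vocabulary.add(t);
        }
        if (spam) { spamMails++; spamTokens += tokens.size(); }
        else { hamMails++; hamTokens += tokens.size(); }
    }

    /** log P(spam | mail) - log P(ham | mail), with add-one smoothing. */
    double logOdds(List<String> tokens) {
        double slots = vocabulary.size() + 1.0; // one extra slot for unseen words
        double score = Math.log((spamMails + 1.0) / (hamMails + 1.0));
        for (String t : tokens) {
            double pSpam = (spamCounts.getOrDefault(t, 0) + 1.0) / (spamTokens + slots);
            double pHam = (hamCounts.getOrDefault(t, 0) + 1.0) / (hamTokens + slots);
            score += Math.log(pSpam / pHam);
        }
        return score;
    }

    /** Minimum-risk rule: flag as spam only if P(spam|mail) > lossWeight * P(ham|mail). */
    boolean isSpam(List<String> tokens) {
        return logOdds(tokens) > Math.log(lossWeight);
    }

    public static void main(String[] args) {
        // English placeholders stand in for segmented Chinese words.
        MinRiskBayesSketch filter = new MinRiskBayesSketch(5.0);
        filter.learn(Arrays.asList("free", "invoice", "discount"), true);
        filter.learn(Arrays.asList("meeting", "report", "tomorrow"), false);
        List<String> mail = Arrays.asList("free", "invoice");
        System.out.println("before feedback: " + filter.isSpam(mail));
        filter.learn(mail, true); // label confirmed by the user, then learned
        System.out.println("after feedback:  " + filter.isSpam(mail));
    }
}
```

In this sketch, raising the loss weight raises the evidence required before a mail is flagged, which is one way to read the abstract's adjustable loss weight. Calling `learn` only with labels the user has confirmed reflects the stated concern that mislabeled mails added too early would degrade the classifier.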

Keywords/Search Tags:

Bayesian Algorithm, Spam Filtering, Chinese Word Segmentation, Minimum Risk, Incremental Learning

Related items

1	Algorithm. Bayesian Spam Filtering Technology Research And Application
2	Spam Filtering System Based On Bayesian Algorithm Research And Design
3	Study On Spam Filtering Technology Based Bayes
4	Development And Research Of Spam Filter System Based On Bayesian Algorithm
5	Research And Implementation Of Chinese Spam Filter Technology Based On Content Mining
6	Based On Minimal Risk, Bayesian Multi-level Spam Filtering System
7	An Incremental-styled Learning Chinese Word Segmentation System Based On Perceptron Algorithm Design And Implementation
8	The Study Of College Digital Campus Web Information Filtering
9	Research On Chinese Spam Filtering Technology
10	Study And Application On Chinese-Spam Filtering Technology