
The Research On Chinese Spam Filtering

Posted on: 2013-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wang
GTID: 2248330395986732
Subject: Computer application technology
Abstract/Summary:
With the continued growth of the Internet and of the number of e-mail users, e-mail has become an essential communication tool. Spam has spread along with it, causing considerable harm and loss to e-mail service providers and users. Anti-spam technology has been developed and improved continually in response, and filtering techniques based on machine learning have become the mainstream approach. Machine learning methods can be applied to spam filtering in different languages, but most earlier research has studied English spam filtering, and little has addressed Chinese. To find filtering models and techniques better suited to Chinese spam, this thesis studies and analyzes Chinese spam filtering.

First, starting from Chinese spam itself, the thesis analyzes its characteristics, filtering approaches, and several machine-learning-based filtering algorithms, which provides the theoretical basis for the subsequent work. It examines a variety of feature extraction methods, adopts 4-gram feature extraction, analyzes its advantages, and gives the specific extraction procedure. Based on this analysis, the thesis adopts an online filtering model, which improves the filter's adaptability.

Second, the thesis analyzes the filtering principles, and the performance on Chinese data sets, of a generative model, represented by Naive Bayes, and of discriminative models, represented by logistic regression and the Relaxed Online Support Vector Machine. On this basis, it improves some of the methods and selects and tunes parameters to obtain the best filter for the Chinese data sets. It then compares the filtering performance of the three models on four Chinese data sets.
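As a rough illustration of how 4-gram features can feed an online filter, the sketch below extracts overlapping character 4-grams (which needs no Chinese word segmentation) and trains a logistic regression model one message at a time. It is a minimal sketch under stated assumptions, not the thesis's implementation: the class name `OnlineLogisticFilter`, the hashed binary features, the bucket count, the learning rate, and the two sample messages are all hypothetical.

```python
import math
import zlib
from collections import defaultdict


def char_4grams(text, n=4):
    """Return the overlapping character n-grams of a message."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class OnlineLogisticFilter:
    """Logistic regression over hashed binary 4-gram features (illustrative only)."""

    def __init__(self, num_buckets=2 ** 18, learning_rate=0.5):
        self.num_buckets = num_buckets      # assumed size of the hashed feature space
        self.learning_rate = learning_rate  # assumed step size
        self.weights = defaultdict(float)   # weights start at zero

    def _features(self, text):
        # Map each distinct 4-gram to a bucket index; crc32 keeps this deterministic.
        return {zlib.crc32(g.encode("utf-8")) % self.num_buckets
                for g in set(char_4grams(text))}

    def score(self, text):
        """Estimated probability that the message is spam."""
        z = sum(self.weights[i] for i in self._features(text))
        z = max(-30.0, min(30.0, z))        # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text, is_spam):
        """One stochastic gradient step on a single labeled message."""
        error = (1.0 if is_spam else 0.0) - self.score(text)
        for i in self._features(text):
            self.weights[i] += self.learning_rate * error


# Online use: score each arriving message, then learn from its label.
filt = OnlineLogisticFilter()
spam_msg = "免费领取大奖点击链接"    # made-up spam example
ham_msg = "明天下午三点开会讨论"     # made-up legitimate example
for _ in range(20):
    filt.update(spam_msg, True)
    filt.update(ham_msg, False)
```

Because the model is updated after every message, it can keep adapting as spam changes, which is the adaptability an online filter is meant to provide.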
The results show that the discriminative models perform better than the generative model, and that the Relaxed Online Support Vector Machine (ROSVM) performs best of the three.

Lastly, having established that discriminative models perform best on the Chinese data sets, the thesis introduces two active learning algorithms, b-Sampling and TONE, to improve filter performance further, adapting the algorithms and tuning their parameters. Both are applied to logistic regression and ROSVM on the same four Chinese data sets and compared with each other and with filtering without active learning. The experimental results show that introducing active learning improves performance, and that TONE performs better than b-Sampling. The time cost of TONE was measured only for ROSVM on the SEWM11 data, where active learning reduced the time by nearly a factor of ten.
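The abstract does not define the two active learning algorithms. As a hedged illustration of b-Sampling only, the sketch below assumes the formulation commonly used in online active learning: a label is requested with probability b / (b + |p|), where p is the classifier's signed margin for the message, so messages near the decision boundary are queried most often. The function names, the value of b, and the sample margins are hypothetical and not taken from the thesis.

```python
import random


def query_probability(margin, b=0.1):
    """Assumed b-Sampling rule: chance of requesting a label for one message."""
    return b / (b + abs(margin))


def select_for_labeling(margins, b=0.1, seed=0):
    """Return the indices of the messages whose labels would be requested."""
    rng = random.Random(seed)  # seeded so the illustration is repeatable
    return [i for i, m in enumerate(margins)
            if rng.random() < query_probability(m, b)]


# Made-up margins: small values are uncertain, large values are confident.
margins = [0.0, 0.05, 2.0, -3.5, 0.0]
chosen = select_for_labeling(margins)
```

Only the selected messages would be labeled and used to update the filter, which is how active learning cuts the labeling and training cost.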
Keywords/Search Tags: machine learning, feature extraction, generative model, discriminative model, active learning