The Research On Chinese Spam Filtering

Posted on:2013-02-27

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Wang

Full Text:PDF

GTID:2248330395986732

Subject:Computer application technology

Abstract/Summary:

With the continuous development of the Internet and the growing number of e-mail users, e-mail has become an essential communication tool. Spam has spread along with it, causing great harm and loss to e-mail service providers and users. To address the problem, anti-spam technology is continually developing and improving, and filtering techniques based on machine learning have become the mainstream of anti-spam technology. Machine learning methods can be applied to spam filtering in different languages. Previous research has mostly studied spam filtering for English, while studies on Chinese are few. To find filtering models and techniques better suited to Chinese spam, this thesis studies and analyzes Chinese spam filtering.

First, starting from Chinese spam itself, the thesis analyzes its features, the filtering approaches used against it, and several machine-learning-based filtering models, which provides the theoretical basis for the subsequent research. It studies a variety of feature extraction methods, adopts 4-gram feature extraction, analyzes its advantages, and gives the specific extraction process. Based on this analysis, the thesis uses an online filtering model, which improves the adaptability of the filter.

Second, the thesis analyzes the filtering algorithms of the generative model, represented by the Naive Bayes model, and of the discriminative models, represented by the logistic regression model and the Relaxed Online Support Vector Machine (ROSVM) model, together with their filtering performance on Chinese data sets. On this basis, some methods are improved and parameters are selected and tuned to obtain the best filtering model on the Chinese data sets. The filtering performance of the three models is then compared on four Chinese data sets. The results show that the discriminative models perform better than the generative model, and that ROSVM achieves the best performance.

Finally, since the discriminative models perform best on the Chinese data sets, the thesis introduces two active learning algorithms, b-Sampling and TONE, to further improve filter performance, adapting the algorithms and tuning their parameters. Both are applied to logistic regression and ROSVM on the same four Chinese data sets and compared with each other and with filtering without active learning. The experimental results show that introducing active learning gives better results, and that TONE performs better than b-Sampling. The time cost of TONE was tested only for ROSVM on the SEWM11 data, where the active learning method reduced it by nearly a factor of ten.
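
The combination of 4-gram feature extraction and an online discriminative filter described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes character-level (rather than byte-level) 4-grams, binary features, and a plain gradient update for logistic regression with a hypothetical learning rate of 0.1. The names `char_ngrams` and `OnlineLogisticRegression` and the two sample messages are made up for illustration.

```python
import math


def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of text (assumed: character level)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class OnlineLogisticRegression:
    """Logistic regression over sparse binary features, updated one message at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate  # hypothetical value, not from the thesis
        self.weights = {}                   # n-gram -> weight

    def predict_proba(self, features):
        """Estimated probability that a message with these features is spam."""
        z = sum(self.weights.get(f, 0.0) for f in features)
        z = max(-30.0, min(30.0, z))        # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """One gradient step on a single labeled message (label: 1 = spam, 0 = ham)."""
        error = label - self.predict_proba(features)
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error


# Illustrative messages only; real experiments would stream a labeled corpus.
spam_text = "免费领取奖品点击链接"
ham_text = "明天下午开会请准时"

model = OnlineLogisticRegression()
for _ in range(5):
    model.update(set(char_ngrams(spam_text)), 1)
    model.update(set(char_ngrams(ham_text)), 0)

spam_prob = model.predict_proba(set(char_ngrams(spam_text)))
ham_prob = model.predict_proba(set(char_ngrams(ham_text)))
```

Because the 4-grams are taken directly from the character sequence, no Chinese word segmentation step is needed, and because the model is updated after each message, it can keep adapting as new mail arrives.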

Keywords/Search Tags:

machine learning, feature extraction, generative model, discriminative model, active learning

Related items

1	The Research And Implementation Of Singing Voice Detection System Based On Active Learning Technique
2	Design And Implementation Of Feature Extraction System For Large-Scale Structured Data
3	Machine Learning Based Complex Surface Feature Extraction And Segmentation Method And Its Applications
4	Research On Active Learning Algorithm Based On Extreme Learning Machine
5	Research On Feature Description And Classifier Construction Algorithm In Chinese Text Classification
6	Research On The Trend Feature Extraction Of Securities Data Based On Machine Learning
7	Study Of Active Learning Algorithms On Imbalanced Data Using Extreme Learning Machine
8	Research And Application Of Machine Learning In Gesture Recognition
9	Application Research On Feature Extraction And Classification Of EEG Signal With The Method Of ELM
10	Research On Detection Algorithm Of Extortion Software Based On Machine Learning