Research On Chinese Spam Filtering Method

Posted on:2017-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:R Y Wei

Full Text:PDF

GTID:2348330482999743

Subject:Computer system architecture

Abstract/Summary:

The growing volume of spam causes great inconvenience in people's daily lives. In China, where a large population sends and receives a great number of emails, dealing with spam wastes even more resources. The naive Bayes algorithm has been widely used in spam filtering because it is fast to compute and easy to implement. In the filtering process, word segmentation and feature extraction are two very important phases. In most current spam filtering methods for Chinese, the word segmentation process is often very complex; when the training set contains a large number of emails and individual words serve as the text feature unit, the time efficiency of the algorithm becomes a bottleneck for mail filtering. In addition, the existing feature evaluation functions used in feature extraction do not fully match the characteristics of spam, so the extracted features are not representative enough. To address these problems and improve spam filtering performance, this thesis makes the following contributions.

First, in the segmentation stage of preprocessing, we use a TRIE tree as the dictionary structure, combined with the forward maximum matching principle. We then apply the phrase analysis methods proposed in text categorization, using limited semantic analysis of basic noun phrases and verb phrases to convert the vector space model from a word-based pattern to a basic-phrase pattern. This method maintains segmentation precision and efficiency while improving segmentation speed.

Second, in the feature extraction stage, we take the characteristics of spam into account and address problems such as positive and negative correlation, the neglect of word frequency, the handling of low-frequency words, and the different contribution of features in different locations. To this end we propose an improved mutual information feature evaluation function. This method greatly reduces the dimensionality of the feature vector space while ensuring that the features extracted from the text remain strongly representative.

Finally, building on these two points, we propose an improved phrase-based naive Bayesian spam filtering method for Chinese and carry out simulation experiments. The experiments verify the following results: using a TRIE tree combined with the maximum matching principle improves segmentation efficiency; using basic phrases instead of words as the basic feature unit reduces the dimensionality of the vector space; using the improved feature evaluation function improves the performance of the filter; and the improved naive Bayesian method achieves better filtering results on every evaluation metric.
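
The segmentation step described above, a TRIE dictionary combined with forward maximum matching, can be illustrated with a minimal sketch. This is not the thesis's implementation: the class and function names (`Trie`, `longest_match`, `segment`) and the tiny sample dictionary are assumptions made for illustration, and the later step that merges words into basic noun and verb phrases is not shown.

```python
class TrieNode:
    """One node of the dictionary trie; children are keyed by character."""
    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Hypothetical dictionary carrier for the segmenter (illustrative only)."""
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Return the exclusive end index of the longest dictionary word
        beginning at `start`, or `start` itself if no word matches."""
        node = self.root
        end = start
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end


def segment(text, trie):
    """Forward maximum matching: scan left to right and always take the
    longest dictionary word; unknown characters become single tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        end = trie.longest_match(text, pos)
        if end == pos:          # no dictionary word starts here
            end = pos + 1
        tokens.append(text[pos:end])
        pos = end
    return tokens


# Assumed sample dictionary, not taken from the thesis.
dictionary = Trie(["免费", "免费领取", "领取", "优惠", "优惠券"])
print(segment("免费领取优惠券", dictionary))  # ['免费领取', '优惠券']
```

Because the trie is walked one character at a time, each lookup costs at most the length of the longest dictionary word, rather than repeated substring comparisons against the whole dictionary. That is one plausible reason for the speed improvement the abstract reports.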

Keywords/Search Tags:

Chinese Spam filtering, Bayesian, TRIE tree, basic phrase, feature extraction

Related items

1	Based On Bayesian Chinese Spam Filter System Design And Implementation
2	Application Of Bayesian Classification In Spam SMS Filtering
3	Research And Implementation Of Chinese Spam Filter Technology Based On Content Mining
4	Research And Implementation Of Content-Based Spam Filter Technology
5	Research And Implementation Of Spam Pages Filtering Based On Bayesian And Decision Tree Algorithms
6	Research And Implementation Of Chinese Spam Filtering Method Based On Data Mining
7	The Research And Implement On The Chinese Anti-Spam Filtering System Based On Advanced Winnow Algorithm
8	Research And Improvement Of Chinese Spam Emails Filtering Method Based On Bayesian Classification
9	Study On Spam Filtering Technology Based Bayes
10	Research Of Chinese Spam Filtering Algorithm Based On Bayes Theory