
Research On Chinese Spam Filtering Method

Posted on: 2017-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Wei
Full Text: PDF
GTID: 2348330482999743
Subject: Computer system architecture
Abstract/Summary:
The growing volume of spam has brought great inconvenience to people's daily lives. In China, where a large population sends and receives a great number of emails, dealing with spam wastes even more resources. The naive Bayes algorithm has been widely used in spam filtering because it is fast and easy to implement. In the filtering process, word segmentation and feature extraction are two very important phases. At present, most spam filtering methods for Chinese rely on a very complex word segmentation process; when the training sample is a massive email collection and individual words are used as the text feature unit, the time efficiency of the algorithm becomes a bottleneck for mail filtering. In addition, the existing feature evaluation functions used in feature extraction do not fully match the characteristics of spam, so the extracted features are not representative enough. To address these problems and improve spam filtering performance, this thesis carries out the following work. In the segmentation stage of preprocessing, we use a TRIE tree as the dictionary structure, combined with the forward maximum matching principle, and then apply the phrase analysis methods proposed in text categorization, using limited semantic analysis of units such as basic noun phrases and verb phrases to convert the vector space model from a word-based representation to a basic-phrase-based one.
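The dictionary lookup described here can be sketched as follows. This is a minimal illustration of a trie-backed forward maximum matching segmenter, not the thesis's implementation: the names `TrieNode`, `Trie`, `longest_match`, and `forward_max_match` are hypothetical, the four-word dictionary is invented for the example, and the phrase-analysis step that merges words into basic phrases is not shown.

```python
class TrieNode:
    """One node of the dictionary trie; children are keyed by character."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Length of the longest dictionary word beginning at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def forward_max_match(text, trie):
    """Scan left to right, always taking the longest dictionary word available."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Assumption: a character with no dictionary match becomes a one-character token.
        length = trie.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_trie = Trie(["垃圾", "邮件", "垃圾邮件", "过滤"])
print(forward_max_match("垃圾邮件过滤器", toy_trie))
```

With this toy dictionary the text 垃圾邮件过滤器 is split into 垃圾邮件, 过滤, and 器. Because the trie is walked once per starting position, the longest match is found without testing every candidate substring against the dictionary separately.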
This segmentation method preserves precision while improving segmentation speed. Then, in the feature extraction stage, taking the characteristics of spam into account, we address problems such as positive and negative correlation, the neglect of word frequency, low-frequency words, and the different contributions of features at different locations, and put forward an improved mutual information feature evaluation function for feature extraction. This method can greatly reduce the dimensionality of the feature vector space while ensuring that the features extracted from the text are strongly representative. Finally, based on the above two points, we put forward an improved phrase-based naive Bayesian spam filtering method for Chinese and carry out a simulation experiment. The experiment verifies the following results: using a TRIE tree combined with the maximum matching principle improves segmentation efficiency; using basic phrases instead of words as the basic feature unit reduces the dimensionality of the vector space; using the improved feature evaluation function improves the performance of the filter; and the improved naive Bayesian method achieves better filtering results on every evaluation metric.
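For orientation, the baseline pipeline that this work builds on can be sketched as below. The abstract does not give the improved evaluation function, so this sketch uses the standard mutual information score log(P(t, c) / (P(t) P(c))) estimated from document frequencies, followed by a plain multinomial naive Bayes classifier with add-one smoothing. All function names, the 0.5 smoothing constant, and the four toy documents are assumptions made for illustration; in the phrase-based variant the tokens would be basic phrases rather than words.

```python
import math
from collections import Counter, defaultdict


def mutual_information(docs, labels, target):
    """Standard MI of each term with class `target`, from document frequencies."""
    n = len(docs)
    n_c = sum(1 for y in labels if y == target)
    df = Counter()    # number of documents containing the term
    df_c = Counter()  # number of `target` documents containing the term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == target:
                df_c[t] += 1
    # Assumption: add 0.5 to the joint count so unseen (term, class) pairs stay finite.
    return {t: math.log((df_c[t] + 0.5) * n / (df[t] * n_c)) for t in df}


def select_features(docs, labels, k):
    """Keep the k terms whose best per-class MI score is highest."""
    best = {}
    for c in set(labels):
        for t, score in mutual_information(docs, labels, c).items():
            best[t] = max(score, best.get(t, float("-inf")))
    return set(sorted(best, key=best.get, reverse=True)[:k])


def train_naive_bayes(docs, labels, vocab):
    """Multinomial naive Bayes with add-one (Laplace) smoothing over `vocab`."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        term_counts[y].update(t for t in tokens if t in vocab)
    model = {}
    for c, n_docs in class_docs.items():
        total = sum(term_counts[c].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                    for t in vocab}
        model[c] = (log_prior, log_like)
    return model


def classify(tokens, model):
    """Return the class with the highest posterior log score."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)


# Hypothetical, already-segmented toy corpus.
docs = [["免费", "发票"], ["免费", "优惠"], ["会议", "通知"], ["项目", "会议"]]
labels = ["spam", "spam", "ham", "ham"]
vocab = select_features(docs, labels, k=4)
model = train_naive_bayes(docs, labels, vocab)
print(classify(["免费", "发票"], model))
```

On this toy corpus the term 免费 appears in both spam documents yet scores lower than 发票, which appears in only one. That bias of plain mutual information toward low-frequency terms is one of the problems listed above as motivation for an improved evaluation function.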
Keywords/Search Tags:Chinese Spam filtering, Bayesian, TRIE tree, basic phrase, feature extraction