Chinese Spam Filtering Based On Cross Cover Algorithm

Posted on: 2008-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Wang
Full Text: PDF
GTID: 2178360215496537
Subject: Computer application technology
Abstract/Summary:
The development of the Internet has brought entirely new network experiences. Among them, e-mail has become a fast and economical means of communication. Although e-mail is convenient, it is also becoming an important carrier of advertisements, viruses, malicious programs and harmful information. This inconveniences users and seriously harms network security, so solving the spam problem is urgent.

Many methods can be used to address spam. Spam filtering is currently one of the mainstream approaches, and it includes IP-based, keyword-based and content-based filtering. Each of these methods considers only part of the information in an e-mail. This dissertation analyzes the methods above, summarizes their advantages, and proposes considering the mail address, keywords and content simultaneously when filtering spam.

The major contributions of this dissertation can be summed up in six points:

1) After carefully analyzing the e-mail format, the dissertation discusses in detail how to implement e-mail reception in a VC environment.

2) We improve on traditional content-based spam filtering by considering the mail address, subject, attachment and content at the same time and taking these factors as the attributes of the classifier. Experiments indicate that these attributes have a great impact on the filtering result.

3) To reduce the dimensionality of the attribute vectors, we experiment separately with several feature reduction methods commonly used in text categorization: Document Frequency, the χ² statistic, Mutual Information, Information Gain, Expected Cross Entropy and Weight of Evidence for Text. According to the results, the χ² statistic and Expected Cross Entropy are the most useful for reducing dimensions. Document Frequency and Weight of Evidence for Text are less effective, while Mutual Information and Information Gain are the least effective of all.

4) After obtaining the attributes of the e-mail, we need an appropriate means of classification. This work is the first to adopt the cross cover algorithm, proposed by Ling Zhang and Bo Zhang, to filter spam. In the experiments, we compare the cross cover algorithm as a classifier against SVM. The experiments show that the cross cover algorithm is an excellent classifier that filters spam effectively with high accuracy.

5) Spam filtering carries risk, because the recipient would rather receive more spam than miss a normal mail. We discuss the classification process of the cross cover algorithm from the perspective of this risk. Based on the analysis, we propose an improvement to how the cross cover algorithm handles "rejection" samples, so that the risk can be reduced by changing the area affected by normal mail.

6) Different pattern recognition methods have different advantages and disadvantages. Guided by the principle that the minority is subordinate to the majority, we discuss the feasibility of constructing a voting e-mail filtering model based on multiple pattern recognition methods.

Issues to be further analyzed are as follows:

1) This work focuses on spam written in Chinese; the research can be extended to spam written in other languages.

2) The main type of mail discussed here is letter-type, but e-mail technology is developing and e-mail will take more forms. How to extract more useful information from many types of e-mail requires more attention in future work.

3) Further work can be done on spam filtering based on multiple pattern recognition methods.
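
The chi-square (χ²) statistic named in contribution 3 can be illustrated with a small sketch. This is not the dissertation's implementation: it uses the standard text-categorization form χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), and it assumes that each mail has already been split into tokens (for Chinese text, by some word segmenter). The function names `chi_square_scores` and `top_terms`, the label `"spam"`, and the toy data are hypothetical.

```python
from collections import Counter


def chi_square_scores(docs, labels, target="spam"):
    """Score every term by its chi-square statistic against the target class.

    docs   -- list of token lists (tokenization is assumed to be done already)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)

    # Document frequency of each term inside and outside the target class.
    df_target, df_other = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (df_target if y == target else df_other)[term] += 1

    scores = {}
    for term in set(df_target) | set(df_other):
        a = df_target[term]            # target docs containing the term
        b = df_other[term]             # other docs containing the term
        c = n_target - a               # target docs without the term
        d = (n - n_target) - b         # other docs without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A term that occurs in every document (or a single-class corpus)
        # carries no information; score it 0 instead of dividing by zero.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_terms(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy data, purely illustrative.
docs = [["free", "offer"], ["free", "prize"],
        ["meeting", "report"], ["report", "offer"]]
labels = ["spam", "spam", "ham", "ham"]
scores = chi_square_scores(docs, labels)
selected = top_terms(scores, 2)
```

Reducing the attribute vector then amounts to keeping only the terms in `selected`. In this toy data "offer" appears equally in both classes and scores 0, so it is dropped first.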
Keywords/Search Tags: Spam, Feature Dimension Reduction, Pattern Recognition, Text Categorization