Research Of Content-Based Spam Filtering

Posted on: 2011-07-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Sun
Full Text: PDF
GTID: 1118330335467136
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid development of the Internet, the growing problem of junk mail (also referred to as "spam") has caused wide public concern. Many techniques can be applied to the problem today, and content-based spam filtering is one of the mainstream technologies. It uses automated text categorization and information filtering to identify spam. Given the proliferation of spam in China, this dissertation takes Chinese spam filtering as its research direction and studies the related technologies in depth: the LSA (Latent Semantic Analysis) method, a Mail Fingerprint (MF) generation strategy, spam filtering based on the FSVM (Fuzzy Support Vector Machine), the FCA (Fuzzy Clustering Algorithm) applied to spam filtering, and GT (Game Theory)-based feature selection.

1. Because LSA inherits the vector space model and neglects its own characteristics, its weight calculation lacks prior information about the documents and global document information, which makes it overly mechanical in practical applications. To address this problem in the LSA weight calculation, a new weighting function is presented that improves the original definition of the weights. Combined with an active learning method for latent semantic analysis, it yields a spam filtering model better suited to practical use. In a large local area network with thousands of users, most mass-generated spam spreads by dynamically changing its subject or sender address. We therefore use the MD5 (Message-Digest Algorithm 5) algorithm, building on the LSA analysis, to construct "e-mail fingerprints" for mass-generated spam. This addresses the inefficiency of traditional filtering techniques on mass-mailed spam and further improves the accuracy of spam recognition.

2. Based on a study of the FSVM classification method, we present a novel approach that combines LSA with FSVM, using the FSVM to analyze the characteristics of Chinese spam. The membership function is usually chosen mainly from the distance between a sample and the center of its class, which neglects how closely the sample blends with the class. To meet the particular requirements of Chinese spam identification, the degree of integration between the sample and its class is introduced to extend the membership function, building on the original distance-based definition, so that FSVM better fits the requirements of Chinese spam filtering. Detailed experiments validate the effectiveness of this method for spam recognition.

3. To filter spam efficiently and accurately without much prior knowledge, fuzzy cluster analysis, which is widely used in text classification, is adopted for unsupervised spam recognition. To handle large-scale data with fuzzy cluster analysis, factor analysis is proposed to reduce the features of the message contents in the sample set while retaining the semantic information of the original messages. This lowers the complexity of the fuzzy clustering and makes subsequent processing easier. In experiments on a selected e-mail test set, DCAFEM is used to cluster the mail after the samples are preprocessed. It computes the center of each cluster and uses these centers for spam identification. The experiments confirm that the method greatly improves the accuracy of spam filtering and the ability to identify unknown spam.

4. By analyzing the shortcomings of feature selection in content-based spam filtering, we study how to select the best feature points in the mail feature space for mail classification, which reduces the space complexity of the spam filtering methods and improves the accuracy of spam recognition. We use membership weights and the feature points belonging to the sample sets to define the degree of distinction between categories, so as to eliminate noisy features and improve spam filtering performance. Game theory is used to build a spam feature selection model and select the best feature subset from the sample sets, which reduces the number of features while ensuring that the selected feature points fully reflect the message content. It also improves the recognition efficiency of the spam filtering method. Experimental results on the CCERT Data Sets of Chinese Emails (CDSCE) corpus show that the method significantly improves mail filtering performance.
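
The e-mail fingerprint idea from point 1 can be illustrated with a small Python sketch. It is not the dissertation's method: the abstract says the fingerprint builds on the LSA analysis, whereas this simplified version hashes a normalized copy of the message body directly. The names mail_fingerprint and FingerprintFilter, the normalization rule, and the repeat threshold are all hypothetical and chosen only for illustration.

```python
import hashlib
import re


def mail_fingerprint(body: str) -> str:
    """Return an MD5 hex digest of the normalized message body.

    Assumption: only the body is hashed, so a changed subject or
    sender address does not change the fingerprint.
    """
    # Collapse whitespace and lowercase so trivial edits hash identically.
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class FingerprintFilter:
    """Flags a message once its fingerprint has been seen `threshold` times."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # hypothetical cut-off for mass mail
        self.counts = {}            # fingerprint -> number of sightings

    def is_mass_mail(self, body: str) -> bool:
        fp = mail_fingerprint(body)
        self.counts[fp] = self.counts.get(fp, 0) + 1
        return self.counts[fp] >= self.threshold
```

Because the subject and sender never enter the hash, copies of one mass mailing that differ only in those fields map to the same fingerprint and are counted together.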
Keywords/Search Tags: Chinese Spam Filtering, Latent Semantic Analysis, Message-Digest Algorithm 5, Fuzzy Support Vector Machines, Factor Analysis, Fuzzy Cluster Analysis, Email Feature Selection