
Research On Key Techniques Of Statistics Based Spam Identification

Posted on: 2016-09-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y W Wang
Full Text: PDF
GTID: 1228330467497548
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of internet technology, email has become an important means of communication because it is convenient and fast. In recent years, however, large volumes of spam have done considerable harm to people's lives and to the internet environment, so handling spam effectively has become an important research subject in the internet security field.

Traditional spam processing methods fall into two types: system-based spam processing and content-based spam identification. Compared with the latter, the former is costly and constrained by its environment, and therefore cannot be widely deployed in the near future. Content-based spam identification classifies emails by analyzing their contents, and can be further divided into rule-based and statistics-based methods. Unlike the rule-based method, the statistics-based method relies on models from probability and statistics theory, which gives it much stronger generalization and personalization abilities.

Spam identification is in fact a typical binary text classification problem. Because the vector space used in traditional text classification is high-dimensional and highly sparse, reducing the dimension of the feature space is a key issue in spam identification. Moreover, spam identification is a typical online application: the labels are determined by the email users, and the same email may be judged differently when a user receives it at different times. On this basis, this dissertation carries out in-depth research on vector space dimensionality reduction and online spam identification. The details are as follows:

1. A hybrid feature selection method based on improved particle swarm optimization (TFSM)

Firstly, an optimal document frequency based feature selection (ODFFS) method is used to select the most discriminative features. Secondly, a novel term frequency based feature selection (NTFFS) method is proposed and combined with the ODFFS method to select the remaining features. To improve the results of the parameter optimization process, a global best oriented particle swarm optimization (GOPSO) method is proposed. The experiments are carried out on the PU2, PU3, Enron-spam and TREC2007 corpora, with support vector machine (SVM) and Naive Bayes (NB) classifiers used for sample classification. The experimental results show that, on the F1 measure, TFSM performs much better than other methods such as information gain, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, and improved term frequency-inverse document frequency.

2. A feature selection method based on two thresholds and improved harmony search (THFS)

Firstly, the ODFFS method is used to select the features whose ODFFS values are greater than a threshold th1. Secondly, an optimal term frequency based feature selection (OTFFS) method is used to select the features whose OTFFS values are greater than a threshold th2. Finally, if more features still need to be selected, the ODFFS and OTFFS methods are combined to select the remaining ones. To search for the optimal values of th1 and th2, the traditional harmony search method is improved: a best harmony considering rate (BHCR) is introduced to address the problem that convergence is very slow when the global best value is close to the actual best value. The experimental results show that, on the F1 measure, THFS performs better than several typical feature selection methods when fuzzy support vector machine (FSVM) and NB classifiers are used on the PU2, CSDMC2010, PU3, Lingspam, Enron-spam and TREC2007 corpora.

3. An online spam identification method based on user interest degree

An SVM-based online spam identification method is proposed. Firstly, by combining the traditional incremental learning method with the active learning method, the most uncertainly classified samples are selected using the random sampling method, and the selected samples are recommended to the user for labeling. Secondly, the concept of user interest degree is introduced, and a novel sample labeling model and a new function for evaluating algorithm performance are proposed. Finally, the user-labeled samples are added to the training set in combination with the "roulette" method. Comparative experiments show that the proposed method achieves high spam identification accuracy, and that sample training and sample selection are fast, which indicates the value of the proposed method for online applications.

4. A quick online spam identification method based on user interest sets

To improve spam identification speed without seriously sacrificing accuracy, a novel quick online spam identification method is proposed. Firstly, the concepts of the user positive interest set and negative interest set are introduced, and emails are classified by combining the user interest sets with SVM. Secondly, based on active learning theory, the sample densities of the different categories and an improved angle diversity method are used to select the most uncertainly classified samples, which are then recommended to users for labeling. Finally, the labeled samples and the samples most likely to have been wrongly classified are put into the training set, and a novel sample value evaluating function is proposed to filter out redundant samples and generate a new training set. Experimental results show that the proposed method imposes a small labeling burden, identifies spam with high accuracy, and runs fast, which demonstrates its value for online applications.
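
The two-threshold selection flow described for THFS can be illustrated with a minimal Python sketch. The abstract does not give the ODFFS or OTFFS formulas, so the scores below are simple stand-ins chosen for illustration only: a document-frequency gap and a relative term-frequency gap between spam and legitimate mail. The function names `score_terms` and `select_features`, the parameter `k`, and the use of a plain sum in the final pass are likewise assumptions, and the harmony search that tunes th1 and th2 is not shown.

```python
from collections import Counter


def score_terms(spam_docs, ham_docs):
    """Return {term: (df_score, tf_score)} using stand-in scores.

    df_score: gap between the fractions of spam and ham documents containing the term.
    tf_score: gap between the term's share of all tokens in spam and in ham.
    Both are illustrative assumptions, not the ODFFS/OTFFS formulas.
    Each document is a non-empty list of tokens; both classes must be non-empty.
    """
    def stats(docs):
        df, tf = Counter(), Counter()
        for doc in docs:
            tf.update(doc)        # every occurrence counts toward term frequency
            df.update(set(doc))   # each document counts once toward document frequency
        return df, tf, len(docs), sum(tf.values())

    s_df, s_tf, s_n, s_tok = stats(spam_docs)
    h_df, h_tf, h_n, h_tok = stats(ham_docs)
    scores = {}
    for term in set(s_df) | set(h_df):
        df_gap = abs(s_df[term] / s_n - h_df[term] / h_n)
        tf_gap = abs(s_tf[term] / s_tok - h_tf[term] / h_tok)
        scores[term] = (df_gap, tf_gap)
    return scores


def select_features(spam_docs, ham_docs, k, th1, th2):
    """Select up to k terms in three passes: DF above th1, TF above th2, then a combined fill-up."""
    scores = score_terms(spam_docs, ham_docs)
    selected = []

    def take(candidates, key):
        # Add candidates in descending score order (ties broken alphabetically) until k is reached.
        for term in sorted(candidates, key=lambda t: (-key(t), t)):
            if len(selected) >= k:
                break
            selected.append(term)

    # Pass 1: terms whose document-frequency score exceeds th1.
    take([t for t in scores if scores[t][0] > th1], lambda t: scores[t][0])
    # Pass 2: still-unselected terms whose term-frequency score exceeds th2.
    take([t for t in scores if t not in selected and scores[t][1] > th2],
         lambda t: scores[t][1])
    # Pass 3: fill any remaining slots using the sum of both scores (an assumed combination).
    take([t for t in scores if t not in selected], lambda t: sum(scores[t]))
    return selected
```

Calling `select_features(spam_docs, ham_docs, k=100, th1=0.3, th2=0.01)` on tokenized emails returns at most 100 terms, ordered by the pass in which they were chosen. The threshold values here are arbitrary placeholders; in the method summarized above they are searched for by the improved harmony search.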
Keywords/Search Tags: Spam identification, Feature selection, Document frequency, Term frequency, Online identification, Incremental learning, Active learning