With the rapid development of the Internet, the number of Internet users in China has reached 338 million, ranking first in the world. At the same time, the ever-growing spam problem has aroused widespread concern, because it not only disrupts people's lives and work but also causes huge losses to the economy and society. A variety of web spam detection and filtering methods have therefore emerged and developed rapidly. However, increasingly common and constantly updated spam obfuscation strategies and attack techniques seriously undermine the effectiveness and practicality of these detection methods.

Currently, without impairing readability for humans, spam often hides its sensitive features by inserting, replacing, encoding, or otherwise modifying the message content, which interferes with the feature mining and selection processes of spam filters. As a result, some detection methods that are effective in principle achieve low accuracy in practical applications. Moreover, spam messages form a dynamic data stream, yet most traditional feature selection and spam detection methods treat the task as static text classification and therefore cannot promptly reflect features that change over time. In addition, spam detection is a large-scale, real-time data processing task. Although some machine-learning-based methods achieve good accuracy in spam detection, they are of limited use in practice because of efficiency bottlenecks in model updating and rapid detection.

The evolution of spam forms and obfuscation technologies suggests that the growing spam problem can be addressed not only by integrating multiple detection methods but also by improving them through innovation and by fully mining spam features that keep pace with the times.
Meanwhile, spam detection technology also needs to balance accuracy and efficiency in order to meet the requirements of large-scale data processing in practical applications.

By tracing the latest progress in spam detection technologies, this dissertation presents a comprehensive overview of the state of web spam detection technology and compares the most popular spam detection approaches. Based on these comparisons, we identify the problems in spam detection that most urgently need to be solved. To address them, this dissertation carries out in-depth research and innovation on the key technologies of spam detection from both theoretical and applied perspectives, building on the latest research in this area. The main research results and innovations can be summarized as follows:

(1) To counter current spam obfuscation techniques, we summarize the hidden behavioral features in email headers through statistical analysis of large-scale real email. On this basis, a novel header-enhanced feature selection method is proposed that combines these header features with content features represented as fingerprint vectors. Experimental results show that this feature selection and representation method, when implemented in a Bayesian filter, strengthens the filter's ability to detect spam variants. Because the computation involved is quite simple, the method is suitable for large-scale application.

(2) We propose treating spam messages as a dynamic data stream. Considering the life cycle and usage frequency of spam features, a novel feature selection method for spam detection based on statistical time series is designed. This method not only effectively reduces redundant features but also better reflects how features change dynamically over time.
In addition, based on a time series prediction model, a dynamic tuning method for the filter threshold is proposed so that the threshold tracks the scale of spam and adapts to periods of differing spam intensity. This time-series-based feature selection method helps improve the filter's ability to detect new spam variants, reduces storage space, and increases running efficiency in spam detection.

(3) Naive Bayes is the most popular method in spam detection. Its core derives from posterior probability under the assumption that all attributes are independent of one another, which does not match the reality of spam. A new detection method called SAODE (Structural-AODE), based on Averaged One-Dependence Estimators, is proposed to weaken this assumption and to construct new attributes for the Bayesian method according to the structural features of spam. We also propose an optimized feature selection method based on the class-conditional distribution and an active learning model based on maximal and minimal entropy. These optimizations ensure low time consumption and high accuracy. Experimental results show that the SAODE method further improves on the accuracy and efficiency of traditional Naive Bayes methods.

(4) To address the two key problems of high complexity and low efficiency when the Support Vector Machine (SVM) is used for spam detection, an online SVM detection method based on the Sequential Minimal Optimization (SMO) decomposition algorithm is proposed. This method introduces a risk-based supervisory training model so that its parameters can be adjusted adaptively, and it achieves a cost-sensitive SVM learning mechanism according to specific rules.
Together, these strategies improve the efficiency of SVM-based spam detection in large-scale practical applications while maintaining its accuracy.

All of the above methods were evaluated on the standard spam datasets from TREC 2007, SEWM 2008, and CEAS 2008, and as entries in several domestic and international spam filter evaluations, where they were compared with popular spam filters such as bogofilter. The evaluation results show that the proposed improvements and innovations more effectively resolve key problems encountered in practical spam detection, such as spam obfuscation and running efficiency.
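
For reference, the plain Averaged One-Dependence Estimators (AODE) classifier that underlies contribution (3) can be sketched as follows. This is a minimal illustration of standard AODE only, not of the SAODE method itself: the class name `AODE`, the parameters `min_count` and `alpha`, the Laplace smoothing, the class-prior fallback, and the toy data are all assumptions made for illustration. Each attribute value in turn acts as a "parent" on which the other attributes depend, and the resulting one-dependence estimates are averaged, which relaxes the Naive Bayes independence assumption.

```python
from collections import Counter


class AODE:
    """Plain AODE for categorical features (illustrative sketch, not SAODE)."""

    def __init__(self, min_count=1, alpha=1.0):
        self.min_count = min_count  # a value must occur this often to act as a parent
        self.alpha = alpha          # Laplace smoothing constant (an assumption)

    def fit(self, X, y):
        self.n = len(X)
        self.classes = sorted(set(y))
        self.class_count = Counter(y)
        n_features = len(X[0])
        # Distinct values observed for each feature, used in smoothing.
        self.values = [sorted({row[j] for row in X}) for j in range(n_features)]
        self.value_count = Counter()  # (i, x_i) -> frequency
        self.pair = Counter()         # (c, i, x_i) -> frequency
        self.triple = Counter()       # (c, i, x_i, j, x_j) -> frequency
        for row, c in zip(X, y):
            for i, xi in enumerate(row):
                self.value_count[(i, xi)] += 1
                self.pair[(c, i, xi)] += 1
                for j, xj in enumerate(row):
                    if j != i:
                        self.triple[(c, i, xi, j, xj)] += 1
        return self

    def joint_scores(self, row):
        """Return an estimate of P(c, x) for every class c."""
        a, k = self.alpha, len(self.classes)
        scores = {}
        for c in self.classes:
            total, parents = 0.0, 0
            for i, xi in enumerate(row):
                if self.value_count[(i, xi)] < self.min_count:
                    continue  # value too rare to serve as a parent
                parents += 1
                # Smoothed P(c, x_i)
                p = (self.pair[(c, i, xi)] + a) / (self.n + a * k * len(self.values[i]))
                # Times the product over j != i of smoothed P(x_j | c, x_i)
                for j, xj in enumerate(row):
                    if j != i:
                        p *= (self.triple[(c, i, xi, j, xj)] + a) / (
                            self.pair[(c, i, xi)] + a * len(self.values[j])
                        )
                total += p
            if parents:
                scores[c] = total / parents
            else:
                # Simplified fallback when no parent qualifies: smoothed class prior.
                scores[c] = (self.class_count[c] + a) / (self.n + a * k)
        return scores

    def predict_proba(self, row):
        scores = self.joint_scores(row)
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    def predict(self, row):
        scores = self.joint_scores(row)
        return max(scores, key=scores.get)


# Toy data (hypothetical): the label is the XOR of two binary attributes,
# a pattern in which neither attribute is informative on its own.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [0, 1, 1, 0] * 5
model = AODE().fit(X, y)
predictions = [model.predict(row) for row in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

On this toy data each single attribute is uninformative about the label, so a classifier that assumes full attribute independence cannot separate the classes, whereas conditioning on one parent attribute recovers the interaction. The sketch stores pairwise counts, so memory grows quadratically with the number of attribute values; a production filter would need the kind of feature selection described in contribution (3).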