Research On Spam Behavior Patterns And Recognition Methods

Posted on: 2010-09-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Z Wang
Full Text: PDF
GTID: 1118360302471157
Subject: Computer system architecture
Abstract/Summary:
E-mail has become one of the most common means of modern communication. However, weaknesses in SMTP (Simple Mail Transfer Protocol), especially the lack of authentication and control of e-mail senders, have allowed spam to flood the network. Spam filtering is a complex research problem. Although much research has been done and many results have been obtained, there is still no technical solution that filters all spam. As camouflage techniques have developed, spam has become harder to recognize, which leads to a higher false positive rate for content-based filtering. Content-based filters also spend a great deal of time processing large volumes of suspected spam. New methods and algorithms are therefore needed.

A framework for a spam filtering system based on mining behavior patterns is proposed. Behavior features are extracted from the collected data and divided into session features, message header features, and statistical features. A feature selection algorithm chooses the features that effectively predict the class attribute of the training data, and after data preprocessing, rules that determine spam behavior are mined from the training data.

A multi-level model for mining spam behavior patterns is proposed. Different pattern mining algorithms are used for different types of behavior features. For session features at the MTA (Mail Transfer Agent) stage, a decision tree is used to recognize spammers' behavior. Because it mines behavior patterns from features of the conversation, it does not need to receive the entire message, so spam can be filtered early in the session. A histogram distance method is applied to users' sending behavior to detect abnormal sending. Fingerprint features and statistical features of attachments are computed to form a feature vector, and a Support Vector Machine (SVM) is used to model attachment behavior.
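The histogram distance idea for sending behavior can be illustrated with a minimal sketch. The abstract does not state which features, distance measure, or threshold the dissertation uses, so everything here is an assumption made for illustration: the feature is the hour of day at which a user sends mail, the distance is the L1 distance between normalized histograms, and the names `normalized_histogram`, `l1_distance`, `is_abnormal`, and the threshold of 0.5 are hypothetical.

```python
from collections import Counter


def normalized_histogram(values, bins):
    """Return the relative frequency of each bin in `bins` among `values`."""
    counts = Counter(values)
    total = sum(counts[b] for b in bins)
    if total == 0:
        return [0.0] * len(bins)
    return [counts[b] / total for b in bins]


def l1_distance(p, q):
    """L1 distance between two histograms of equal length (range 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_abnormal(baseline, recent, bins, threshold=0.5):
    """Compare recent sending behavior with a user's baseline.

    `threshold` is an assumed value, not one taken from the dissertation.
    Returns (flag, distance).
    """
    distance = l1_distance(
        normalized_histogram(baseline, bins),
        normalized_histogram(recent, bins),
    )
    return distance > threshold, distance


# Hypothetical data: hour of day (0-23) of each message a user sent.
HOURS = list(range(24))
baseline_hours = [9, 10, 10, 11, 14, 15, 16, 9]   # normal office-hours sending
burst_hours = [3] * 50                             # sudden burst at 03:00

flag, distance = is_abnormal(baseline_hours, burst_hours, HOURS)
```

In this toy example the burst shares no bins with the baseline, so the distance reaches its maximum and the behavior is flagged. A real deployment would need a threshold tuned on labeled traffic.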
URL (Uniform Resource Locator) similarity is calculated between URLs, and similar URLs are grouped into URL cliques. The minimum distance between a sample and the URL cliques is converted into a confidence level, which serves as the classifier output for determining spam behavior.

A collaborative filtering model based on a Bayesian algorithm is proposed to correlate the results of the individual models. Because traditional Bayesian spam filtering does not take into account the different losses caused by false negatives and false positives, an improved Bayesian algorithm is proposed. It introduces a loss factor to minimize the risk of false positives without reducing filtering accuracy. With an appropriate loss factor, both the accuracy rate and the recall rate can be improved. Compared with the single models (the attachment model, the user sending behavior model, and the URL model), the improved Bayesian combining model greatly improves filtering ability.

A classification method based on a fuzzy decision tree is proposed. Because attributes in the real world are not always crisp, membership degrees of attributes describe behavior characteristics more naturally and reasonably, so a fuzzy decision tree is more suitable than a crisp decision tree. The fuzzy decision tree algorithm broadens the scope of application of decision trees and can handle uncertainty. It can deal with inexact information during learning and inference, with stronger classification ability and robustness. It can generate rules at different levels and with different confidence degrees, giving decision makers more complete information.

By combining behavior-based pattern recognition with other e-mail filtering techniques, the filtering system MailGate is designed and implemented. Experiments show that it achieves good results in both recall rate and false positive (FP) rate.
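The role of a loss factor in a Bayesian combining model can be sketched as follows. The abstract does not give the dissertation's formula, so this uses a common cost-sensitive decision rule as a stand-in: a message is treated as spam only when the posterior probability of spam exceeds the loss factor times the posterior probability of legitimate mail. The model names, the likelihood values, and the functions `posterior_spam` and `classify` are all hypothetical.

```python
import math


def posterior_spam(votes, likelihoods, prior_spam=0.5):
    """Naive Bayes posterior P(spam | votes of the individual models).

    votes:       dict model name -> bool (True if that model flagged the message)
    likelihoods: dict model name -> (P(flag | spam), P(flag | ham)),
                 each strictly between 0 and 1
    """
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for name, flagged in votes.items():
        p_flag_spam, p_flag_ham = likelihoods[name]
        log_spam += math.log(p_flag_spam if flagged else 1.0 - p_flag_spam)
        log_ham += math.log(p_flag_ham if flagged else 1.0 - p_flag_ham)
    # Normalize in log space to avoid underflow.
    top = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - top)
    e_ham = math.exp(log_ham - top)
    return e_spam / (e_spam + e_ham)


def classify(votes, likelihoods, loss_factor=9.0, prior_spam=0.5):
    """Return True (spam) only if P(spam) > loss_factor * P(ham).

    A loss factor above 1 means a false positive is considered that many
    times more costly than a false negative (assumed rule, not the author's).
    """
    p = posterior_spam(votes, likelihoods, prior_spam)
    return p > loss_factor * (1.0 - p)


# Hypothetical per-model likelihoods, as if estimated from training data.
LIKELIHOODS = {
    "attachment": (0.8, 0.1),
    "sending": (0.7, 0.2),
    "url": (0.9, 0.05),
}

# Two of three models flag the message: a borderline case.
borderline = {"attachment": True, "sending": True, "url": False}
blocked_without_loss = classify(borderline, LIKELIHOODS, loss_factor=1.0)
blocked_with_loss = classify(borderline, LIKELIHOODS, loss_factor=9.0)
```

With these made-up numbers the borderline message is blocked when both error types are weighted equally, but passed when false positives are weighted nine times more heavily, which is the trade-off the loss factor is meant to control.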
Keywords/Search Tags: spam, behavior recognition, data mining, decision tree, fuzzy decision tree, Naive Bayes, collaborative filtering, SVM