Research On Spam Behavior Patterns And Recognition Methods

Posted on:2010-09-02

Degree:Doctor

Type:Dissertation

Country:China

Candidate:M Z Wang

Full Text:PDF

GTID:1118360302471157

Subject:Computer system architecture

Abstract/Summary:

E-mail has become one of the most common means of modern communication. However, weaknesses in SMTP (Simple Mail Transfer Protocol), especially the lack of authentication and control over e-mail senders, have allowed spam to flood the network.

Spam filtering is a complex research problem. Although much research has been done and many results have been obtained, no existing technique can filter all spam. As camouflage techniques have developed, spam has become harder to recognize, which leads to a higher false positive rate for content-based filtering. Content-based filters also spend considerable time processing the large volume of suspected spam. New methods and algorithms are therefore needed.

A framework for a spam filtering system based on mining behavior patterns is proposed. Behavior features are extracted from the collected data and divided into session features, message header features, and statistical features. A feature selection algorithm chooses the features that effectively predict the class attribute of the training data, and after data preprocessing, rules that determine spam behavior are mined from the training data.

A multi-level model for mining spam behavior patterns is proposed. Different pattern mining algorithms are used for different types of behavioral features. For session features at the MTA (Mail Transport Agent) stage, a decision tree is used to recognize spammers' behavior. Because it mines behavior patterns from features in the conversation, it does not need to receive the entire message, so spam can be filtered early in the session. A histogram distance method is applied to users' sending behavior to detect abnormal sending. Fingerprint features and statistical features of attachments are computed to form a feature vector, and a Support Vector Machine (SVM) is used to model attachment behavior.
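
The histogram distance idea for sending behavior can be illustrated with a minimal sketch. The abstract does not say which features or which distance the thesis uses, so everything here is an assumption made for illustration: the feature is the hour of day at which a user sends mail, the distance is half the L1 distance between normalized histograms, and the names `hourly_histogram`, `histogram_distance`, `is_abnormal` and the threshold of 0.5 are hypothetical.

```python
from collections import Counter


def hourly_histogram(send_hours, bins=24):
    """Normalized histogram of send times (hour of day, 0..bins-1)."""
    counts = Counter(h % bins for h in send_hours)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * bins
    return [counts.get(b, 0) / total for b in range(bins)]


def histogram_distance(p, q):
    """Half the L1 distance between two normalized histograms.

    The value is 0 for identical histograms and 1 for histograms
    with no overlapping bins.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def is_abnormal(baseline_hours, recent_hours, threshold=0.5):
    """Flag recent behavior whose histogram is far from the user's baseline.

    The threshold is an illustrative assumption, not a value from the thesis.
    """
    baseline = hourly_histogram(baseline_hours)
    recent = hourly_histogram(recent_hours)
    return histogram_distance(baseline, recent) > threshold


# Hypothetical data: a user who normally sends during office hours,
# followed by a burst of messages at 03:00.
baseline_hours = [9, 10, 11, 14, 15, 16] * 5
burst_hours = [3] * 40
print(is_abnormal(baseline_hours, burst_hours))
```

In this sketch the burst shares no bins with the baseline, so the distance reaches its maximum and the sender is flagged. A real system would likely use richer features (volume, recipient counts, message sizes) and a tuned threshold.
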
URL (Uniform Resource Locator) similarity is calculated between URLs, and similar URLs are grouped into URL cliques. The minimum distance between a sample and the URL cliques is converted into a confidence level, which serves as the classifier output for determining spam behavior.

A collaborative filtering model based on the Bayesian algorithm is proposed; it correlates the results of the various models. Because traditional Bayesian spam filtering does not consider the differing losses caused by false negatives and false positives, an improved Bayesian algorithm is proposed. It introduces a loss factor that minimizes the risk of false positives without reducing filtering accuracy. With an appropriate loss factor, both the accuracy rate and the recall rate can be improved. Compared with the individual attachment model, user sending behavior model, and URL model, the improved Bayesian combined model greatly improves filtering ability.

A classification method based on a fuzzy decision tree is proposed. Because perfectly crisp attributes do not always exist in the real world, degrees of attribute membership describe behavioral characteristics more naturally and reasonably, so a fuzzy decision tree is more suitable than a crisp one. The fuzzy decision tree algorithm extends the range of application of decision trees and can handle uncertainty. It can deal with imprecise information during learning and inference, which gives it stronger classification ability and robustness. It can generate rules at different levels and with different confidence degrees, providing decision makers with fuller information.

By combining behavior-based pattern recognition with other e-mail filtering techniques, the filtering system MailGate is designed and implemented.
Experiments show that the system achieves good results for both the recall rate and the false positive (FP) rate of spam filtering.
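
The role of a loss factor in a Bayesian combination of sub-model outputs can be sketched as follows. This is not the thesis's algorithm, whose exact form the abstract does not give. It assumes a naive Bayes combination of binary verdicts from the sub-models together with the standard cost-sensitive rule of flagging a message only when P(spam | x) / P(ham | x) exceeds the loss factor. The function names, sub-model names, and all probabilities are hypothetical.

```python
import math


def spam_log_odds(verdicts, rates, prior_spam=0.5):
    """Naive Bayes log-odds of spam given binary verdicts from sub-models.

    rates[name] = (P(flag | spam), P(flag | ham)); values are hypothetical.
    """
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for name, flagged in verdicts.items():
        p_spam, p_ham = rates[name]
        if flagged:
            log_odds += math.log(p_spam / p_ham)
        else:
            log_odds += math.log((1.0 - p_spam) / (1.0 - p_ham))
    return log_odds


def is_spam(verdicts, rates, loss_factor=9.0, prior_spam=0.5):
    """Flag as spam only if P(spam | x) / P(ham | x) > loss_factor.

    A loss factor above 1 treats a false positive as more costly than a
    false negative, so more evidence is needed before mail is blocked.
    """
    return spam_log_odds(verdicts, rates, prior_spam) > math.log(loss_factor)


# Hypothetical flag rates for the four behavior models named in the abstract.
rates = {
    "session": (0.9, 0.1),
    "sending": (0.7, 0.2),
    "attachment": (0.6, 0.1),
    "url": (0.8, 0.05),
}
# A borderline message: two models flag it, two do not.
verdicts = {"session": True, "sending": True, "attachment": False, "url": False}

print(is_spam(verdicts, rates, loss_factor=1.0))  # plain Bayes decision
print(is_spam(verdicts, rates, loss_factor=9.0))  # cost-sensitive decision
```

For this borderline message the posterior odds are roughly 2.9 to 1, so the plain decision (loss factor 1) flags it while the cost-sensitive decision (loss factor 9) lets it through. This is the sense in which a loss factor trades a small amount of recall for a lower risk of false positives.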

Keywords/Search Tags:

spam, behavior recognition, data mining, decision tree, fuzzy decision tree, Naive Bayes, collaborative filtering, SVM

Related items

1	Research On Filtering Technology Of Spam Communication Behavior Detection Based On Decision Tree Algorithm
2	Research On The Filtering Of Spam Based On Behavior Recognition
3	Research And Application Of Behavior Recognition Technology In Anti-Spam System
4	The Research On An Improved Algorithm For Incremental Induction Of Decision Tree
5	Research On Personal Credit Evaluation Based On Decision Tree Integration Algorithm
6	Research And Implementation Of Spam Pages Filtering Based On Bayesian And Decision Tree Algorithms
7	The Research And Application Of Decision Tree Based On Fuzzy Theory
8	Research On Hybrid Classification Based On Naive Bayes And Decision Tree
9	Application Of Various Classification Methods In Spam Message Recognition
10	Comparing Classifiers In Data Mining