Content-based Indexing Of Spam Filter Research And Implementation

Posted on:2012-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

GTID:2178330341950169

Subject:Computer application technology

Abstract/Summary:

In the information age, spam is regarded as one of the most effective and inexpensive forms of Internet advertising, and some speculators also use it to distribute illegal information. Spam greatly interferes with people's daily lives and consumes their time and energy. It has more serious effects as well: it creates hidden information security risks, damages the market image of ISPs (Internet Service Providers), causes the loss of intangible assets, and poses a growing threat to network resources and network security.
This thesis mainly studies the Mail User Agent filter (the client-side filter). The author extends the spam filtering function of the Microsoft Outlook mail client to support user-defined settings. Because spam is currently spread mainly in text form, the thesis focuses on content-based spam filtering. The process divides into two phases, training and classification, which together comprise five main steps: e-mail preprocessing, text representation, feature selection, classification prediction, and evaluation of filter quality. The thesis concentrates on feature selection and classification prediction, the core steps of a spam filter.
First, we analyze eight common feature selection methods: document frequency, information gain, mutual information, the chi-square (CHI) statistic, expected cross entropy, weight of evidence for text, odds ratio, and relevance score. Second, a close study of the mutual information method shows that when feature words appear in only one class, their mutual information values are all equal, so the importance of these feature words cannot be distinguished. Starting from this observation, an improved mutual information method is proposed. The new method uses an adjusted TF-IDF weight function to balance the importance of the feature words.
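The weakness of mutual information noted above can be illustrated with a short sketch. The abstract does not give the formula, so this assumes the pointwise definition commonly used in text categorization, MI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts. The function name and the corpus sizes are hypothetical, and the thesis's improved method is not reproduced here.

```python
import math

def pointwise_mi(docs_with_term_in_class, docs_with_term, docs_in_class, total_docs):
    """Assumed definition: MI(t, c) = log(P(t, c) / (P(t) * P(c))),
    with probabilities estimated from document counts."""
    p_tc = docs_with_term_in_class / total_docs
    p_t = docs_with_term / total_docs
    p_c = docs_in_class / total_docs
    return math.log(p_tc / (p_t * p_c))

# Hypothetical corpus: 1000 mails, 200 of them spam.
TOTAL, SPAM = 1000, 200

# Two words that occur only in spam: one in 2 mails, one in 150 mails.
mi_rare = pointwise_mi(2, 2, SPAM, TOTAL)
mi_common = pointwise_mi(150, 150, SPAM, TOTAL)

# When a word occurs in one class only, P(t, c) = P(t), so the score
# reduces to log(1 / P(c)) regardless of how often the word occurs.
print(mi_rare, mi_common)
```

Under this definition both words receive the same score, log(1 / P(spam)), even though one is far more frequent than the other. A frequency-sensitive weight such as TF-IDF is one way to separate them.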
Finally, this thesis studies two classification algorithms: the Bayesian classifier and the support vector machine (SVM). The experiments use the standard Ling-Spam mail collection and compare the algorithms along four dimensions: feature selection method, feature dimensionality, classifier, and training set size, with F1 value and false ratio as the evaluation measures. The results show that the improved mutual information method is more stable than the other methods. Lastly, using Microsoft Outlook's external program support, we implement a spam filter system that provides the required filtering functions.
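The evaluation measures can be sketched as follows. The abstract does not define "false ratio"; this sketch assumes it means the false positive rate, that is, the share of legitimate mail wrongly marked as spam. The function name and the counts are hypothetical and are not results from the thesis.

```python
def f1_and_false_positive_rate(tp, fp, fn, tn):
    """Compute F1 and false positive rate, treating spam as the positive class.

    tp: spam correctly flagged      fp: legitimate mail flagged as spam
    fn: spam that was missed        tn: legitimate mail correctly passed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Assumed reading of "false ratio": fraction of legitimate mail misclassified.
    false_positive_rate = fp / (fp + tn)
    return f1, false_positive_rate

# Hypothetical confusion-matrix counts for one experiment run.
f1, fpr = f1_and_false_positive_rate(tp=90, fp=10, fn=10, tn=390)
print(f1, fpr)
```

The two measures are reported together because a filter can score well on F1 and still be unacceptable in practice if it discards too much legitimate mail.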

Keywords/Search Tags:

Spam, Feature Selection, Mutual Information, Classify, Bayesian

Related items

1	Two Feature Selection Algorithms Based On Mutual Information And Bayesian Optimization
2	Research On Feature Selection Algorithm Of Spam Filtering
3	Research On Chinese Spam Filtering Technology Based On Content Mining
4	Research And Implementation Of The Anti-spam System Based On Bayesian Algorithm
5	Research On Content-Based Spam Filtering Technology
6	Study On Spam Filtering Technology Based On IMI-WNB Algorithm
7	The Research On Spam Feature Selection And Detection Method
8	The Research Of Bayesian Classifier And Its Applications
9	Research On Content-Based Spam Filtering Technology
10	Research On Spam Filtering Technology Based On Bayesian Classification