Design And Implementation Of Content Based Spam Filtering System

Posted on:2019-07-21

Degree:Master

Type:Thesis

Country:China

Candidate:F Tao

Full Text:PDF

GTID:2348330542455583

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of the Internet, e-mail has become part of daily life because it is simple to use and fast. However, many organizations and individuals now exploit e-mail to send large volumes of spam, and the problem keeps growing more serious: spam not only consumes a great deal of network bandwidth but also wastes a great deal of Internet users' time. Spam filtering is therefore an important practical requirement.

The classification algorithms most commonly used for spam are Naive Bayes, neural networks, K-nearest neighbors, and support vector machines (SVM). Because these classifiers operate on extracted features, feature extraction directly affects classification quality. According to prior research, effective methods for extracting features from e-mail content include document frequency, information gain, mutual information, expected cross entropy, weight of evidence for text, the CHI statistic, and TFIDF. TFIDF is widely used because it is easy to understand, simple to compute, and has low time complexity, but it also has shortcomings. It considers only the absolute number of feature words and their frequency in certain types of e-mail; it does not consider how feature words are distributed within a class or how often they occur in other types of mail. As a result, the effect of low-frequency words is overestimated and the effect of high-frequency words is underestimated.

This thesis studies and compares existing spam filtering techniques from the perspectives of mail preprocessing, Chinese word segmentation, feature extraction, and classifiers. After comparing several feature extraction algorithms, it modifies and optimizes the traditional TFIDF algorithm. To reduce the influence of feature words that appear frequently only in special-case mail, the thesis introduces a frequency difference and analyzes how the weights of frequently occurring terms and of low-frequency terms increase and decrease. The final experimental results show that the improved method selects a more suitable feature set, which improves mail classification and yields more effective spam filtering.
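
The shortcoming of TFIDF described above can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so the code below assumes the common textbook variant, weight = tf × log(N / df), and it does not reproduce the thesis's improved weighting. The function name `tfidf`, the toy corpus, and the labels are all hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed textbook variant: (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: two spam messages and two legitimate ("ham") ones.
docs = [
    ["free", "prize", "click", "free"],    # spam
    ["free", "offer", "click"],            # spam
    ["meeting", "report", "offer"],        # ham
    ["meeting", "schedule", "report"],     # ham
]
labels = ["spam", "spam", "ham", "ham"]  # never consulted by tfidf()

weights = tfidf(docs)
# "click" occurs only in spam; "offer" occurs once in spam and once in ham.
# Both appear in two documents, so they get the same weight in docs[1].
print(weights[1]["click"], weights[1]["offer"])
```

In this sketch, "click" is a much better spam indicator than "offer", yet the two receive identical weights because the formula never looks at the class labels. This is the kind of class-blind weighting the abstract identifies as a weakness.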

Keywords/Search Tags:

Mail filtering, Word frequency, Feature extraction, Classifier, Weight

Related items

1	Research On Chinese Spam Filtering Technology
2	The Research On Intelligent Agent For Mail Server
3	Mail Message Filtering Algorithm
4	SVM-based Spam Filtering
5	Content-based E-mail Filtering System
6	Spam Filtering System Based On Multi-feature Fusion
7	Design And Implementation Of Mail Filtering System Based On Text Mining
8	Application And Research Of Information Filtering Technology In Website Information Supervision
9	Research On Weighted Bayesian Mail Filtering Method
10	Research On Discriminant Feature Extraction Of Human Face And Classifier Design