Design And Implementation Of Content Based Spam Filtering System

Posted on: 2019-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: F Tao
GTID: 2348330542455583
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, e-mail has become part of daily life because it is simple to use and fast. However, many organizations and individuals now exploit this convenience to send large volumes of spam. The spam problem is becoming increasingly serious: it takes up a great deal of network bandwidth and wastes a great deal of Internet users' time. There is therefore a strong practical need for spam filtering. The classification algorithms most commonly used for spam are Naive Bayes, neural networks, K-nearest neighbors, support vector machines (SVM) and so on. Because these classifiers operate on extracted features, the quality of feature extraction directly affects the classification of mail. According to existing research, effective algorithms for extracting features from e-mail content include document frequency, information gain, mutual information, expected cross entropy, weight of evidence for text, CHI statistics and TFIDF. TFIDF is widely used because it is easy to understand, simple to implement and low in time complexity, but it also has shortcomings. It considers only the absolute count of a feature word and its frequency within a given class of e-mail. Because it ignores how the feature word is distributed within a class and how often it occurs in other classes of mail, it overestimates the effect of low-frequency words and underestimates the effect of high-frequency words. This thesis studies and compares existing spam filtering techniques from the perspectives of mail preprocessing, Chinese word segmentation, feature extraction and classifiers. After comparing several feature extraction algorithms, the thesis chooses to modify and optimize the traditional TFIDF algorithm. To reduce the influence of feature words that appear frequently only in special-case mail, the thesis introduces a frequency difference and analyzes how the weights of frequently occurring entries and low-frequency entries increase and decrease. The final experimental results show that the improved method selects a more suitable feature set, which improves mail classification and makes spam filtering more effective.
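For reference, the traditional TFIDF weighting that the abstract takes as its starting point can be sketched as below. This is a minimal illustration of the classic formula only (term frequency times the logarithm of inverse document frequency), not the thesis's improved method, whose frequency-difference term is not given in the abstract. The function name `tfidf` and the toy messages are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Classic TF-IDF for tokenized documents: tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Relative term frequency in this document times inverse document frequency.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented messages (toy data, not from the thesis).
docs = [
    ["free", "prize", "click", "free"],
    ["meeting", "schedule", "click"],
    ["free", "offer", "prize"],
]
weights = tfidf(docs)
```

Note that the document frequency here is counted over the whole corpus without regard to class labels, so the weight carries no information about how a word is distributed between spam and legitimate mail. That is the kind of limitation the abstract describes.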
Keywords/Search Tags:Mail filtering, Word frequency, Feature extraction, Classifier, Weight