
The Design Of A Spam Filtering System Based On The Topic Model

Posted on: 2019-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: P Yang
GTID: 2358330542984562
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
As a means of transmitting information, e-mail benefits from being convenient, fast, and inexpensive. It has a large user base, especially in enterprises, schools, and government departments, many of which have integrated e-mail into their own OA systems. However, the large volume of spam has caused a series of problems. For mail service providers, spam not only occupies a large amount of network bandwidth and server storage space, but also increases server processing time. For users, reading spam is not only a waste of time; its content may also expose them to potential danger. It is therefore very important to study effective techniques for detecting and filtering spam.

Firstly, this thesis studies text representation models and analyzes the principles, advantages, and disadvantages of the Boolean Model and the Vector Space Model. Building on these traditional text representation models, it introduces Word2vec, a text representation model based on semantic analysis. Because the LDA model can generate topic information for texts, a new feature extraction algorithm for mail text is proposed based on the Word2vec and LDA models. The features extracted by this algorithm contain deep information about words, such as semantics, grammar, and position, and these features are more effective for text classification.

Secondly, the traditional KNN algorithm is improved so that neighbor samples are selected only from texts whose topic is similar to that of the test sample. This effectively addresses the high time complexity of the KNN algorithm when the sample size is large. In addition, the traditional SVM algorithm is optimized by introducing the MGD algorithm and a string kernel function into the model, which not only solves the problem that the parameters of the traditional model may be trapped in a local optimum, but also accelerates the convergence of the model. The experimental results show that the improved KNN and SVM algorithms significantly improve accuracy and recall.

Finally, a mail filtering system is developed based on JavaMail, and the topic-model-based mail filtering algorithm is ported into this system. The mail system provides basic functions such as sending, receiving, and searching e-mail, as well as advanced functions such as spam detection and intelligent classification of mail. Compared with existing mail systems, it not only improves the accuracy of spam detection, but also automatically classifies mail according to its content, which makes the mail easier for the user to read.
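
The topic-restricted KNN idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how a text's topic is assigned or which similarity measure is used, so the sketch assumes each sample carries a single dominant LDA topic id and compares feature vectors by cosine similarity. The names `cosine` and `topic_restricted_knn`, and the fallback to the full training set when no sample shares the query topic, are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_restricted_knn(train, query_vec, query_topic, k=3):
    """Classify a query using only training samples that share its topic.

    train: list of (feature_vector, dominant_topic_id, label) tuples.
    The dominant topic id is assumed to come from an LDA model (assumption).
    """
    # Restrict the neighbor search to samples with the same topic,
    # so far fewer similarity computations are needed on large corpora.
    candidates = [s for s in train if s[1] == query_topic]
    if not candidates:
        # Assumed fallback: no sample shares the topic, so search everything.
        candidates = train

    # Rank the candidates by similarity to the query and keep the top k.
    ranked = sorted(candidates, key=lambda s: cosine(s[0], query_vec), reverse=True)
    votes = Counter(label for _, _, label in ranked[:k])

    # Majority vote among the k nearest neighbors.
    return votes.most_common(1)[0][0]
```

With N training samples spread over T topics, each query is compared against roughly N/T samples instead of N, which is where the reduction in running time would come from under these assumptions.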
Keywords/Search Tags: spam filtering, topic model, text modeling, Word2vec, LDA