Research on Spam Message Identification

Posted on: 2019-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: W W Li
GTID: 2428330611472439
Subject: Applied statistics
Abstract/Summary:
With the development of communication tools, SMS has become increasingly popular because of its low cost, convenient delivery, and mobility. While it has brought great convenience, spam messages have become an increasingly serious problem. Advertising, fraudulent messages, rumors, and similar content seriously harm public security. However, the laws and regulations governing spam messages are still incomplete. Many unscrupulous organizations exploit legal gray areas to collect mobile phone information and sell it to businesses and individuals who want it, which seriously endangers users' information security and disrupts their daily lives. Users receive all kinds of sales promotions, prize-winning notices, scams, and similar messages, which is tiresome. To address these problems, this paper proposes a more accurate method for identifying spam messages.

In this paper, 800 thousand SMS records are analyzed and modeled. An identification model based on the text content of each message is built to identify spam accurately and thereby solve the spam filtering problem. First, analysis of the text content shows that two factors strongly affect classification accuracy: the imbalanced class distribution, and the large amount of meaningless information in the text, such as repeated characters, desensitization (masking) characters, and "Martian" script (nonstandard internet characters). The class imbalance is therefore handled by simple random sampling and undersampling. The text is compressed by removing repeated and desensitization characters, segmented with a Chinese word segmenter extended by a custom dictionary, and stripped of stop words, so that useless information is filtered out. Next, word-frequency matrices are constructed for spam and normal messages, and word clouds give a preliminary visualization of the data, showing the important components of spam and normal messages intuitively and making the differences between them easy to compare. Finally, the LDA topic analysis algorithm is applied to the spam messages, and several categories of spam are summarized.

For classification, the KNN algorithm was tried first, but its accuracy and classification performance were not satisfactory. After a review of the literature, the naive Bayes model (NBM) was then applied and improved, and predictions were obtained for out-of-sample data. The overall accuracy is above 95%, so spam messages are classified accurately. The main innovation of this paper is the improved naive Bayes model: additive smoothing (Laplace smoothing) solves the zero-probability problem, and taking logarithms of the conditional probabilities, so that their product becomes a sum, avoids the numerical underflow that comes from multiplying many probabilities together. This greatly improves classification accuracy. The predictions obtained with this method can serve as a reference for operators in detecting spam messages accurately and can help relieve users of the nuisance.
Keywords/Search Tags: Naive Bayes, Word Cloud, Word Segmentation, LDA topic analysis algorithm, Additive smoothing
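
The two improvements named in the abstract, additive smoothing and summing log-probabilities instead of multiplying probabilities, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `train` and `predict`, the smoothing constant `alpha=1.0`, and the toy English tokens are all assumptions made for illustration, and the messages are assumed to be already segmented into words.

```python
import math
from collections import Counter

def train(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on pre-tokenized messages.

    alpha is the additive (Laplace) smoothing constant; alpha=1.0 is an
    assumed default, not a value taken from the thesis.
    """
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c, n_c in class_docs.items():
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        model["log_prior"][c] = math.log(n_c / len(labels))
        # Adding alpha to every count keeps unseen (word, class) pairs
        # from getting probability zero, whose logarithm is undefined.
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return model

def predict(model, doc):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        # Sum of logs replaces the product of probabilities, so long
        # messages do not underflow. Out-of-vocabulary words are skipped.
        scores[c] = log_prior + sum(
            model["log_lik"][c][w] for w in doc if w in model["vocab"]
        )
    return max(scores, key=scores.get)

# Hypothetical toy data standing in for segmented SMS messages.
docs = [
    ["win", "prize", "now"],
    ["free", "prize", "click"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train(docs, labels)
```

With this toy model, `predict(model, ["free", "prize"])` selects the spam class even though "free" never appears in a normal message, because smoothing gives that pair a small nonzero probability. A message of 2000 repeated tokens is also scored without trouble, whereas the direct product of 2000 probabilities would round to zero in floating point.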