Research And Research On Spam Message Identification

Posted on:2019-07-21

Degree:Master

Type:Thesis

Country:China

Candidate:W W Li

GTID:2428330611472439

Subject:Applied statistics

Abstract/Summary:

With the development of communication tools, SMS has become increasingly popular because of its low cost, convenient delivery, and mobility. Although it has brought great convenience, the spam message problem has grown more and more serious. Advertisements, fraudulent messages, rumors, and similar content seriously harm public security. However, the laws and regulations governing spam messages are still incomplete. Many unscrupulous organizations exploit gray areas of the law to collect mobile phone information and sell it to businesses and individuals who want it, which seriously endangers users' information security and disrupts their daily lives. Users are exposed to a constant stream of sales promotions, prize-winning notices, scams, and similar messages, which is very tiring.

To address these problems, this paper proposes a more accurate method for identifying spam messages. 800,000 SMS messages are analyzed and modeled. An identification model based on the text content of each message is built to identify spam accurately and thereby solve the spam filtering problem. First, analysis of the text content shows that the imbalanced class distribution and the large amount of meaningless information in the text, such as repeated characters, desensitization (masking) characters, and Martian script, have a large effect on classification accuracy. The imbalance is then handled by simple random sampling and undersampling, and the text is cleaned: repeated characters are compressed, desensitization characters are removed, Chinese word segmentation is applied, and a custom dictionary is added to eliminate stop words, filtering out useless information. Next, word-frequency matrices are constructed for spam and normal messages, and word clouds give a preliminary visualization of the data, showing the main components of spam and normal messages intuitively and allowing the two to be compared. Finally, the LDA topic analysis algorithm is used to analyze the topics of spam messages, and several categories of spam are summarized.

For classification, the KNN algorithm was tried first, but its accuracy and classification performance were not satisfactory. After consulting the literature, the Naive Bayes model (NBM) was applied and then improved, and predictions were obtained on out-of-sample data. The overall accuracy is above 95%, achieving accurate classification of spam messages. The main innovation of this paper is the improvement of the Naive Bayes model: additive smoothing (Laplace smoothing) is used to solve the zero-probability problem, and the product of conditional probabilities is replaced by a sum of logarithms to avoid the numerical underflow that repeated multiplication can cause, which greatly improves classification accuracy. The spam predictions obtained by this method can serve as a reference for operators in detecting spam messages accurately, and also help solve the problem for users.
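The two improvements to Naive Bayes named in the abstract, additive smoothing and summing logarithms in place of multiplying probabilities, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `train_nb` and `predict_nb`, the smoothing parameter `alpha`, and the assumption that messages arrive already segmented into word lists are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on pre-segmented messages.

    docs   : list of token lists (word segmentation is assumed done upstream)
    labels : list of class labels, one per message
    alpha  : additive (Laplace) smoothing constant; 1.0 is an assumed default
    """
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    model = {"log_prior": {}, "log_lik": {}}
    for c, n_c in class_docs.items():
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        model["log_prior"][c] = math.log(n_c / len(docs))
        # Additive smoothing: every vocabulary word gets a nonzero
        # probability in every class, so log() is always defined.
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        log_lik = model["log_lik"][c]
        # Sum log-probabilities instead of multiplying probabilities,
        # which avoids floating-point underflow on long messages.
        # Words never seen in training are skipped.
        scores[c] = log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(scores, key=scores.get)
```

In this sketch a word that appears only in normal messages still has a small nonzero probability under the spam class, so a single unseen word cannot force a class score to zero, and a very long message cannot drive the score to a floating-point zero because only logarithms are accumulated.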

Keywords/Search Tags:

Naive Bayes, Word Cloud, Text Segmentation, LDA topic analysis algorithm, Additive smoothing

Related items

1	Design And Implementation Of Short Message Classification System Based On Naive Bayesian
2	Research And Improvement On Naive Bayes Test Classifier
3	Correlation Between The Text Classification. Word
4	Text Categorization Based On Naive Bayes Method
5	Research On Text Classification Algorithm Based On Naive Bayes Method
6	Analysis Of Laptop Network Scoring Based On Text Mining
7	Research And Application On Naive Bayes Classification Algorithm
8	A Text Classifier About High Blood Pressure Based On Naive Bayes
9	Chinese Participle Algorithm Research Based On Word Table Structure
10	Research On The Methods Of Chinese Text Classification Using Bayes And Language Model