Research On Recognition Of Spam SMS Based On Binary Mixed Features Of Text Content

Posted on: 2018-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: C C Shang
Full Text: PDF
GTID: 2428330518455051
Subject: Probability theory and mathematical statistics
Abstract/Summary:
In the era of "Big Data", the continuous development of communication technology and the constant updating of intelligent terminal technology have allowed spam SMS techniques to evolve rapidly at low cost. SMS, as a communication carrier that people use frequently, has therefore become a target for profit-driven actors and criminals. According to a survey report from the 12321 Reporting Center, an authority that receives and handles reports of spam SMS, users receive about 10 spam messages per week on average, and almost everyone has been troubled by spam messages. Although carrier operators and government departments have proposed a series of measures to govern spam SMS and have had some success, technological change makes spam SMS diverse and short-lived. Recognizing and managing spam SMS is therefore a long-term, formidable task of real practical significance.

Firstly, based on part of the SMS dataset released by China Mobile in the 2015 Big Data Contest, this thesis takes text categorization of spam SMS as its research object. It introduces the current development of SMS, the serious consequences of spam SMS flooding, the characteristics and types of spam messages, the achievements in managing spam messages, and the significance of research on spam SMS recognition. It also reviews the state of research, in China and abroad, on spam SMS recognition and governance.

Secondly, this thesis introduces traditional and content-based techniques for recognizing spam SMS, including Chinese word segmentation, text preprocessing, text representation, and feature selection. It then presents two strong classifiers, Support Vector Machine and Random Forest, which can be used to identify the spam messages in the China Mobile dataset. Their classification results leave room for improvement, however. The poor performance is attributed to two problems: imbalanced data, and the semantic sparsity caused by short text content.

Thirdly, this thesis improves on some common methods for handling imbalanced data. Exploiting the particular characteristics of the dataset, it uses an LDA topic model to explore the topic structure of the normal SMS messages, and uses the K-means method to cluster the document-topic distributions derived from the LDA model. It then randomly samples messages from each resulting subcategory according to a fixed proportion, which yields a relatively balanced dataset. Finally, it selects features and builds a random forest classifier on this relatively balanced dataset, and the recognition rate for spam SMS improves to a certain extent.

Lastly, to address the problems of short text and semantic sparsity, this thesis proposes a feature-extension method based on statistical properties of the text content. These new features, together with the text features of the SMS, constitute the Binary Mixed Features (BMF). A random forest classifier built on BMF and the LDA-Kmeans algorithm greatly improves classification performance. The thesis closes with a summary and proposes directions for future research on spam SMS recognition.
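The per-cluster sampling step of the LDA-Kmeans approach described above can be sketched as follows. This is only a minimal illustration, not the thesis's implementation: it assumes each normal message already has a cluster label (obtained, per the abstract, by running K-means on LDA document-topic distributions, which is not shown here). The function name `undersample_by_cluster`, the `keep_ratio` parameter, and the rule of keeping at least one message per cluster are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def undersample_by_cluster(cluster_labels, keep_ratio, seed=0):
    """Return sorted indices of the messages kept after per-cluster undersampling.

    cluster_labels[i] is the cluster id assigned to normal message i
    (assumed to come from K-means over LDA topic distributions).
    keep_ratio is the fraction of each cluster to keep (hypothetical parameter).
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")

    rng = random.Random(seed)

    # Group message indices by their cluster id.
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)

    kept = []
    for label in sorted(members):
        idxs = members[label]
        # Assumption: keep at least one message so no topic cluster vanishes.
        n_keep = max(1, round(len(idxs) * keep_ratio))
        # Draw without replacement from this cluster only.
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)


# Toy example: 10 normal messages in three clusters of sizes 6, 3 and 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
kept = undersample_by_cluster(labels, keep_ratio=0.5)
```

Sampling within each cluster, rather than from the normal messages as a whole, keeps every topic subcategory represented in the reduced majority class. The kept normal messages would then be combined with all spam messages to form the relatively balanced training set.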
Keywords/Search Tags: Spam SMS, Text categorization, Imbalanced data, Semantic sparsity, Binary Mixed Features