
Research And Implementation Of SMS Spam Filtering System Based On Spark

Posted on: 2017-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
GTID: 2348330518495976
Subject: Information security
Abstract/Summary:
In recent years, with the spread of network access and the rapid growth in the number of mobile phone users, the flood of spam messages has become a prominent problem that seriously affects people's lives and disturbs social order. To address it, the government has introduced laws and regulations to clamp down on spam messages, and the major telecommunication operators have also taken countermeasures. The spam filtering methods in common use today include blacklists and whitelists, filtering based on message length and sending frequency, and filtering based on text classification. A single filtering method works only on certain types of spam messages, and as message volume has grown, these techniques have not performed well. To tackle these problems, this thesis studies a spam filtering system based on Spark. The system combines several filters and processes messages in parallel, which improves both filtering effectiveness and the ability to handle large amounts of data. The thesis discusses and analyzes the following points:

1. The thesis reviews the background of spam messages, gives their definition and classification, and summarizes their characteristics and the harm they cause. It explains in detail the current state of spam message handling in China and abroad. It then studies text classification technology in depth, including text preprocessing, cleaning and denoising, feature dimension reduction, and text classification algorithms. It also introduces Simhash, examines the Hadoop and Spark platforms, and analyzes their operating principles.

2. A serial filtering system is designed and implemented. A requirements analysis is conducted first, and the modules are then designed in detail and coded. The serial filtering system consists mainly of a message processing module, a characteristic identification module, a Simhash module, and a Bayesian classifier module. It resolves the problem that a single filter cannot cover all message types. Introducing the Simhash algorithm not only improves filtering effectiveness but also increases the classification speed of the Bayesian classifier by compressing the sample database.

3. Building on the advantages of the Spark platform, the filtering system is optimized for parallel execution, covering the characteristic identification module and the Bayesian classifier module. The thesis describes in detail the design principles of this parallel optimization: scalability, efficiency, and parallelism. Following these principles, a concurrent design is developed for the modules to be optimized and then implemented in code. In addition, parallel extraction is used to simplify the sample database, and a policy repository is established to address oversized sample sets and other problems that reduce filtering efficiency. The modules are tested on an experimental platform set up for this purpose, and the results are analyzed to draw conclusions.

The experimental results show that the Spark-based spam message filtering system classifies and filters spam messages efficiently and can handle massive volumes of text. The system also has good scalability and practicality, and it provides a new approach to processing spam messages at scale.
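
The idea of using Simhash to compress the sample database before Bayesian classification can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the whitespace tokenizer, the 64-bit fingerprint built from MD5, the Hamming-distance threshold of 3, and the function names `simhash`, `hamming_distance` and `compress_samples` are all assumptions made for illustration. Chinese SMS text would need a proper word segmenter in place of `split()`.

```python
import hashlib
from collections import Counter

HASH_BITS = 64  # fingerprint width; an assumption, not taken from the thesis


def _token_hash(token: str) -> int:
    """Map a token to a stable 64-bit integer using the first 8 bytes of MD5."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Compute a Simhash fingerprint, weighting each token by its frequency."""
    weights = Counter(text.lower().split())  # naive whitespace tokenizer
    totals = [0] * HASH_BITS
    for token, weight in weights.items():
        h = _token_hash(token)
        for bit in range(HASH_BITS):
            # Add the weight where the token hash has a 1 bit, subtract otherwise.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def compress_samples(samples, max_distance=3):
    """Keep one representative per group of near-duplicate messages.

    A message is dropped when its fingerprint lies within max_distance bits
    of a message that was already kept. The threshold is hypothetical.
    """
    kept = []  # list of (fingerprint, text) pairs
    for text in samples:
        fp = simhash(text)
        if all(hamming_distance(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, text))
    return [text for _, text in kept]
```

The pairwise comparison in `compress_samples` is quadratic in the number of kept samples, which is acceptable for a sketch. At the scale the abstract describes, fingerprints would more plausibly be computed and grouped in parallel.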
Keywords/Search Tags: Spam SMS, Text Classification, Simhash, Spark