The Design And Implementation Of Large Text Classification Based On Spark

Posted on: 2018-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: F X Song
GTID: 2348330512480151
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of Internet technology, large amounts of text data are generated on the Internet. However, most of this data has not been processed or classified, which leads to problems such as spam and unwanted advertising and makes it difficult to distinguish useful information from useless information. It is therefore of great theoretical significance and practical value to investigate how to classify massive text data efficiently.

First, this paper analyzes the problems of traditional text classification algorithms. Their drawbacks are as follows: (1) The traditional algorithms used to select feature vectors are slow and inefficient. The feature space of massive data is open-ended and tends to grow without bound, while the batch mode used to select features offline is not only inefficient but also causes severe problems such as memory overflow. (2) Traditional classifiers are not suitable for big-data computing frameworks. Most big data is now processed by distributed parallel computing, while traditional classification algorithms, such as SVM and the naive Bayes classifier, are not well suited to distributed parallel computing. In addition, deep learning algorithms, though widely used in semantic recognition, do not work well in text classification systems because model training is very time-consuming.

To solve these problems, the present study focuses on two aspects: text representation and classifier design. The main results are as follows: (1) In terms of text representation, this paper presents an online field feature selection algorithm (OFFS algorithm) based on streaming data, which addresses the low efficiency and high memory consumption of traditional feature selection algorithms. By improving on the vector space model, the new algorithm can select features from the data in real time and quickly generate text vectors. (2) In terms of classifier design, an OFFS-BP neural network text classifier is designed, based on the BP neural network and the OFFS algorithm. It is suited to distributed parallel computing, reduces training time, and balances computational efficiency against classification accuracy. (3) The OFFS-BP neural network classifier is implemented on the Spark platform. First, the Spark Streaming sub-framework is used to implement the OFFS algorithm; then, the Spark MLlib sub-framework is used to implement the BP neural network classifier; finally, Spark Streaming and Spark MLlib are connected seamlessly through the RDD, which is Spark's programming model.

The experimental results show that the OFFS-BP neural network classifier is better suited to big data environments, with less computation time and higher classification efficiency.
Keywords/Search Tags: Big data, Text classification, Online feature selection, Neural network, Spark