The Design And Implementation Of Large Text Classification Based On Spark

Posted on:2018-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:F X Song

Full Text:PDF

GTID:2348330512480151

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of Internet technology, large amounts of text data are generated on the Internet. However, most of this data has not been processed or classified, which leads to problems such as spam and unwanted advertising and makes it difficult to distinguish useful information from useless information. It is therefore of great theoretical significance and practical value to investigate how to classify massive text data efficiently.

First, this thesis analyzes the problems of traditional text classification algorithms. Their drawbacks are as follows: (1) The traditional algorithms used to select feature vectors are slow and inefficient. The feature space of massive data tends to be open-ended, while the batch mode used to select features offline is not only inefficient but also causes severe problems such as memory overflow. (2) Traditional classifiers are not suited to big-data computing frameworks. Most big data is now processed with distributed parallel computing, while traditional classification algorithms, such as SVM and the naive Bayes classifier, are not well suited to distributed parallel computing. In addition, deep learning algorithms, though widely used in semantic recognition, do not work well in a text classification system because model training is very time-consuming.

To solve these problems, the present study focuses on two aspects: text representation and classifier design. The main results are as follows: (1) In terms of text representation, this thesis presents an online field feature selection algorithm (the OFFS algorithm) based on streaming data, which addresses the low efficiency and high memory consumption of traditional feature selection algorithms. With improvements to the vector space model, the new algorithm can select features from the data in real time and quickly generate text vectors. (2) In terms of classifier design, an OFFS-BP neural network text classifier, based on a BP neural network and the OFFS algorithm, is designed. It adapts to distributed parallel computing, reduces training time, and balances computational efficiency against classification accuracy. (3) The OFFS-BP neural network classifier is implemented on the Spark platform. First, the Spark Streaming sub-framework is used to implement the OFFS algorithm; then the Spark MLlib sub-framework is used to implement the BP neural network classifier; finally, the Spark Streaming and Spark MLlib frameworks are connected seamlessly through RDDs, Spark's programming model. The experimental results show that the OFFS-BP neural network classifier is better suited to a big data environment, with less computation time and higher classification efficiency.
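The abstract does not give the details of the OFFS algorithm, so the following is only a generic sketch of the idea it describes: selecting features incrementally from micro-batches of labeled text and producing vector-space representations without re-reading earlier data. The class name `OnlineFeatureSelector`, the scoring rule (the gap in per-class document rate), and the parameter `k` are all assumptions made for illustration; they are not taken from the thesis, and plain Python stands in for Spark Streaming here.

```python
from collections import Counter, defaultdict
import heapq


class OnlineFeatureSelector:
    """Hypothetical online feature selector; NOT the thesis's OFFS algorithm."""

    def __init__(self, k):
        self.k = k                      # number of features to keep (assumed parameter)
        self.n_docs = Counter()         # documents seen so far, per class
        self.df = defaultdict(Counter)  # term -> class -> document frequency

    def update(self, batch):
        """Consume one micro-batch of (tokens, label) pairs; only counts are stored."""
        for tokens, label in batch:
            self.n_docs[label] += 1
            for term in set(tokens):
                self.df[term][label] += 1

    def score(self, term):
        """Assumed scoring rule: largest gap in document rate between classes."""
        rates = [self.df[term][c] / n for c, n in self.n_docs.items()]
        return max(rates) - min(rates)

    def features(self):
        """Current top-k terms, sorted so the vector layout is stable."""
        top = heapq.nlargest(self.k, self.df, key=lambda t: (self.score(t), t))
        return sorted(top)

    def vectorize(self, tokens):
        """Term-count vector over the currently selected features."""
        counts = Counter(tokens)
        return [counts[t] for t in self.features()]


if __name__ == "__main__":
    selector = OnlineFeatureSelector(k=3)
    # Two micro-batches arriving one after another, as a stream would deliver them.
    selector.update([(["buy", "cheap", "pills", "now"], "spam"),
                     (["meeting", "at", "noon"], "ham")])
    selector.update([(["cheap", "pills", "offer"], "spam"),
                     (["lunch", "meeting", "now"], "ham")])
    print(selector.features())
    print(selector.vectorize(["cheap", "cheap", "pills", "today"]))
```

Because only per-class counts are kept, memory grows with the vocabulary rather than with the number of documents, which is the property the abstract attributes to online selection. In a Spark setting, the resulting vectors would be what a downstream classifier consumes.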

Keywords/Search Tags:

Big data, Text classification, Online feature selection, Neural network, Spark

Related items

1	Application Research Of Spark-based Multi-strategy Bat Algorithm In Text Feature Selection
2	Design And Implementation Of Text Classifier Based On Neural Network With Spark
3	Research On Text Classification Method Based On Improved Feature Selection Algorithm
4	Research And Application Based On Spark Text Mining Technology
5	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
6	Classification Research On News Text Classification Based On Feature Selection Method
7	Feature Selection And Feature Representation Text Classification Based On Convolutional Neural Networks
8	Application Research Of Spark-based Dragonfly Algorithm In Text Categorization
9	Researches On Feature Selection In Text Classification
10	Study On Feature Selection And Feature Weighting Of Chinese Text Classification