Research And Application Based On Spark Text Mining Technology

Posted on:2017-01-09

Degree:Master

Type:Thesis

Country:China

Candidate:H X Jin

Full Text:PDF

GTID:2428330569485048

Subject:Software engineering

Abstract/Summary:

With the advent of the big data era, the demand for massive data storage and large-scale computation has driven the rapid development of big data storage and computing platforms. In a traditional single-machine environment, text mining algorithms take a long time to extract and classify text features. The MapReduce computing framework must write intermediate data to the file system, and as performance demands grow it is far from able to meet users' needs.

A feature extraction algorithm, FEBLTL, is proposed and implemented on the Spark framework. Starting from a preliminary selection of features by LDA, the algorithm combines the LDA features with lexical features, position, and feature weight, and uses logistic regression to identify the key features; this supervised learning step improves the accuracy of key feature extraction. Also in the Spark environment, the maximum entropy text classification algorithm is improved: the binary feature function of the maximum entropy model is redefined using the TextRank feature weight value. The improved maximum entropy algorithm is compared with K-nearest neighbor, SVM, and Naive Bayes in text classification. It is superior to Naive Bayes in classification accuracy and close to K-nearest neighbor and SVM. The offline model is loaded by Spark Streaming to carry out real-time batch mining of text.

The results of the research are applied to the analysis of comment text. Spark-based text feature extraction and text classification are designed and implemented: the Spark distributed computing framework is used to crawl a corpus of comment text, extract semantic tags from the comments, and classify their sentiment. Semantic tag extraction and sentiment classification of comment text on Spark allow the text information to be analyzed quickly. In the field of text mining, the Spark parallel computing framework supports fast, real-time mining and analysis of text data, and the improved feature extraction and text classification algorithms extract key text features and classify text more accurately.
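
The abstract does not give the thesis's TextRank formulation, so the following is only a minimal single-machine sketch of the standard TextRank word-weighting idea that such a feature weight could be based on. It is not the FEBLTL algorithm and not Spark code; the function name `textrank_weights`, the window size, the damping factor, and the sample tokens are all assumptions made for illustration.

```python
from collections import defaultdict

def textrank_weights(tokens, window=2, damping=0.85, iterations=50):
    """Score words with standard TextRank on an undirected co-occurrence graph.

    Illustrative sketch only: parameters and names are assumptions,
    not taken from the thesis.
    """
    # Link every pair of distinct words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank-style update; each word shares its score
    # equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores

# Hypothetical, already-tokenized sample text.
tokens = "spark text mining spark text classification spark feature extraction".split()
weights = textrank_weights(tokens)
top_word = max(weights, key=weights.get)
```

In a setup like the one the abstract describes, weights of this kind could replace the 0/1 value of a maximum entropy feature function, so that a highly ranked word contributes more to the class score than a marginal one; how the thesis actually does this is not stated here.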

Keywords/Search Tags:

Spark Computation Framework, Big Data, Text Classification, Feature Extraction, Distributed Network Crawler

Related items

1	Research On Text Feature Extraction Method Based On Spark
2	Research And Implementation On Feature Extraction And Classification Of Chinese Text Based On SPARK
3	The Design And Implementation Of Large Text Classification Based On Spark
4	Research On Chinese Text Feature Classification Based On Distributed Framework
5	Parallelized Text Classification Algorithm Research
6	Research On Classification Of Massive Text Feature Under Distributed Architecture
7	Design And Implementation Of Distributed Focused Crawler System For Text Data
8	Design And Implementation Of Text Classifier Based On Neural Network With Spark
9	Research And Realization Of Chinese And English Vertical Search Engines On The Police
10	Design And Implementation Of Text Classification Model Based On The Improved TF-IDF Feature Extraction