Research And Application Based On Spark Text Mining Technology

Posted on: 2017-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: H X Jin
Full Text: PDF
GTID: 2428330569485048
Subject: Software engineering
Abstract/Summary:
With the arrival of the big data era, the demand for massive data storage and large-scale computation has driven the rapid development of big data storage and computing platforms. In a traditional single-machine environment, text mining algorithms spend a great deal of time extracting and classifying text features. The MapReduce computing framework must write intermediate data to the file system, and as performance demands grow it falls far short of what users need. This thesis proposes a feature extraction algorithm, FEBLTL, and implements it on the Spark framework. Starting from a preliminary selection of features by LDA, the algorithm describes each candidate feature by its lexical properties, position, and weight, and then identifies the key features with logistic regression; this supervised learning step improves the accuracy of key feature extraction. Also in the Spark environment, the maximum entropy text classification algorithm is improved by redefining the maximum entropy binary feature function using TextRank feature weights. The improved maximum entropy algorithm is compared with the K-nearest neighbor, SVM, and Naive Bayes algorithms on text classification: it is superior to Naive Bayes in classification accuracy and close to K-nearest neighbor and SVM. The model trained offline is loaded by Spark Streaming to mine text in real-time batches. The research results are applied to the analysis of comment text. Spark-based text feature extraction and text classification are designed and implemented: the Spark distributed computing framework is used to crawl a corpus of comment text, extract semantic tags from the comments, and classify their sentiment. Spark-based semantic tag extraction and sentiment classification can analyze text information quickly. In the field of text mining, the Spark parallel computing framework enables fast, real-time mining and analysis of text data, and the improved feature extraction and text classification algorithms extract key text features and classify text more accurately.
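The abstract does not say how the TextRank feature weights are computed, so the following is only a minimal single-machine sketch of the standard TextRank idea: words are linked when they co-occur within a small window, and scores are propagated PageRank-style. The function name `textrank_weights` and the parameters `window`, `damping`, and `iterations` are illustrative assumptions, not taken from the thesis, and the sketch does not reproduce the thesis's Spark implementation or its redefined maximum entropy feature function.

```python
from collections import defaultdict


def textrank_weights(tokens, window=2, damping=0.85, iterations=50):
    """Return a dict mapping each distinct token to a TextRank score.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    # Build an undirected co-occurrence graph: two distinct words are linked
    # when they appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    vocab = set(tokens)
    scores = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        updated = {}
        for w in vocab:
            # Each neighbor shares its current score equally among its links.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            updated[w] = (1.0 - damping) + damping * incoming
        scores = updated
    return scores
```

Calling `textrank_weights("spark text mining spark text classification".split())` returns one score per distinct word; words that co-occur with many different words receive higher scores, which is the kind of weight a classifier could then use in place of a plain 0/1 feature indicator.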
Keywords/Search Tags:Spark Computation Framework, Big Data, Text Classification, Feature Extraction, Distributed Network Crawler