
Research And Implementation Of Topic-based Text Mining And Visualization System

Posted on: 2019-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Guo
Full Text: PDF
GTID: 2438330572450332
Subject: Control theory and control engineering
Abstract/Summary:
With the spread of the Internet and the mobile Internet, we live in an age of information explosion. The Internet generates data at an ever-increasing rate, and text is its most common format. This massive store of data not only makes services more convenient and improves user experience; abstract information and intrinsic value can also be extracted from it and applied to recommender systems, user profiling, and other fields. Topic classification has long been an active area of text mining, because traditional topic classification algorithms are inefficient and cannot reveal the implicit topic information in text data. This thesis presents the LDA classification algorithm and proposes an improved scheme that reduces noise and improves both the effectiveness and the efficiency of classification. To verify the algorithm in application, a complete visual classification system is designed, and big data technology is applied so that the algorithm can handle text classification over massive data. The main work of this thesis is as follows:

1. Survey and algorithm study. The thesis reviews the current state of traditional topic classification, text classification methods, and existing big data processing technologies and distributed computing models. It then analyzes keyword-based text classification algorithms. To improve topic classification and to mine the implicit topic information and text semantics of the corpus, it studies the LSA and PLSA algorithms. Finally, it applies the LDA (Latent Dirichlet Allocation) algorithm, a latent semantic analysis method, to classify the document set by topic, optimized with a feature selection algorithm.

2. Design and development of the topic classification system. Crawler technology is used to crawl a health website for the elderly, and the article pages are parsed to obtain a data set from a practical application environment. In Chinese text classification, word segmentation and stop-word filtering are an important first step. After a practical comparison, the Jieba segmentation tool is used to segment the original text, and stop words are then filtered from the segmentation results with a regular-expression matching mechanism, which improves the efficiency and accuracy of stop-word filtering. The LDA algorithm is then used to classify the text set. Finally, the system visualizes the topic results in a Web system using a self-designed visualization form, enhancing the user experience.

3. Validation on a big data platform. To validate the algorithm in current big data scenarios, Hadoop is used as the big data platform, Sqoop is chosen as the data transfer tool between the database and HDFS, and Mahout is selected as the base machine learning library for implementing the complex algorithm; a topic classification system is designed and developed on this stack. The thesis first studies the working mechanisms of Hadoop's core components, the HDFS distributed storage system and the MapReduce distributed computing framework. It then focuses on the MapReduce programming framework and designs the core algorithm as a distributed computation; with distributed storage, the scale of data that can be processed expands greatly. Finally, the classification results are connected to the visualization and Web display components, improving the interactivity of the big data classification system.
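
The stop-word filtering step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the text has already been segmented into tokens (for example by Jieba), and the stop-word list, the pattern, and the name `filter_stop_words` are hypothetical choices made for illustration only.

```python
import re

# Hypothetical stop-word list; the abstract does not specify its contents.
STOP_WORDS = ["的", "了", "是", "和", "在"]

# One compiled pattern: any listed stop word, or a token made up only of
# punctuation, whitespace, or underscores (no letters, digits, or CJK text).
STOP_PATTERN = re.compile(
    "(?:" + "|".join(re.escape(w) for w in STOP_WORDS) + r"|[\W_]+)"
)


def filter_stop_words(tokens):
    """Return the tokens that do not fully match the stop pattern."""
    return [t for t in tokens if not STOP_PATTERN.fullmatch(t)]


# Tokens as a segmenter might produce them (assumed, not taken from the thesis).
tokens = ["老年人", "的", "健康", "，", "饮食", "和", "运动", "是", "关键", "。"]
print(filter_stop_words(tokens))
```

Compiling the whole list into a single pattern means each token is checked with one `fullmatch` call instead of one comparison per stop word, and the same pattern also drops punctuation-only tokens. Whether this matches the gain in efficiency and accuracy reported in the abstract is not something the sketch can confirm.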
Keywords/Search Tags:Topic Classification, LDA Algorithm, Hadoop, Text Preprocessing, Data Visualization