
Research And Application Of Text Mining Based On Hadoop

Posted on: 2017-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wang
Full Text: PDF
GTID: 2308330485969610
Subject: Control Theory and Engineering
Abstract/Summary:
With the rapid development of Internet technology, online information has become people's main source of information. People can easily obtain massive amounts of information through the Internet, but this explosive growth also brings inconvenience: it is increasingly difficult for users to quickly and effectively filter valuable information out of massive data. The traditional single-node serial computing model can no longer meet the requirements of processing such volumes of information. Distributed technology offers a new solution: with distributed parallel processing, computation over massive data can be completed quickly and efficiently. Cloud computing, which originated from distributed computing, has clear advantages in handling massive data and high concurrency. In recent years, Hadoop has become a widely used cloud platform. Its hardware can be built from a cluster of ordinary PCs, which makes it economical, and it supports both storage and processing of massive data. Text mining is an active branch of data mining that is widely used in search, classification, recommendation and other fields. Current text mining applies the conventional serial computation mode, which has difficulty meeting the requirements of massive text data. For this reason, this thesis combines the Hadoop platform with text mining technology, focusing on text preprocessing on Hadoop and on the CURE clustering algorithm.
The main work of this thesis is as follows:
(1) The research background and significance are briefly introduced, along with distributed technology, cloud platforms, text mining and other related technologies.
(2) The text preprocessing stage of text mining is studied, and a new method of constructing a stop list is proposed. The construction of the Hadoop platform is briefly introduced. Text preprocessing is implemented as MapReduce jobs and completed through parallel computation on the Hadoop platform. The efficiency of single-machine serial processing and of parallel processing on Hadoop is compared and analyzed.
(3) The CURE clustering algorithm is briefly introduced. An optimized TFIDF formula is proposed and applied in the MapReduce implementation of the CURE algorithm. Results before and after optimizing the TFIDF formula are analyzed and compared, demonstrating that the optimized formula is more effective than the conventional one.
(4) The CURE algorithm is tested and analyzed on the Hadoop platform, including its running efficiency on different clusters. Statistical analysis of the results further demonstrates the advantages of parallel processing on the Hadoop platform.
Experimental analysis of the proposed stop list construction method and of the optimized TFIDF formula demonstrates their research value. At the same time, the validity of cloud computing technology in text mining applications is verified, which provides a new approach for future text mining.
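To make the preprocessing and weighting steps named above concrete, the following is a minimal single-process sketch of a map/reduce-style term count with stop-word filtering, followed by the conventional TFIDF weight (term frequency times log of inverse document frequency). It is not the thesis's implementation: the abstract gives neither the proposed stop list construction method nor the optimized TFIDF formula, so the stop list `STOP_WORDS`, the tokenizer, and the function names `map_tokens`, `reduce_counts` and `tf_idf` are all assumptions chosen for illustration, and no Hadoop API is used.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop list for illustration only; the thesis's own
# construction method is not described in the abstract.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def map_tokens(doc_id, text):
    """Map step: emit ((doc_id, term), 1) for every token not in the stop list."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOP_WORDS:
            yield (doc_id, token), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (doc_id, term) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def tf_idf(docs):
    """Conventional TFIDF: (term count / document length) * log(N / document frequency)."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_tokens(doc_id, text))
    counts = reduce_counts(pairs)

    doc_lengths = Counter()  # number of kept tokens per document
    doc_freq = Counter()     # number of documents containing each term
    for (doc_id, term), n in counts.items():
        doc_lengths[doc_id] += n
        doc_freq[term] += 1

    n_docs = len(docs)
    weights = defaultdict(dict)
    for (doc_id, term), n in counts.items():
        tf = n / doc_lengths[doc_id]
        idf = math.log(n_docs / doc_freq[term])
        weights[doc_id][term] = tf * idf
    return dict(weights)
```

On a real cluster the map and reduce functions would run as separate distributed tasks over input splits; here they are ordinary Python functions so the data flow can be followed on one machine. The resulting per-document weight vectors are the kind of input a clustering algorithm such as CURE would consume.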
Keywords/Search Tags: Distributed, Cloud computing, Hadoop, Text mining, Text clustering