Research And Application Of Text Mining Based On Hadoop

Posted on:2017-05-11

Degree:Master

Type:Thesis

Country:China

Candidate:Z Wang

Full Text:PDF

GTID:2308330485969610

Subject:Control Theory and Engineering

Abstract/Summary:

With the rapid development of Internet technology, the network has become people's main source of information. People can easily obtain massive amounts of information through the Internet, but the explosive growth of information also brings inconvenience: it is increasingly difficult for users to quickly and effectively filter valuable information out of massive data. The traditional single-node serial computing model can no longer meet the requirements of processing such volumes of information. Distributed technology offers a new solution: through distributed parallel processing, computation over massive data can be completed quickly and efficiently. Cloud computing, which originated from distributed computing, has obvious advantages in dealing with massive data and high concurrency. In recent years, Hadoop has become a widely used cloud platform. Its hardware can be built from a cluster of ordinary PCs, which makes it economical, and it supports the storage and processing of massive data. Text mining is an active branch of data mining that is widely used in search, classification, recommendation and other fields. Current text mining largely relies on the conventional serial computation mode, which struggles to meet the requirements of massive text data. For this reason, this paper combines the Hadoop platform with text mining technology, focusing on text preprocessing on Hadoop and the CURE clustering algorithm.
The main work of this paper is as follows: (1) The research background and significance are briefly introduced, along with distributed technology, cloud platforms, text mining and other related technologies. (2) The text preprocessing stage of text mining is studied, and a new method of constructing a stop list is proposed. The construction of the Hadoop platform is briefly introduced. Text preprocessing is implemented with MapReduce and completed through parallel computing on the Hadoop platform. The efficiency of single-machine serial processing is compared with that of parallel processing on the Hadoop platform. (3) The CURE clustering algorithm is briefly introduced. An optimized TFIDF formula is presented and applied to the MapReduce implementation of the CURE algorithm. Results before and after optimizing the TFIDF formula are compared, demonstrating that the optimized formula is more effective than the conventional one. (4) The CURE algorithm is tested and analyzed on the Hadoop platform, comparing its running efficiency on different clusters. Statistical analysis of the calculation results further demonstrates the advantages of parallel processing on the Hadoop platform. Experimental analysis of the proposed stop list construction method and the optimized TFIDF formula demonstrates their research value. At the same time, the validity of cloud computing technology in text mining applications is verified, which provides a new approach for future text mining.
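
As a point of reference for items (2) and (3), the following is a minimal sketch of stop-word filtering followed by the conventional TFIDF weighting, written in a map/reduce style but run in a single Python process. It does not reproduce the stop list construction method or the optimized TFIDF formula of this work, neither of which is given in the abstract. The names `STOP_WORDS`, `map_tokens`, `reduce_counts` and `tfidf` are hypothetical, and the weight used is the common textbook form tf(t, d) * log(N / df(t)).

```python
import math
from collections import Counter

# Hypothetical stop list for illustration only; not the list built in this work.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}

def map_tokens(doc_id, text):
    """Map step: emit ((doc_id, term), 1) for every token that is not a stop word."""
    for token in text.lower().split():
        if token not in STOP_WORDS:
            yield (doc_id, token), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (doc_id, term) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def tfidf(docs):
    """Conventional TFIDF: (term count / document length) * log(N / document frequency)."""
    pairs = [p for doc_id, text in docs.items() for p in map_tokens(doc_id, text)]
    counts = reduce_counts(pairs)
    doc_len = Counter()   # number of retained tokens per document
    doc_freq = Counter()  # number of documents containing each term
    for (doc_id, term), n in counts.items():
        doc_len[doc_id] += n
        doc_freq[term] += 1
    n_docs = len(docs)
    return {
        (doc_id, term): (n / doc_len[doc_id]) * math.log(n_docs / doc_freq[term])
        for (doc_id, term), n in counts.items()
    }
```

On a Hadoop cluster the map and reduce functions would run as distributed tasks over file splits rather than over an in-memory dictionary. The resulting term weights are the kind of document vectors that a clustering algorithm such as CURE takes as input.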

Keywords/Search Tags:

Distributed, Cloud computing, Hadoop, Text mining, Text clustering

Related items

1	Cloud Computing-based Research On Text Mining Techniques
2	Design And Implemention Of High Performance Text Clustering Algorithm Basic On Hadoop
3	Research On Parallel Processing Technology Of Large-scale Text Mining Under Cloud Computing Environment
4	Research On Key Problems In Text Mining Based On Cloud Method
5	Research On Parallelization Of Text Clustering Based On Hadoop
6	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
7	Research On Text Clustering Algorithm Based On Cloud Computing
8	Research On The Hadoop-based Distributed Full-text Retrieval And Related Technologies
9	The Research And Development Of Distributed Web Text Retrieval System Based On Hadoop
10	The Key Technologies Research Of Web Text Mining Based On Hadoop