Design and Implementation of a High-Performance Text Clustering Algorithm Based on Hadoop

Posted on: 2014-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: J P Lin
Full Text: PDF
GTID: 2268330422459808
Subject: Software engineering
Abstract/Summary:
The rapid development of information technology has brought rapid growth in internet data, and most of this text information takes the form of web pages. Mining massive amounts of web text, and analyzing and processing it quickly and accurately to obtain useful information and so stay competitive in the information age, has become a problem that major organizations and individuals need to solve. Processing massive text data in parallel in a distributed environment with data mining, and with text clustering in particular, is one of the most effective ways to solve this problem.

Text clustering is an important topic in data mining and an unsupervised machine learning method. Its basic idea is to preprocess text into data a computer can process, calculate the similarity between texts, and form clustering results. This thesis analyzes the basic principles of clustering, summarizes the advantages and disadvantages of existing clustering methods for massive data processing, and introduces distributed parallel technology into the field of text clustering. It designs and implements a distributed parallel text clustering algorithm. On the one hand, this addresses the shortcomings of traditional clustering algorithms on massive data that is high-dimensional and sparse; on the other hand, it addresses the slow, inefficient execution caused by an excessively large data scale.

The main work is as follows: an introduction to the ideas and theory of text clustering algorithms; an in-depth analysis of the existing categories of clustering algorithms and their representative algorithms, with a summary of the advantages, disadvantages, and scope of application of each category; an in-depth study of the basic architecture of the open source distributed platform Hadoop and its key technologies, the HDFS distributed file system and the MapReduce programming model; and the design, and on this basis the implementation, of a distributed parallel text clustering algorithm on the Hadoop platform. The experiments show that the designed distributed parallel text clustering algorithm is feasible for massive, high-dimensional data sets.
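The pipeline described in the abstract (preprocessing, similarity calculation, cluster formation) can be illustrated with a small single-machine sketch. The abstract does not name the specific weighting scheme, similarity measure, or clustering algorithm, so everything here is an assumption made for illustration: TF-IDF weighting, cosine similarity, and one k-means iteration split into a map step (assign a document to its nearest centroid) and a reduce step (average the documents in each cluster). The function names `tfidf_vectors`, `cosine`, `map_assign`, `reduce_centroid`, and `kmeans_iteration` are hypothetical and do not come from the thesis or from Hadoop.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Preprocessing: turn non-empty token lists into sparse, unit-length TF-IDF dicts."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (assumed variant) keeps every weight strictly positive.
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Similarity: cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_assign(vec, centroids):
    """Map step: emit (index of the most similar centroid, document vector)."""
    best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
    return best, vec


def reduce_centroid(members):
    """Reduce step: average all vectors that were assigned to one cluster."""
    total = defaultdict(float)
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}


def kmeans_iteration(vectors, centroids):
    """One k-means round: map every document, group by key, reduce each group."""
    groups = defaultdict(list)
    for vec in vectors:
        key, value = map_assign(vec, centroids)
        groups[key].append(value)
    # A cluster that received no documents keeps its previous centroid.
    return [reduce_centroid(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]
```

In a Hadoop job, the map and reduce steps would run on separate nodes over HDFS blocks, and a driver would repeat the iteration until the centroids stop changing. The sketch keeps the same division of work in one process so that the data flow is easy to follow.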
Keywords/Search Tags: Text Clustering, Data Mining, Hadoop, Distributed, MapReduce