Design And Implementation Of High Performance Text Clustering Algorithm Based On Hadoop

Posted on:2014-10-30

Degree:Master

Type:Thesis

Country:China

Candidate:J P Lin

GTID:2268330422459808

Subject:Software engineering

Abstract/Summary:

The rapid development of information technology has brought rapid growth in Internet data, and most of this text information exists in the form of web pages. Mining the massive volume of text on the web, and analyzing and processing it quickly and accurately to obtain useful information, has become a problem that major organizations and individuals must solve to remain competitive in the information age. Processing massive text data in parallel in a distributed environment, using data mining and text clustering techniques, is one of the most effective ways to address this problem.

Text clustering is an important topic in the field of data mining and is an unsupervised machine learning method. Its basic idea is to pre-process the text into a form the computer can handle, calculate text similarity, and form the clustering results. This thesis analyzes the basic principles of clustering, summarizes the advantages and disadvantages of existing clustering methods for massive data processing, and introduces distributed parallel technology into the field of text clustering. It designs and implements a distributed parallel text clustering algorithm that, on the one hand, addresses the shortcomings of traditional clustering algorithms on massive data that is high-dimensional and sparse and, on the other hand, addresses the slow and inefficient execution caused by very large data scale.

The main work is as follows: introduce the ideas and theoretical background of text clustering algorithms, analyze in depth the existing categories of clustering algorithms and their representative algorithms, and summarize the advantages, disadvantages, and scope of application of each category; study in depth the basic architecture of the open-source distributed platform Hadoop and its key technologies, the HDFS distributed file system and the MapReduce programming model; and design a distributed parallel text clustering algorithm based on the Hadoop distributed platform and, on this basis, implement it. The experiments show that the designed distributed parallel text clustering algorithm is feasible for processing massive, high-dimensional data sets.
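The three stages named in the abstract (text pre-processing, similarity calculation, and forming the clusters) can be illustrated with a small single-machine sketch. The abstract does not say which clustering algorithm the thesis uses, so the code below swaps in k-means with cosine similarity over TF-IDF vectors purely as an example, and it only imitates the MapReduce split with ordinary Python functions rather than calling Hadoop. All function names (`tfidf_vectors`, `map_phase`, `reduce_phase`, `kmeans`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real pre-processing)."""
    return text.lower().split()


def normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Pre-processing: turn raw documents into sparse unit-length TF-IDF vectors."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps every weight strictly positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def cosine(a, b):
    """Similarity calculation: dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def map_phase(vectors, centroids):
    """Map step: emit (index of most similar centroid, vector) per document."""
    for vec in vectors:
        best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        yield best, vec


def reduce_phase(pairs):
    """Reduce step: average the vectors grouped under each centroid index."""
    groups = defaultdict(list)
    for key, vec in pairs:
        groups[key].append(vec)
    new_centroids = {}
    for key, members in groups.items():
        total = defaultdict(float)
        for vec in members:
            for t, w in vec.items():
                total[t] += w
        mean = {t: w / len(members) for t, w in total.items()}
        new_centroids[key] = normalize(mean)
    return new_centroids


def kmeans(vectors, k, iterations=10):
    """Run a fixed number of map/reduce rounds and return a label per document."""
    # Assumption: naive initialization from the first k documents.
    centroids = [dict(v) for v in vectors[:k]]
    for _ in range(iterations):
        updated = reduce_phase(map_phase(vectors, centroids))
        # Keep the old centroid if no document was assigned to it.
        centroids = [updated.get(i, centroids[i]) for i in range(k)]
    return [key for key, _ in map_phase(vectors, centroids)]
```

Calling `kmeans(tfidf_vectors(docs), k)` on a list of strings returns one cluster index per document. In a real Hadoop job the map and reduce steps would run on separate nodes over HDFS blocks; here they are plain generators and functions so the data flow is easy to follow.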

Keywords/Search Tags:

Text Clustering, Data Mining, Hadoop, Distributed, MapReduce

Related items

1	Design And Implementation Of Clustering Algorithm For Large Scale Chinese Short Text Based On MapReduce
2	Implementation Of Distributed Hierarchical Clustering Algorithm Faced To Huge Commodity Dataset
3	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
4	Research And Application Of Text Mining Based On Hadoop
5	The Research Of Clustering Mining Based On Logistics History Data On The Hadoop
6	Parallel Clustering Algorithm Based On MapReduce
7	Research And Implementation Of MapReduce-based Graph Clustering Algorithm
8	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
9	Research And Implementation Of Text Clustering Based On AP Algorithm
10	Research On Distributed Fast Clustering Algorithm Based On MapReduce