Parallel Overlapping Clustering Algorithm Based On Hadoop Platform

Posted on:2015-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhang

Full Text:PDF

GTID:2268330428976109

Subject:Computer system architecture

Abstract/Summary:

Cluster analysis is an important field in data mining, used mainly to discover how data objects are distributed in space. It divides data objects into clusters by computing the similarity between objects: objects in the same cluster are relatively similar to one another, while objects in different clusters are less so. In the real world, clusters often overlap. In other words, the boundaries between clusters are not clear-cut, and some data objects may belong to multiple clusters. Overlapping clustering algorithms are designed to handle such objects.

In this thesis, we present several methods for evaluating the similarity between clusters. The overlapping clustering framework consists of three parts: clustering, selection, and aggregation. In the clustering part, any kind of clustering algorithm can be used. In the selection part, clusters that may share overlapping data objects are chosen; several selection methods based on soft and hard clustering algorithms are presented. In the aggregation part, the selected clusters are blended, so that overlapping data objects are assigned to two or more clusters. Because the framework accepts any clustering algorithm, it is flexible in data processing and gives a better picture of how the data objects are distributed in space.

However, when faced with big data, traditional serial clustering algorithms may not meet practical demands in either throughput or computational capacity. We implement a parallel version of the overlapping clustering framework based on the MapReduce programming model and process large data sets on the Hadoop platform. The experimental results show that parallel clustering based on the MapReduce model can improve the efficiency of clustering algorithms.
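The three-part framework described in the abstract (clustering, selection, aggregation) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the selection or aggregation rules, so the radius-based selection test, the `ratio` threshold, the fixed centroids, and all function names below are assumptions made for illustration. The clustering stage is written as a per-point map step followed by grouping by key, to mirror the MapReduce style on a single machine.

```python
from collections import defaultdict
from itertools import combinations
from math import dist  # Python 3.8+


def map_point(point, centroids):
    """Map step (assumed): emit (nearest centroid index, point)."""
    key = min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))
    return key, point


def cluster(points, centroids):
    """Clustering stage: hard assignment, grouped by key like a reduce input."""
    groups = defaultdict(list)
    for key, point in (map_point(p, centroids) for p in points):
        groups[key].append(point)
    return dict(groups)


def select(clusters, centroids):
    """Selection stage (hypothetical rule): keep cluster pairs whose
    radii together exceed the distance between their centroids."""
    radius = {c: max(dist(p, centroids[c]) for p in members)
              for c, members in clusters.items()}
    return [(a, b) for a, b in combinations(sorted(clusters), 2)
            if dist(centroids[a], centroids[b]) < radius[a] + radius[b]]


def aggregate(clusters, centroids, pairs, ratio=1.5):
    """Aggregation stage (hypothetical rule): a point in a selected pair
    also joins the other cluster if that centroid is at most `ratio`
    times as far away as the point's own centroid."""
    result = {c: list(members) for c, members in clusters.items()}
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            for p in clusters[src]:
                if dist(p, centroids[dst]) <= ratio * dist(p, centroids[src]):
                    result[dst].append(p)
    return result


# Toy data: two nearby clusters on a line and one distant point.
centroids = [(1, 0), (3, 0), (10, 10)]
points = [(0, 0), (1, 0), (1.9, 0), (3, 0), (4.5, 0), (10, 10)]

hard = cluster(points, centroids)
pairs = select(hard, centroids)
overlapping = aggregate(hard, centroids, pairs)
```

In this toy run only clusters 0 and 1 are selected, and the point `(1.9, 0)`, which lies between their centroids, ends up in both. Because `map_point` handles each point independently, it is the part that would most naturally be distributed across mappers, with the cluster index serving as the shuffle key.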

Keywords/Search Tags:

Overlapping Clustering, Parallel Computation, Hadoop, MapReduce

Related items

1	Parallel Clustering Algorithm Based On MapReduce
2	Study On Iterative Mapreduce Computation Model For Clustering Analysis
3	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
4	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
5	Research And Implementation Of Mapreduce-based Graph Clustering Algorithm
6	Research On Parallel Clustering Algorithm On Hadoop Platform
7	Design And Implementation Of Clustering Algorithm For Large Scale Chinese Short Text Based On Mapreduce
8	Study On Parallel Algorithm Of K-Medoids Based On MapReduce
9	Ant Colony Optimization Clustering Algorithm Design And Improvement Research Based On MapReduce
10	The Clustering Algorithm Based On Hadoop Parallel Analysis And Applied Research