Parallel Overlapping Clustering Algorithm Based On Hadoop Platform

Posted on: 2015-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2268330428976109
Subject: Computer system architecture
Abstract/Summary:
Clustering analysis is an important field in data mining, used mainly to discover how data objects are distributed in space. It divides data objects into a number of clusters by calculating the similarity between objects: objects within the same cluster are relatively similar to one another, while objects in different clusters are less similar. In real-world data, clusters often overlap. In other words, the boundaries between clusters are not clear-cut, and some data objects may belong to multiple clusters. Overlapping clustering algorithms are designed to handle such objects.

In this thesis, we present several methods for evaluating the similarity between clusters. The proposed overlapping clustering framework consists of three parts: clustering, selection, and aggregation. In the clustering part, any clustering algorithm can be used. In the selection part, clusters that may share overlapping data objects are chosen; several selection methods based on soft and hard clustering algorithms are presented. In the aggregation part, the selected clusters are blended, so that overlapping data objects are assigned to two or more clusters. Because the framework accepts any clustering algorithm, it is flexible in data processing and better represents the distribution of data objects in space.

However, when faced with big data, traditional serial clustering algorithms may not meet practical demands in either throughput or computational capacity. We therefore parallelize the overlapping clustering framework using the MapReduce programming model and process large data sets on the Hadoop platform. The experimental results show that parallel clustering based on the MapReduce model can improve the efficiency of clustering algorithms.
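The three-part framework described above (clustering, selection, aggregation) can be illustrated with a minimal serial sketch in Python. The abstract does not specify the thesis's actual selection or aggregation criteria, so everything concrete here is an assumption made for illustration: plain k-means as the hard clustering step, a center-distance threshold (`max_gap`) as the selection rule, a distance-ratio test (`ratio`) as the aggregation rule, and all function names and sample data.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def assign(points, centers):
    """Group each point with the index of its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        groups[nearest].append(p)
    return groups


def cluster(points, centers, iterations=10):
    """Stage 1 (clustering): plain hard k-means; any algorithm could stand in."""
    for _ in range(iterations):
        groups = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(points, centers)


def select(centers, max_gap):
    """Stage 2 (selection): hypothetical rule, keep pairs of clusters
    whose centers lie closer together than max_gap."""
    return [(i, j) for i, j in combinations(range(len(centers)), 2)
            if dist(centers[i], centers[j]) < max_gap]


def aggregate(centers, groups, pairs, ratio=2.0):
    """Stage 3 (aggregation): hypothetical rule, a point in a selected pair
    joins both clusters when its two center distances differ by at most
    the given ratio."""
    membership = {p: {i} for i, g in enumerate(groups) for p in g}
    for i, j in pairs:
        for p in groups[i] + groups[j]:
            di, dj = dist(p, centers[i]), dist(p, centers[j])
            if max(di, dj) <= ratio * min(di, dj):
                membership[p] |= {i, j}
    return membership


# Made-up sample data: two nearby groups, one point between them, one far group.
points = [(0, 0), (0, 1), (1, 0), (2.5, 0.5),
          (4, 0), (4, 1), (5, 0),
          (20, 0), (20, 1), (21, 0)]

centers, groups = cluster(points, [(0, 0), (4, 0), (20, 0)])
pairs = select(centers, max_gap=5.0)
membership = aggregate(centers, groups, pairs)

# Report only the points that ended up in more than one cluster.
overlapping = {p: ids for p, ids in membership.items() if len(ids) > 1}
print(overlapping)
```

With this sample data only the first two clusters are selected, and the point lying between them is assigned to both. In a MapReduce setting, one plausible split (not confirmed by the abstract) would place the per-point work of `assign` and `aggregate` in map tasks and the per-cluster `centroid` update in reduce tasks.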
Keywords/Search Tags: Overlapping Clustering, Parallel Computation, Hadoop, MapReduce