Research On Distributed Fast Clustering Algorithm Based On MapReduce

Posted on:2018-03-11

Degree:Master

Type:Thesis

Country:China

Candidate:Q L Wu

Full Text:PDF

GTID:2348330512481645

Subject:Engineering

Abstract/Summary:

With the rapid development of information technology, the scale of data is growing exponentially. Traditional clustering algorithms face two major challenges: 1) big data contains many outliers, high redundancy, and low value density, so clustering accuracy is low; 2) nearest-neighbor search is expensive on large datasets, so the execution efficiency of clustering algorithms cannot meet practical demand. To overcome these drawbacks, this thesis analyzes the characteristics of the data and proposes a distributed fast clustering algorithm based on MapReduce that achieves both high efficiency and high accuracy.

To address high data redundancy and the large number of low-value data points, a distributed data reduction algorithm based on MapReduce is proposed. A new sampling algorithm computes the rectangular domain and sampling domain of every data point, and sample points are selected within the sampling domain. The sample points are then used to extend the sampling, reducing the original dataset. Finally, a representativeness verification algorithm is proposed to test the sample set. The sampling-based data reduction algorithm significantly reduces both I/O and network costs.

To address the high cost of nearest-neighbor search and the low efficiency of clustering, the Map task partitions the sample set into subsets of equal size, and the Reduce task clusters each data subset. On this basis, an enhanced density clustering algorithm based on an extended range query is proposed. First, an extended range query algorithm based on fixed grids determines the nearest neighbors and reverse nearest neighbors, and the influence space neighborhood of each point is established. Then a method for computing an outlierness function is presented to distinguish border points from noise points accurately. The Reduce clustering task outputs the local clustering results.

To obtain the global clustering result for the whole dataset, a new local cluster merging algorithm is proposed. The distribution relationship of the local clusters is determined by calculating the distances between clusters, which identifies the local clusters that can be combined. A connected subgraph discovery method is then used to merge the local clusters, yielding the global clustering result.

Experimental results indicate that the proposed algorithm performs well at reducing the size of large datasets while keeping the distribution of the sample points consistent with that of the original data points, and data redundancy is reduced without loss of information. The proposed algorithm can also find clusters of arbitrary shape and density, and it performs better in terms of efficiency and accuracy.
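
The local cluster merging step described in the abstract (compute distances between local clusters, then merge them through connected subgraph discovery) could look roughly like the sketch below. It is not the thesis implementation: the single-link cluster distance, the `merge_dist` threshold, and all function names are assumptions made for illustration, and connected subgraphs are found here with a simple union-find structure.

```python
from itertools import combinations
from math import dist


def cluster_distance(a, b):
    """Smallest point-to-point Euclidean distance between two clusters.

    Assumed metric; the abstract does not say how cluster distance is defined.
    """
    return min(dist(p, q) for p in a for q in b)


def merge_local_clusters(local_clusters, merge_dist):
    """Merge local clusters that are linked by a chain of close pairs.

    local_clusters: list of clusters, each a list of coordinate tuples.
    merge_dist: hypothetical threshold; two clusters are connected when
    their distance is at most this value.
    """
    parent = list(range(len(local_clusters)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge between every pair of sufficiently close local clusters.
    for i, j in combinations(range(len(local_clusters)), 2):
        if cluster_distance(local_clusters[i], local_clusters[j]) <= merge_dist:
            parent[find(i)] = find(j)

    # Each connected component of the cluster graph becomes one global cluster.
    merged = {}
    for i, cluster in enumerate(local_clusters):
        merged.setdefault(find(i), []).extend(cluster)
    return list(merged.values())


local = [[(0, 0), (1, 0)], [(1.5, 0), (2, 0)], [(10, 10)], [(2.4, 0)]]
global_clusters = merge_local_clusters(local, merge_dist=0.5)
```

In this toy input the first, second, and fourth local clusters form one connected component even though the first and fourth are not directly within the threshold of each other, which is the reason for using connected components rather than pairwise merging alone.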

Keywords/Search Tags:

Big data, Cluster, MapReduce, Parallel computing, Data mining

Related items

1	The Design And Implementation Of Parallel Computing Platform Based On MapReduce
2	Parallel Data Mining Theory Research And Application
3	Parallel Frequent Itemset Mining Based On MapReduce
4	Based On The Parallel Implementation Of Multi-node Data Mining Algorithm
5	Frequent Subgraph Mining In Graph Databases Based On MapReduce
6	Multidimensional Data Model For Mining And Analysis Based On Multiple Structure Data Cube
7	MapReduce-based Parallel Data Mining Services For TCM
8	The Design And Implementation Of A MapReduce Computing Framework Based On GPU Cluster
9	Research On Text Mining Based On MapReduce
10	The Research And Implement Of Data Mining Algorithms Based On Hadoop