
Parallelization Study Of Improved Clustering Algorithm On MapReduce Programming Model

Posted on: 2017-08-04 | Degree: Master | Type: Thesis
Country: China | Candidate: Y M Dong | Full Text: PDF
GTID: 2348330482984831 | Subject: Computer Science and Technology
Abstract/Summary:
As the scale of data grows, its diversity is increasing rapidly. Data is now massive, heterogeneous, dynamic and diverse, which makes it very difficult to process. Traditional data mining methods can no longer meet modern needs, and the rapid growth of data cannot be handled by relying on hardware alone. The only way to solve the problem is to change the original computing model completely. Only in this way can valuable information be extracted from massive data for the benefit of society. The MapReduce programming model, proposed by Google, provides a new solution to the problem of massive data. The MapReduce framework has been widely used in distributed data processing. The framework hides data handling and low-level details from developers working with complex existing data. The model takes care of task scheduling, data partitioning and robust fault tolerance, which greatly simplifies development.

Clustering is an important aspect of data mining. Cluster analysis has been widely used in industry and business as well as in daily life, and many excellent clustering algorithms have brought great convenience. However, with the rapid growth of data, traditional clustering algorithms can no longer meet modern needs: their running time on massive data sets exceeds what users can accept. Genetic algorithms and the K-means algorithm are both used for data clustering. In this paper, the genetic algorithm and the K-means algorithm are studied, and their performance is improved under a parallel programming model. The improved algorithm is implemented on the MapReduce model.

The study finds that the genetic algorithm converges slowly when dealing with massive amounts of data. Because its initial cluster centers are uncertain, the K-means algorithm also converges slowly on massive amounts of data. This paper introduces an improved genetic algorithm and K-means algorithm. Finally, the improved algorithm is tested on the MapReduce programming model. The experimental results show that the improved algorithm achieves both a higher speedup and faster convergence.
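To make the idea of running K-means on the MapReduce model concrete, the sketch below simulates the map and reduce phases of standard K-means in a single Python process. It is only an illustration under stated assumptions: it is not the improved algorithm from the thesis, it does not include the genetic algorithm, and it does not use Hadoop or any real MapReduce framework. All function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) are hypothetical.

```python
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_phase(points, centers):
    """Map step: emit a (cluster index, point) pair for every input point."""
    for point in points:
        yield nearest(point, centers), point


def reduce_phase(pairs, centers):
    """Reduce step: average the points assigned to each cluster to get new centers."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    new_centers = list(centers)  # an empty cluster keeps its previous center
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_centers


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat map and reduce until the centers stop moving (or max_iter is reached)."""
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a real MapReduce job the map step would run in parallel over data partitions and the framework would group pairs by cluster index before the reduce step. The sensitivity to the starting `centers` argument is the initial-center issue the abstract mentions.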
Keywords/Search Tags: clustering algorithm, MapReduce, K-means, speedup