Clustering Analysis Based On Hadoop

Posted on: 2017-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: M M Li
Full Text: PDF
GTID: 2278330488450008
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, applications of big data have expanded, and all walks of life are increasingly concerned with big data. What actually matters is the information contained in the data. Various technologies make it easier to obtain useful information from big data, and clustering analysis is an important one of them. When traditional clustering algorithms process large data sets, it is difficult to achieve the expected clustering effect and the processing takes a long time, so they cannot meet the current demand for clustering massive data.

To improve the effect and performance of clustering, most research applies traditional clustering algorithms to distributed platforms. At present, clustering algorithms on distributed platforms mainly run the traditional serial clustering algorithms in parallel. The emergence of Hadoop lets users analyze large data sets on inexpensive clusters, which reduces the cost of analysis.

The main research contents of the paper are as follows:

(1) Grid clustering not only processes data quickly, but its clustering process also depends only on the number of grid cells, so the grid is introduced into the K-Means algorithm in this paper. The data within the grid is de-noised by setting a threshold.

(2) The input data of the K-Means algorithm is pre-processed by the grid: part of the data within each grid cell replaces the entire data within that cell, which reduces the amount of data involved in the calculation. The pre-processing first clusters the data within each cell and obtains k cluster centers. These k cluster centers then take part in the K-Means clustering instead of the entire data within the cell. The number of cluster centers is determined by the number k of the K-Means clustering algorithm; that is, k points selected from each cell replace the entire data within it. When a cell contains fewer than k data points, its data takes part in the final K-Means clustering directly, skipping the within-cell K-Means step above. At the same time, the cells are screened by setting a grid noise threshold.

(3) The representative points within each grid cell are optimized through the idea of "self-government": each cell decides its own number of representative points autonomously by using an elastic k value. The initial cluster centers are selected from each cell by the farthest-distance method. The possible values of k are then explored to find one at which the objective function no longer changes as k decreases, which gives the corresponding cluster centers.

(4) The clustering algorithm is implemented in a Hadoop environment and analyzed using data from UCI. The improved clustering algorithm is compared with the existing algorithm in terms of cluster quality and running time. The results show that the improved algorithm produces better clustering results. In addition, its advantage becomes more obvious as the number of nodes increases, and the quality of clustering is improved to a great extent.
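
The grid pre-processing described in items (1) and (2) can be sketched roughly as follows. This is a minimal single-machine illustration in Python, not the thesis's Hadoop implementation; the function names and the parameters `cell_size` and `noise_threshold` are assumptions introduced here, and farthest-first seeding is used for every K-Means run only to keep the example deterministic.

```python
from collections import defaultdict
from math import floor


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Pick k initial centers: the first point, then repeatedly the point
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=20):
    """Plain Lloyd-style k-means; returns the list of k cluster centers."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as its group's mean; keep the old center
        # if a group ended up empty.
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def grid_representatives(points, k, cell_size, noise_threshold):
    """Replace the points of each grid cell with at most k representatives."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(floor(x / cell_size) for x in p)].append(p)
    reps = []
    for cell_points in cells.values():
        if len(cell_points) < noise_threshold:
            continue  # sparse cell: treated as noise and dropped
        if len(cell_points) < k:
            reps.extend(cell_points)  # fewer than k points: pass through
        else:
            reps.extend(kmeans(cell_points, k))  # k centers stand in for the cell
    return reps


def grid_kmeans(points, k, cell_size, noise_threshold):
    """Final k-means over the per-cell representatives only."""
    reps = grid_representatives(points, k, cell_size, noise_threshold)
    return kmeans(reps, k)
```

For example, `grid_kmeans(points, k=2, cell_size=1.0, noise_threshold=2)` clusters a list of 2-D tuples after discarding any cell that holds a single point. In a MapReduce setting the per-cell step would be a natural candidate for work keyed by cell index, but that mapping is an assumption here rather than something the abstract states.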
Keywords/Search Tags: Clustering analysis, distributed, Hadoop, grid K-Means, elastic k