Study On Key Techniques Of Distributed Data Mining Based On Hadoop

Posted on: 2016-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2308330473955844
Subject: Computer software and theory
Abstract/Summary:
Large volumes of data such as pictures, videos and documents are generated and stored every day as computer technology develops across many fields. A critical problem for the information industry is how to mine useful knowledge from this massive data and apply it in practice. However, too much data also leads to the "data-rich, information-poor" problem, which reflects the high cost of extracting knowledge. Distributed data mining techniques have been proposed to solve this problem.

Over decades of development, various algorithms have been proposed for different tasks in distributed systems. However, how to adapt traditional algorithms to the new distributed environment remains an important problem in distributed data mining.

The open-source Hadoop distributed system has been widely adopted and has achieved considerable success at home and abroad after years of development. It is an excellent platform for building a distributed data mining system. Therefore, in this thesis, some clustering and classification algorithms are improved and optimized, and an implementation on the Hadoop platform is provided. The main work of this thesis covers the following two aspects:

(1) For the K-means++ algorithm, selecting the initial centers at random is not efficient enough. In this thesis, they are selected according to a probability, which makes the clustering process more efficient. Moreover, the sequential nature of the K-means++ algorithm makes it difficult to implement in a distributed system, and its distance calculation ignores the fact that different attributes affect the clustering results to different degrees. In this thesis, we improve the iterative process of the K-means++ algorithm so that it can be implemented on the Hadoop platform based on the Map-Reduce model.
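
The seeding and iteration ideas in item (1) can be illustrated with a minimal single-process Python sketch. It is not the thesis's implementation: the seeding rule shown is the standard K-means++ squared-distance weighting, the map and reduce steps are simulated in memory instead of running on Hadoop, and the names `seed_centers`, `map_assign`, `reduce_update` and `kmeans_iteration` are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_centers(points, k, rng):
    """Standard K-means++ seeding (an assumption, not the thesis's variant):
    after a uniform first pick, each further center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point coincides with a chosen center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def map_assign(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
    return idx, (point, 1)


def reduce_update(values):
    """Reduce step: average all points emitted for one center index."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """One map/shuffle/reduce round, simulated in a single process."""
    groups = defaultdict(list)
    for p in points:
        idx, value = map_assign(p, centers)
        groups[idx].append(value)
    # A center that received no points keeps its previous position.
    return [reduce_update(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers = seed_centers(data, 2, rng)
    for _ in range(5):
        centers = kmeans_iteration(data, centers)
    print(sorted(centers))
```

Because each map call depends only on one point and the current centers, the assignment step can run in parallel across data splits; only the per-center averaging needs the grouped output, which is what makes the iteration a natural fit for a map/reduce round.
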
To further improve the quality of clustering, we introduce attribute weights into the distance calculation, so that important attributes have a greater impact on the results. Experimental results show the effectiveness and parallel capability of the improved algorithm.

(2) The weights and network structure of the BP (back-propagation) artificial neural network, which significantly affect the classification results, are set manually. Improper values lead to very slow convergence or non-convergence, and the algorithm is also likely to become trapped in a local optimum. Choosing proper initial values is therefore very important. In this thesis, a genetic algorithm is applied to pre-train the initial values, and the network is constructed from the pre-trained initial parameter values. To further raise the training accuracy, we improve the selection method of the genetic algorithm so that it converges quickly.

Finally, the thesis gives pseudo-code for the improved algorithms based on the Map-Reduce model on the Hadoop platform. Experimental results indicate the effectiveness and parallelism of the improved algorithms.
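
To make the idea in item (2) concrete, the following Python sketch uses a genetic algorithm to search for initial weights of a very small network. The abstract does not say how the selection method is improved, so this sketch assumes plain roulette-wheel selection with elitism, one-point crossover and Gaussian mutation; the 2-2-1 network, the fitness definition and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def forward(weights, x):
    """Tiny 2-2-1 sigmoid network; `weights` is a flat list of 9 numbers."""
    def sig(z):
        # Sigmoid written via tanh so that large |z| cannot overflow.
        return 0.5 * (1.0 + math.tanh(0.5 * z))
    h0 = sig(weights[0] * x[0] + weights[1] * x[1] + weights[2])
    h1 = sig(weights[3] * x[0] + weights[4] * x[1] + weights[5])
    return sig(weights[6] * h0 + weights[7] * h1 + weights[8])


def fitness(weights, samples):
    """Higher is better: 1 / (1 + mean squared error) on the samples."""
    mse = sum((forward(weights, x) - y) ** 2 for x, y in samples) / len(samples)
    return 1.0 / (1.0 + mse)


def roulette_select(population, scores, rng):
    """Fitness-proportional (roulette-wheel) selection of one individual."""
    return rng.choices(population, weights=scores, k=1)[0]


def next_generation(population, samples, rng, mutation_scale=0.1):
    """Build one new generation of the same size as `population`."""
    scores = [fitness(ind, samples) for ind in population]
    best = population[scores.index(max(scores))]
    children = [list(best)]  # elitism: the best individual survives unchanged
    while len(children) < len(population):
        a = roulette_select(population, scores, rng)
        b = roulette_select(population, scores, rng)
        cut = rng.randrange(1, len(a))  # one-point crossover
        child = a[:cut] + b[cut:]
        child = [w + rng.gauss(0.0, mutation_scale) for w in child]
        children.append(child)
    return children


def pretrain_initial_weights(samples, rng, pop_size=20, generations=30, dim=9):
    """Return the fittest weight vector found after a fixed number of generations."""
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population = next_generation(population, samples, rng)
    return max(population, key=lambda ind: fitness(ind, samples))


if __name__ == "__main__":
    xor = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
           ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    start = pretrain_initial_weights(xor, random.Random(0))
    print(fitness(start, xor))
```

The returned vector would then serve as the starting point for ordinary back-propagation training, which is not shown here. Scoring each individual is independent of the others, so in a Map-Reduce setting the fitness evaluation is the natural part to distribute.
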
Keywords/Search Tags:Hadoop, distributed data mining, K-means++ algorithm, BP artificial neural network