Font Size: a A A

Research Of Adaptive Local Hyperplane K-nearest Neighbor Classification Algorithm Based On Hadoop Platform

Posted on:2012-04-28Degree:MasterType:Thesis
Country:ChinaCandidate:C H ZouFull Text:PDF
GTID:2218330335495527Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Adaptive local hyperplane K Nearest Neighbors Classification Algorithm (ALH) is an improvement on the KNN algorithm, which has a more excellent performance in many data sets. However, we found that ALH algorithm is large-scale and complex in computing. For each testing instance, ALH has to calculate the distance between the testing instance and the local hyperplane constrcuted by its K nearest neighbor of each class, with some complex matrix calculation such as matrix multiplication, matrix inverse and so on. In a number of experiments we find that the traditional ALH algorithm can only apply to small-scale data. When the data dimension increases or the number of samples increases, the traditional ALH algorithm will obviously slow down because of the rapid increasement of calculation, which is the bottlenecks of many traditional data mining algorithms.Under the circumstance of the massive calculation, we do find that when we use ALH to calculate such hyperplane distance on some data sets (such as the high-dimensional data set with large number of samples), we have to finish some very time-consuming matrix operations:the sum, multiplication, or subtraction of two matrixes with tens of thousands of elements, or the inversion of a large matrix. In this paper, we will discuss the ALH algorithm and analyze the proportion of running time in some important steps of ALH. Then we find that the most time-consuming steps are as follow:finding the k neighbors of one testing instance in every class, and calculating the distance between the testing instance and the local hyperplane constrcuted by its K nearest neighbor of each class. In addition, in our experiments we find that with the increasement of the value of k, the running time of calculating the local hyperplane distance will increase more obviously.Recent years, with the rise of the concept of cloud computing, massive data processing and massive calculation are gradually migrating to cloud computing platform, which is a clear trend. Hadoop, as the most outstanding open source implementation of cloud computing, is used in many areas, including data mining and machine learning. The Hadoop platform is good at handling large-scale data and large-scale computing. For the above reasons, we try to make ALH run on Hadoop platform. When we correctly redesign the algorithm and make it run on Hadoop, we can use the power of Hadoop clusters and solve these problems. In this paper, the proportion of some import steps' running time is analyzed, and then we redesign and parallelize some important steps of ALH algorithm. Based on the research of the Hadoop platform, we propose the corresponding MapReduced ALH algorithm, enabling the ALH algorithm running on Hadoop. With the effective use of the computing power of multiple computers, in the experiments, we can see that the ALH algorithm based Hadoop can effectively schedule computing resources of each computer and dramatically reduce the running time on some high dimensional data sets.
Keywords/Search Tags:adaptive local hyperplane K nearest neighbor classification algorithm, ALH, Hadoop platform, high dimensional data, MapReduce
PDF Full Text Request
Related items