
Research On Improved K Nearest Neighbor Algorithm Based On Spark Cloud Computing Platform

Posted on: 2017-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: X K Chen
Full Text: PDF
GTID: 2308330485969650
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, society has entered the era of big data. Faced with massive volumes of data, how users can quickly find the information that meets their needs has become a pressing research problem, and users' demands for efficient information retrieval are more stringent than before. Data mining methods can extract accurate information from massive data efficiently. Classification algorithms in data mining predict trends in the data and are an effective means of recommending data that matches a user's wishes. Among them, the K nearest neighbor classification algorithm is a common choice for querying large-scale spatial data. Running data mining on a cloud computing platform can further increase its speed. The Spark cloud computing platform is a further optimization of the Hadoop-based cloud platform: unlike Hadoop, which depends entirely on the HDFS file system, Spark can process datasets in memory, which further improves the data processing speed of the platform. To let users quickly obtain the information they need, this thesis runs a parallel, improved K nearest neighbor classification algorithm on the Spark cloud computing platform in order to deliver recommended data quickly. The thesis makes two contributions. First, it improves the index creation technique of the K nearest neighbor classification algorithm, which makes finding nearest-neighbor data points more efficient. The main bottleneck of the traditional K nearest neighbor classification algorithm is the low efficiency of finding nearest-neighbor data points, so a k-d tree is used as the index data structure to speed up the search. The running cost of a k-d tree lies mainly in the non-leaf nodes visited during the nearest neighbor search.
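The k-d tree search described above can be sketched as follows. This is a minimal illustration of a standard k-d tree K nearest neighbor query, not the thesis's improved index, whose details the abstract does not give. The function names (`build_kdtree`, `knn_search`, `classify`) and the `visited` counter are hypothetical, introduced only to show where backtracking into the far side of a split adds node visits.

```python
import heapq
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Build a k-d tree from (coords, label) pairs, splitting on the median
    and cycling through the axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn_search(node, query, k, heap=None, stats=None):
    """Collect the k nearest points in a max-heap of (-sq_dist, coords, label).
    stats["visited"] counts every tree node the search touches."""
    if heap is None:
        heap = []
    if stats is None:
        stats = {"visited": 0}
    if node is None:
        return heap, stats
    stats["visited"] += 1
    coords, label = node["point"]
    d = sq_dist(coords, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, coords, label))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, coords, label))
    axis = node["axis"]
    diff = query[axis] - coords[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn_search(near, query, k, heap, stats)
    # Backtrack into the far subtree only if the current search radius
    # crosses the splitting plane (or fewer than k neighbors are known).
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_search(far, query, k, heap, stats)
    return heap, stats


def classify(tree, query, k):
    """Majority vote over the k nearest labels; also returns nodes visited."""
    heap, stats = knn_search(tree, query, k)
    votes = Counter(label for _, _, label in heap)
    return votes.most_common(1)[0][0], stats["visited"]


if __name__ == "__main__":
    random.seed(0)
    data = [((random.random(), random.random()), i % 3) for i in range(200)]
    print(classify(build_kdtree(data), (0.5, 0.5), 5))
```

The fewer times the backtracking condition holds, the fewer non-leaf nodes the query visits, which is the cost the abstract identifies.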
By optimizing the k-d tree, this thesis reduces the probability that the search region around a data point to be classified intersects a splitting line or plane. This reduces the number of nodes the K nearest neighbor classification algorithm visits while backtracking during the nearest neighbor search, and so improves its classification efficiency. Second, the thesis optimizes the running environment of the algorithm to improve its efficiency. The improved K nearest neighbor classification algorithm is implemented in parallel so that it fits the data processing model of a cloud computing platform. Running the improved algorithm in parallel on the Spark cloud computing platform greatly increases its classification speed while preserving its accuracy. The experiments use datasets from the UCI machine learning repository. They compare the efficiency and accuracy of the conventional and improved K nearest neighbor classification algorithms on a single machine, and they compare the processing efficiency of the improved algorithm on a single machine with its efficiency on the Spark cloud platform. The conclusion is that running the improved K nearest neighbor classification algorithm on the Spark cloud platform greatly improves efficiency while preserving classification accuracy.
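The parallel scheme can be illustrated with a small partition-and-merge sketch. The abstract does not say how the thesis distributes the work, so this assumes one common pattern: each partition finds its own k nearest candidates, and the candidates are merged into a global top k before voting. Plain Python threads stand in for Spark workers here, a brute-force scan stands in for the per-partition index, and the names `local_top_k` and `parallel_knn_classify` are hypothetical. On Spark, the per-partition step would roughly correspond to an operation such as `RDD.mapPartitions`.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_top_k(partition, query, k):
    """Per-partition step: the k nearest (sq_dist, label) pairs in one partition."""
    return heapq.nsmallest(k, ((sq_dist(c, query), label) for c, label in partition))


def parallel_knn_classify(partitions, query, k, workers=4):
    """Run local_top_k on every partition concurrently, merge the candidates
    into a global top k, and return (majority label, merged neighbors)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local = list(pool.map(lambda part: local_top_k(part, query, k), partitions))
    merged = heapq.nsmallest(k, (cand for cands in local for cand in cands))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0], merged


if __name__ == "__main__":
    # Toy data: two clusters, split into three partitions (assumed layout).
    data = [((0.1 * i, 0.1 * i), "a") for i in range(10)]
    data += [((5 + 0.1 * i, 5 + 0.1 * i), "b") for i in range(10)]
    partitions = [data[i::3] for i in range(3)]
    print(parallel_knn_classify(partitions, (0.3, 0.2), k=3))
```

Because every global nearest neighbor is also among the k nearest within its own partition, the merge step returns the same neighbors as a single-machine scan, which is consistent with the abstract's claim that accuracy is preserved.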
Keywords/Search Tags:Spark, cloud platform, classification, data mining, K nearest neighbor