
A Prototype Selection Algorithm Based On Natural Neighbor And MST

Posted on: 2017-09-20    Degree: Master    Type: Thesis
Country: China    Candidate: L J Duan    Full Text: PDF
GTID: 2348330503965774    Subject: Computer software and theory
Abstract/Summary:
In data mining and machine learning, the k-nearest neighbor (KNN) method has seen wide development and application because it is simple and effective. However, traditional KNN has two major limitations: the parameter k is difficult to choose, and the time and space requirements become excessive on large data sets. Different values of k may have a significant impact on the effectiveness of the algorithm. Moreover, the training set used for KNN classification usually contains noise and much useless, redundant information, which not only inflates the computational cost but may also reduce classification accuracy when such data are used directly. It is therefore necessary, for pattern recognition, classification learning, and related problems, to preprocess the original training set effectively. Data reduction is an important form of such preprocessing, and prototype selection is a common data reduction method: it selectively reduces the original data set to obtain a prototype set with a high contribution to classification. A representative prototype set should both reflect the distribution of the original data set and ensure that classification accuracy based on it does not decrease, and ideally improves. After a data set is processed by a prototype selection algorithm, its size is reduced, so storage requirements drop and learning tasks run more efficiently; at the same time, classification accuracy on the data set can also improve.

Although the high time and space complexity of the KNN algorithm has been studied for many years, the problem remains unsolved in practice. The goal of most prototype selection algorithms is to achieve a low prototype retention rate together with high classification accuracy. In order to reduce the size of the data set while ensuring that the classification accuracy based on the final retained prototype set is not reduced, and is ideally improved, this thesis studies and analyzes a number of prototype selection algorithms and presents a new one: a prototype selection algorithm based on natural neighbors and the minimum spanning tree. The main contents are as follows:

(1) The concepts and related issues of k-nearest neighbor classification and prototype selection are summarized and elaborated, different categorizations of prototype selection algorithms are given, and the relationship between the k-nearest neighbor classifier and the prototype selection problem is described. Finally, several prototype selection algorithms, both classical and recently proposed, are described in detail.

(2) To address the difficulty of choosing the parameter k in KNN, the concept of the natural neighbor is introduced. The characteristics of natural neighbors are then evaluated on data sets with different structures, such as uniform and Gaussian distributions, and at different scales. In particular, the stability and trend of the average number of natural neighbors are studied across these data structures and scales, and several useful adaptive nearest-neighbor graphs are constructed; a minimal sketch of the natural-neighbor search is given below.
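The natural-neighbor idea can be illustrated with a short sketch. It assumes the usual formulation, in which a candidate neighborhood size r grows round by round, two points become natural neighbors once each appears in the other's current nearest-neighbor list, and the search stops when every point has been chosen as a neighbor at least once, so no k has to be supplied by the user. The function and variable names (natural_neighbor_search, the returned adaptive r) are illustrative, not taken from the thesis:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X):
    """Grow the neighborhood size r until every point has a reverse neighbor."""
    n = len(X)
    # Full neighbor ranking for every point (O(n^2) memory; fine for a sketch).
    _, knn = NearestNeighbors(n_neighbors=n).fit(X).kneighbors(X)
    knn = knn[:, 1:]                          # drop each point itself
    neighbors = [set() for _ in range(n)]     # natural neighbors of each point
    reverse_count = np.zeros(n, dtype=int)    # times each point was chosen
    r = 0
    while r < n - 1 and np.any(reverse_count == 0):
        for i in range(n):
            j = knn[i, r]                     # i's (r+1)-th nearest neighbor
            reverse_count[j] += 1
            if i in knn[j, :r + 1]:           # mutual choice -> natural neighbors
                neighbors[i].add(j)
                neighbors[j].add(i)
        r += 1
    return neighbors, r                       # r plays the role of an adaptive k

# Example: on uniformly distributed data the search typically stops at a small
# r, and the average number of natural neighbors stays stable.
X = np.random.rand(200, 2)
nb, r = natural_neighbor_search(X)
print(r, np.mean([len(s) for s in nb]))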
Experiments show that the average number of natural neighbors is more stable on uniformly distributed data sets than on Gaussian-distributed ones, and that this value is easily influenced by outliers.

(3) To address the problems that existing prototype selection algorithms do not reduce the data enough and that their classification accuracy declines, a prototype selection algorithm based on natural neighbors and the minimum spanning tree (PS2NM) is proposed: key prototypes that contribute most to classification are retained, while most points that are useless for classification are removed. Unlike other prototype selection algorithms, the proposed algorithm uses the natural neighbor, a novel neighbor concept, to preprocess the data set, and then builds a minimum spanning tree according to special control strategies. Based on this tree, the most useful boundary prototypes are kept, and a small number of internal representative prototypes are generated and preserved (a simplified sketch of this MST-based step is given after the experimental summary). Because the preprocessing is based on natural neighbors rather than on KNN, the parameter selection problem is effectively eliminated. Experiments on artificial data sets show that the PS2NM algorithm can effectively remove noisy data and redundant prototypes, and that the retained prototype set has the same distribution as the original training set. In experiments on UCI data sets, PS2NM is compared with traditional prototype selection algorithms (CNN, ENN) and recent ones (TRKNN, PSC, BNNT); the results show that the proposed algorithm effectively reduces the number of prototypes, and that the final prototype sets it chooses maintain the same level of classification accuracy as traditional KNN. Moreover, when both classification accuracy and prototype retention are considered, the proposed algorithm still has more advantages than the other prototype selection algorithms.
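The following is a minimal sketch of MST-based prototype selection under a simplified reading of the abstract: a minimum spanning tree is built over the (already denoised) training set, the endpoints of edges that join different classes are kept as boundary prototypes, and one centroid per class is added as an internal representative. It does not reproduce the thesis's special control strategies for constructing the tree; all names are illustrative:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def select_prototypes(X, y):
    """Keep boundary points of the MST plus one centroid per class."""
    D = squareform(pdist(X))                 # dense pairwise distances
    mst = minimum_spanning_tree(D).tocoo()   # edges of the minimum spanning tree
    keep = set()
    for i, j in zip(mst.row, mst.col):
        if y[i] != y[j]:                     # edge crossing the class boundary
            keep.update((int(i), int(j)))
    idx = sorted(keep)
    prototypes, labels = [X[idx]], [y[idx]]
    # One internal representative per class (illustrative choice: the centroid).
    for c in np.unique(y):
        prototypes.append(X[y == c].mean(axis=0, keepdims=True))
        labels.append(np.array([c]))
    return np.vstack(prototypes), np.concatenate(labels)

# Usage: two well-separated blobs; new points would then be classified with
# 1-NN or KNN against the reduced prototype set instead of the full data.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
y = np.repeat([0, 1], 100)
P, q = select_prototypes(X, y)
print(len(P), "prototypes kept out of", len(X))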
Keywords/Search Tags:K nearest neighbor, prototype selection, natural neighbor, minimum spanning tree, classification