Font Size: a A A

A Relative Position View-based Instances Set Reduction Algorithm

Posted on:2014-03-07Degree:MasterType:Thesis
Country:ChinaCandidate:Y L YuFull Text:PDF
GTID:2268330395989196Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the soaring development of data acqusition and data storage techniques, even more and larger datasets are easily confronted in machine learning. As a consequence, the complexity of most learning algorithms increases at least linearly with the size of datasets. In order to avoid excessive storage and time consuming, and possibly to improve generalization accuracy by removing noise, researches about instance-set reduction for large datasets have always been interesting and challenging in machine learning.Based on previous brilliant works, we conducted further researches on instance-set reduction algortihms’reltated issues and categotization and their comparisons and so on. The major works of this dissertation are as follows:1. Reduction algorithms are highly correlated to other issues. Therefore, in this paper, we discussed several issues related to the problem of instance set reduction, providing a framework for furture discussion and designing of individual reduction algorihms. Those issues included representation chioce for instances, the direction of searching for reduction, evaluation strategies, similarity function and PCA, KD-Tree with Wilcoxon signed rank test.2. Reduction algorihms can be roughly classified into prototype selection and prototype generation based on their differences on producing reduced datasets. As the basic theories of those two kinds of reduction algorithms are different, we have made wide comparisons by analysing their weakness and strengths.3. In this paper, we propose a novel instance-based learning algorithm based on the Relative Position view, namely RePo. By observing the data points locally assembled as micro-cluster in a full view of dataset, which indicates points’ redundancy. As we have learned that one data point’s importance is implied by its nearest neighbor, that is to say, a point’s relative postion defines its importance. We treat each instance from training set as a p rototype, and develop two understandable definitions of replaceable structure, which result in simple and effective principles of dealing with points:retain the border points, delete the noisy and reduce the internal ones. By generating new prototypes, deleting noise and close border points, RePo becomes quite effective in storage reduction and achieve4. To objectively and systematically evaluate RePo’s performance, we’ve chose17existing classical and effective reduction techniques to compare with RePo performing on16UCI datasets during classification task in terms of generalization accuracy, reduction size and the tradeoff of both. Wilcoxon signed rank test (a=0.1) has been employed to verify whether the differences between RePo and others are statistically significant. It has been demonstrated that RePo outperforms the most of others in terms of storage requirment while guarantees generalization accuracy quite close to the best. The performance of RePo is quite close to SSMA and RMHC, while the average generalization accuracy achieved by RePo is9.5%and1.3%respectively higher than the former two.After all, RePo retains important border points, edits out the noisy and reduces most internal points by generating artificial data points, which has achieved the goal of combination of prototype selection and prototype generation. In our experiments, RePo has demonstrated its effectivity and its outstanding ability in reducing datasets’storage requirement while guaranteeing their generalization ability. But that is just part of the whole picture. RePo still shares its poor designing of termination and redundancy detecting, which leaves much room to advance.
Keywords/Search Tags:Nearest Neighbor Rule, Instance-set Reduction Algorithm, PrototypeSelection, Prototype Generation
PDF Full Text Request
Related items