Research On Instance Selection Algorithms For Machine Learning

Posted on: 2014-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
GTID: 2248330395497730
Subject: Computer application technology
Abstract/Summary:
In machine learning, training datasets often contain a large amount of noisy and useless information, which not only increases computational cost but also degrades the accuracy of learning algorithms. Effective data preprocessing is therefore a prerequisite for classification and pattern recognition. Data reduction is an important preprocessing technique, and instance selection, a common data reduction method, removes noisy and redundant instances from a dataset to obtain a compressed subset whose performance is similar to that of the original dataset. On the one hand, it reduces the size of the dataset and improves processing efficiency. On the other hand, it can improve classification accuracy.

Traditional instance selection algorithms usually treat these two goals separately: they either focus on improving classification accuracy while neglecting reduction, or reduce the dataset as much as possible at the cost of classification accuracy. To address this imbalance between classification accuracy and reduction, and based on a study and analysis of existing instance selection algorithms, this thesis proposes a new instance selection algorithm that aims to reduce dataset size effectively while also improving classification accuracy: the redundant instance pair elimination algorithm, which is based on similar instance pairs. The main work is as follows:

1. Summarize the concepts and problems commonly involved in instance selection. Introduce the different ways of categorizing instance selection algorithms, and present several common instance selection algorithms in detail according to their category. Expound the relationship between instance selection and k-nearest neighbor classification.

2. Propose the concept of the Nearest Similar Pair, give its definition, and discuss its characteristics.
The Nearest Similar Pair concept describes the redundant instances within a dataset well. By computing the Nearest Similar Pairs present in a dataset and removing the instances that meet the elimination criteria, the redundant instance pair elimination (RIPE) algorithm is constructed. Experiments on 10 standard UCI datasets and one artificial dataset show that the algorithm reduces dataset size effectively while achieving higher classification accuracy than the original sample sets. A comparison with the ENN algorithm shows that it matches ENN in classification accuracy while improving the average storage compression ratio by about 19%.

3. Analyze the advantages and disadvantages of the RIPE algorithm, and extend it. Because RIPE performs only one iteration, this thesis adds repeated iteration on top of RIPE to construct a repeated redundant instance pair elimination (RRIPE) algorithm. Comparative experiments show that RRIPE achieves a higher storage compression ratio, significantly reducing dataset size, and that classification accuracy improves over the original datasets for some of the datasets. Compared with RIPE, the advantage of the iterative algorithm is its high storage compression ratio, so RRIPE is of some value for processing large datasets.
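
For readers unfamiliar with the ENN baseline mentioned in the abstract, the following is a minimal sketch of Wilson's Edited Nearest Neighbor rule, a repeated variant that re-applies it until nothing more is removed, and one simple way to express size reduction. It is not the thesis's RIPE or RRIPE algorithm, whose Nearest Similar Pair definition is not given in this abstract. The function names, the toy data, k = 3, and the definition of reduction as the fraction of instances removed are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def enn_keep_mask(points, labels, k=3):
    """Return one keep/remove flag per instance (True = keep).

    ENN rule: an instance is kept only if the majority label among
    its k nearest neighbours matches its own label.
    """
    n = len(points)
    keep = []
    for i in range(n):
        # Rank all other instances by Euclidean distance to instance i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        neighbours = others[:k]
        if not neighbours:
            # A lone instance has no neighbours to disagree with it.
            keep.append(True)
            continue
        votes = Counter(labels[j] for j in neighbours)
        majority_label = votes.most_common(1)[0][0]
        keep.append(majority_label == labels[i])
    return keep


def repeated_enn(points, labels, k=3):
    """Apply the ENN rule repeatedly until no instance is removed."""
    points, labels = list(points), list(labels)
    while True:
        keep = enn_keep_mask(points, labels, k)
        if all(keep):
            return points, labels
        points = [p for p, flag in zip(points, keep) if flag]
        labels = [lab for lab, flag in zip(labels, keep) if flag]


def storage_reduction(n_original, n_kept):
    """Fraction of instances removed (an assumed definition)."""
    return 1.0 - n_kept / n_original


# Toy data (hypothetical): two clusters plus one mislabelled point
# of class "b" sitting inside the "a" cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

mask = enn_keep_mask(points, labels, k=3)
kept_points, kept_labels = repeated_enn(points, labels, k=3)
reduction = storage_reduction(len(points), len(kept_points))
```

On this toy data only the mislabelled point is flagged for removal, which illustrates why noise filters such as ENN tend to achieve modest reduction: they target noisy instances and leave redundant ones in place. The repeated variant shows the general shape of an iterate-until-stable loop, the same kind of extension the abstract describes for turning RIPE into RRIPE.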
Keywords/Search Tags: machine learning, data classification, instance selection, k-nearest neighbor, Nearest Similar Pair