
Research And Application Of Instance Selection For Lazy Learning

Posted on: 2008-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: W Tang
GTID: 2178360245497979
Subject: Computer Science and Technology

Abstract/Summary:
Lazy learning differs from traditional eager learning and offers many advantages: lower training cost, a richer hypothesis space, better asymptotic learning ability, and natural support for incremental learning tasks. It has therefore been widely applied in data mining and web information processing. However, lazy learning suffers from expensive query cost, because classifying a query instance requires computing the distance between it and every training instance. To address this problem, the first instance selection algorithm was proposed shortly after lazy learning appeared, and new instance selection algorithms are still being designed today. Instance selection is thus clearly an important way to improve the efficiency of lazy learning, yet existing instance selection algorithms still have many limitations. This thesis contributes toward resolving those limitations.

First, we study a special kind of instance neighborhood. We identify the common ground of a family of classical algorithms: all are based on the neighborhood bounded by an instance's nearest unlike neighbor (its nearest neighbor with a different class label), and all explicitly or implicitly use the two homogeneous instance sets (sets of instances sharing the instance's class label) derived from this neighborhood. The special neighborhood and the two homogeneous sets are therefore clearly useful for instance selection. However, these algorithms all overlook the importance of the nearest unlike neighbor that bounds the neighborhood. We therefore define two new instance sets derived from an instance's nearest unlike neighbors: the nearest unlike neighbor set (NearDiffSet) and the unlike instance coverage set (DiffCoverage).
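The abstract names the two sets but does not spell out their construction. The sketch below is one plausible reading, under the assumption that NearDiffSet(x) holds x's nearest unlike neighbor(s) (ties kept) and DiffCoverage(x) holds the unlike instances whose own nearest unlike neighbor is x; the function names and this interpretation are ours, not the thesis's.

```python
import numpy as np

def nearest_unlike_neighbor(X, y, i):
    """Index of the training instance closest to X[i] with a different label."""
    diff = np.where(y != y[i])[0]                     # indices of unlike instances
    d = np.linalg.norm(X[diff] - X[i], axis=1)        # distances to each of them
    return diff[np.argmin(d)]

def near_diff_set(X, y, i):
    """Assumed NearDiffSet(i): x_i's nearest unlike neighbor(s), keeping ties."""
    diff = np.where(y != y[i])[0]
    d = np.linalg.norm(X[diff] - X[i], axis=1)
    return set(diff[d == d.min()])

def diff_coverage(X, y, i):
    """Assumed DiffCoverage(i): unlike instances whose nearest unlike
    neighbor is x_i, i.e. the unlike instances that x_i 'covers'."""
    diff = np.where(y != y[i])[0]
    return {j for j in diff if i in near_diff_set(X, y, j)}
```

Instances with a non-empty DiffCoverage lie on the decision border as seen from the other class, which is why such sets are natural candidates for boundary-oriented instance selection.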
We then show how the two sets can be used for instance selection and present a Boundary Instance Selection algorithm (BIS) based on them. We evaluate BIS on synthetically generated data and on data sets from the UCI machine learning repository. The results show that BIS greatly reduces the size of the training set and achieves good accuracy on many data sets, but its accuracy is poor on some others, which leads us to examine instance selection from a deeper perspective: an instance's classification potential.

Second, we study evaluation functions for an instance's classification potential. Classical instance selection algorithms ignore the influence of overlapping neighborhoods, which can make the evaluation of an instance's classification potential imprecise. We therefore introduce a more precise concept, Relative Coverage of same-class instances (RC), to evaluate an instance's potential to classify instances of its own class correctly. Moreover, because instances' NearDiffSet and DiffCoverage sets may also overlap, we introduce Relative Coverage of different-class instances (RDC) to evaluate an instance's potential to maintain the classification decision border. Finally, by combining RC with RDC, we define a more precise and comprehensive evaluation function of an instance's classification potential as the basis of an improved instance selection algorithm.

Third, we present an improved instance selection algorithm. Directly using the classification potential evaluation function for instance selection would require optimizing a classification potential threshold; we avoid this problem by adopting a consistent subset selection approach. We then present the Instance Classification Potential Evaluation based Consistent Subset Selection algorithm (IPECSS).
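The abstract does not give IPECSS's pseudocode, so the sketch below shows only the generic greedy consistent-subset idea it builds on (in the spirit of Hart's condensed nearest neighbor rule): scan instances in decreasing order of a potential score and add any instance the current subset misclassifies, until 1-NN on the subset classifies the whole training set correctly. The `score` argument stands in for the combined RC/RDC potential, whose exact form the abstract does not specify.

```python
import numpy as np

def one_nn_predict(Xs, ys, x):
    """Label of the 1-nearest neighbor of x within the stored subset."""
    return ys[np.argmin(np.linalg.norm(Xs - x, axis=1))]

def consistent_subset(X, y, score):
    """Greedy consistent-subset selection: instances are visited in
    decreasing score order; any instance the current subset misclassifies
    is added, and passes repeat until the subset is consistent (every
    training instance is classified correctly by 1-NN on the subset)."""
    order = np.argsort(-score)        # highest-potential instances first
    subset = [order[0]]               # seed with the top-scored instance
    changed = True
    while changed:
        changed = False
        for i in order:
            if one_nn_predict(X[subset], y[subset], X[i]) != y[i]:
                subset.append(i)
                changed = True
    return sorted(subset)
```

Consistency guarantees that accuracy on the training set is preserved, which is why no potential threshold needs to be tuned: the score only determines the order in which candidates are considered.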
We analyze IPECSS experimentally and compare it with classical instance selection algorithms on both synthetically generated data and 32 data sets from the UCI repository. The results show that IPECSS greatly reduces the size of the stored set while obtaining accuracy equal to or better than that of the full training set, and remarkably improves classifier efficiency.

Finally, collaborative filtering systems based on lazy learning suffer from poor classification efficiency and quality, so we apply IPECSS to improve them. Experimental results confirm the performance of IPECSS.
Keywords/Search Tags: Lazy learning, instance selection, classification potential evaluation, consistent subset