
Research on Cost-Sensitive Feature Selection Algorithm Based on Semi-Greedy Strategy

Posted on: 2016-04-21    Degree: Master    Type: Thesis
Country: China    Candidate: J Xu    Full Text: PDF
GTID: 2208330470452886    Subject: Computer application technology
Abstract/Summary:
As one of the ten most challenging problems in data mining research, cost-sensitive learning has developed significantly. Cost-sensitive feature selection is a classic sub-problem of cost-sensitive learning; its goal is to obtain a feature subset with minimal test cost, misclassification cost, or total cost. Many algorithms have been proposed for test-cost-sensitive feature selection. Heuristic search algorithms are efficient, but they often converge to local optima or obtain only approximately optimal solutions. Stochastic search algorithms usually produce better solutions, but their running time is unsatisfactory. In this thesis, we propose a semi-greedy heuristic search procedure (SGHP) that combines heuristic and stochastic search, and we apply it to two cost-sensitive feature selection problems. The main contributions of this thesis are as follows.

(1) We design SGHP by combining heuristic and stochastic search. It rests on two key techniques: constructing feature subsets with a semi-greedy strategy, and a competition mechanism. A greedy algorithm searches for the global optimum through locally optimal choices, but it often converges to a local optimum. The semi-greedy strategy instead selects randomly among the current best k features, which lets the search escape local optima and yields diverse solutions. A competition mechanism over p candidate solutions then increases the chance of reaching the globally optimal solution.

(2) Based on the approximation theory of rough sets and on information entropy, we propose two algorithms, SHCF and SFTC, both built on the semi-greedy heuristic search strategy. They address, respectively, the minimal-test-cost feature selection problem and the feature selection problem under a test cost constraint. Two parameters, k and p, govern their performance. If k is small, the semi-greedy algorithm coincides with the greedy one; if k is large, more candidate subsets of poor quality are generated. Setting p > 1 improves the quality of the results but lengthens the running time, and there is an upper limit beyond which a larger p no longer helps.

(3) For the proposed algorithms SHCF and SFTC, we conduct extensive experiments on four real-world datasets from UCI (University of California, Irvine) under three test cost distributions. To tune the parameters, we fix p while varying k, and fix k while varying p, identifying a suitable trade-off value of k and an acceptable p. Comparisons with existing heuristic and genetic algorithms show that the proposed semi-greedy algorithms largely avoid the drawbacks of the greedy ones and obtain optimal results.
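To make the procedure concrete, the following is a minimal Python sketch of the semi-greedy construction with the competition mechanism described above. The evaluation function `evaluate` (returning whether a subset is adequate and its cost) is a hypothetical stand-in for the rough-set and information-entropy criteria used in SHCF and SFTC; only the roles of the parameters k and p follow the description in this abstract.

```python
import random

def semi_greedy_select(features, evaluate, k=3, p=5, seed=None):
    """Semi-greedy heuristic search with a competition mechanism (sketch).

    features : list of candidate feature indices
    evaluate : hypothetical callable(subset) -> (is_adequate, cost),
               standing in for the rough-set / entropy-based criteria
    k        : at each step, pick randomly among the k best features
    p        : number of candidate subsets competing for the final answer
    """
    rng = random.Random(seed)
    best_subset, best_cost = None, float("inf")

    for _ in range(p):                       # competition among p candidates
        subset, remaining = [], list(features)
        while remaining:
            # rank the remaining features by the cost of adding each one
            ranked = sorted(remaining, key=lambda f: evaluate(subset + [f])[1])
            # semi-greedy step: choose randomly among the k best candidates
            chosen = rng.choice(ranked[:k])
            subset.append(chosen)
            remaining.remove(chosen)
            adequate, _ = evaluate(subset)
            if adequate:                     # stop once the subset suffices
                break
        adequate, cost = evaluate(subset)
        if adequate and cost < best_cost:    # keep the cheapest adequate subset
            best_subset, best_cost = subset, cost

    return best_subset, best_cost
```

With k = 1 the construction degenerates to a plain greedy search, and with p = 1 the competition mechanism is disabled, which mirrors the parameter behavior discussed in point (2).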
Keywords/Search Tags: Cost-sensitive learning, Rough set, Information entropy, Feature selection, Semi-greedy