
Research On Attribute Reduction And Decision Tree In Cost-sensitive Learning

Posted on: 2015-02-14    Degree: Master    Type: Thesis
Country: China    Candidate: Z L Xu    Full Text: PDF
GTID: 2298330467466683    Subject: Computer application technology
Abstract/Summary:
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

We focus on two important types of costs, namely test costs and misclassification costs. Test costs are the time, money, or other resources spent in obtaining data items related to some objects; misclassification costs correspond to the penalty of deciding that an object belongs to class A when its real class is B. Because such costs must be taken into account, cost-sensitive learning is an active research topic.

Recently, researchers have proposed backtracking and heuristic algorithms for problems in cost-sensitive rough sets. Backtracking can find the optimal solution, but it is not efficient for large datasets. Existing heuristic algorithms, including information-gain-based and genetic algorithms, are not sufficiently effective. This dissertation designs algorithms based on ant colony optimization and simulated annealing to deal with problems in cost-sensitive rough sets.

The decision tree is an effective classification approach in data mining and machine learning. Cost-sensitive decision tree algorithms have been presented to deal with cost-sensitive classification. Existing algorithms handle only nominal data, while a large amount of numeric data arises in applications. This dissertation develops a decision tree algorithm based on C4.5 to tackle this problem and presents several techniques to improve the effectiveness of our algorithm.

This dissertation consists of two parts.

In the first part, we study attribute reduction based on cost-sensitive rough sets. On the one hand, we develop two algorithms, based on ant colony optimization and simulated annealing respectively, to tackle the minimal test cost attribute reduction problem. Experimental results show that our algorithms significantly outperform existing algorithms. On the other hand, we design an algorithm based on simulated annealing for the minimal cost feature selection problem. Experimental results indicate that the new algorithm is effective and efficient for large datasets.

In the second part, we study cost-sensitive decision trees. First, we develop a cost-sensitive decision tree algorithm based on C4.5 to deal with cost-sensitive classification. Experiments indicate that our algorithm can handle the cost-sensitive classification of numeric data. Second, the competition approach and post-pruning are employed to improve the effectiveness of the cost-sensitive decision tree; experimental results indicate that both techniques improve its effectiveness. Third, we develop a probabilistic pruning strategy to decrease the average cost of decision trees. Two probabilistic post-pruning techniques, namely static probabilistic pruning and dynamic probabilistic pruning, are proposed for cost-sensitive decision trees. Experimental results demonstrate that probabilistic pruning outperforms post-pruning without the probabilistic mechanism.
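The trade-off between test costs and misclassification costs described above can be expressed as a single objective: the cost of acquiring the selected attributes plus the penalty of the resulting classification errors. The following Python sketch illustrates one common formulation of this total cost; the function and parameter names (total_cost, test_costs, mis_cost, and so on) are illustrative assumptions, not code taken from the dissertation.

    # Total cost of a cost-sensitive classification scheme:
    # the test cost of the attributes that are measured, plus the
    # misclassification penalty for every wrongly labelled object.
    def total_cost(selected, test_costs, y_true, y_pred, mis_cost):
        """selected   : attribute indices chosen by the reduction algorithm
           test_costs : dict mapping attribute index -> cost of measuring it
           y_true     : list of true class labels
           y_pred     : labels predicted using only the selected attributes
           mis_cost   : dict mapping (true_class, predicted_class) -> penalty
        """
        acquisition = sum(test_costs[a] for a in selected)
        penalty = sum(mis_cost[(t, p)]
                      for t, p in zip(y_true, y_pred) if t != p)
        return acquisition + penalty

    # Example: two attributes are measured, and one object of real class "B"
    # is decided to belong to class "A", incurring the corresponding penalty.
    print(total_cost(
        selected=[0, 2],
        test_costs={0: 3.0, 1: 5.0, 2: 2.5},
        y_true=["B", "A"],
        y_pred=["A", "A"],
        mis_cost={("B", "A"): 10.0, ("A", "B"): 4.0},
    ))  # 3.0 + 2.5 + 10.0 = 15.5

Minimal test cost attribute reduction corresponds to minimizing the acquisition term alone subject to keeping the classification ability of the data, whereas minimal cost feature selection and cost-sensitive decision tree construction minimize the combined objective.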
Keywords/Search Tags: cost-sensitive learning, rough sets, attribute reduction, decision tree, pruning technique