
Research On The Improvement Of Decision Tree Reduced Error Pruning Algorithm

Posted on: 2021-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Lu
Full Text: PDF
GTID: 2428330602984004
Subject: Statistics

Abstract/Summary:
The decision tree is a basic method for classification and regression problems; this paper focuses on classification trees. A decision tree is a tree-structured model that starts from a root node and grows into a large tree with many nodes, following the "divide and conquer" principle. In a classification problem it represents the process of classifying samples according to their feature variables: any data point in the specified feature space can be routed step by step from the root node down through the child nodes until it reaches a leaf node, and that leaf gives the classification of the point. Decision tree models are widely applied to classification problems because of several advantages: simple computation, easily interpretable results, insensitivity to missing values, and the ability to handle irrelevant feature variables.

Constructing a decision tree normally involves three steps: feature selection, growing the tree, and pruning. According to the criterion used for feature selection, decision trees can be categorized into three classes: ID3, C4.5, and CART; in practice the differences among them are almost negligible. Decision trees tend to become large and complicated when applied to practical problems, yet experience shows that very large trees perform poorly in both interpretability and predictive accuracy, which indicates overfitting. It is therefore necessary to prune the tree to obtain a better model. Pruning simplifies the decision tree, improves generalization performance, and avoids overfitting the training set, making it an important research topic in decision tree learning. Research shows that when the data contain noise, pruning can improve the generalization performance of decision trees by as much as 25%. Popular pruning algorithms include reduced error pruning, cost complexity pruning, pessimistic error pruning, and minimum error pruning.

In this paper, an improved scheme for the REP (reduced error pruning) algorithm is proposed. On top of the original REP algorithm, the calculation of a chi-square statistic and a multiple-test correction are added to the pruning decision, and the method is compared with existing common pruning algorithms on both simulated data and real data sets. The results show that the improved REP pruning algorithm better improves the generalization performance of the decision tree without substantially increasing the cost of computation or data storage.
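The abstract does not include the algorithm itself, so the following is a minimal, illustrative Python sketch of how a chi-square significance test with a multiple-test correction could be grafted onto standard reduced error pruning. The class and function names (TreeNode, rep_prune_with_chi2), the use of a Bonferroni-style correction, and the rule "do not prune a node whose split is significantly associated with the labels" are assumptions made for illustration; the thesis's actual procedure may differ.

```python
# Sketch: reduced error pruning (REP) on a validation set, extended with a
# chi-square check and a Bonferroni-style multiple-test correction.
# All names and the exact pruning rule are illustrative assumptions.
from scipy.stats import chi2_contingency


class TreeNode:
    """Binary classification tree node; a node with no children is a leaf.
    majority_class is assumed to be precomputed from the training data."""
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 majority_class=None):
        self.feature = feature            # index of the splitting feature
        self.threshold = threshold        # x[feature] <= threshold goes left
        self.left, self.right = left, right
        self.majority_class = majority_class

    def is_leaf(self):
        return self.left is None and self.right is None

    def predict(self, x):
        if self.is_leaf():
            return self.majority_class
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict(x)


def rep_prune_with_chi2(node, X_val, y_val, alpha=0.05, n_tests=1):
    """Bottom-up REP: replace a subtree with a leaf only if doing so does not
    increase validation error AND the split shows no significant association
    with the labels (chi-square test at level alpha / n_tests)."""
    if node.is_leaf() or not X_val:
        return node

    # Route the validation samples to the two children and prune them first.
    left_set = {i for i, x in enumerate(X_val) if x[node.feature] <= node.threshold}
    left_idx = sorted(left_set)
    right_idx = [i for i in range(len(X_val)) if i not in left_set]
    node.left = rep_prune_with_chi2(node.left, [X_val[i] for i in left_idx],
                                    [y_val[i] for i in left_idx], alpha, n_tests)
    node.right = rep_prune_with_chi2(node.right, [X_val[i] for i in right_idx],
                                     [y_val[i] for i in right_idx], alpha, n_tests)

    # Standard REP criterion: compare subtree error with error as a leaf.
    subtree_err = sum(node.predict(x) != y for x, y in zip(X_val, y_val))
    leaf_err = sum(y != node.majority_class for y in y_val)
    if leaf_err > subtree_err:
        return node                       # pruning would hurt: keep the subtree

    # Added criterion: 2 x C contingency table (branch vs. class label).
    classes = sorted(set(y_val))
    table = [[sum(y_val[i] == c for i in idx) for c in classes]
             for idx in (left_idx, right_idx)]
    rows_ok = all(sum(row) > 0 for row in table)
    cols_ok = all(sum(col) > 0 for col in zip(*table))
    if rows_ok and cols_ok and len(classes) > 1:
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha / n_tests:     # Bonferroni-corrected threshold
            return node                   # split is significant: do not prune

    # Replace the subtree with a leaf labelled by the training majority class.
    return TreeNode(majority_class=node.majority_class)
```

In this sketch n_tests would typically be set to the number of internal nodes of the unpruned tree, so that the family-wise error rate of the chi-square tests is controlled; other corrections (e.g., Benjamini-Hochberg) could be substituted without changing the overall structure.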
Keywords/Search Tags: decision tree, pruning, reduced error, chi-square statistic, multiple test, overfitting, generalization performance