
Research On Decision Tree Induction And Pruning Algorithm

Posted on: 2008-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: L M Wang
Full Text: PDF
GTID: 2178360215973867
Subject: Computer application technology
Abstract/Summary:
Data mining is one of the most significant fields of information technology; it combines theories and techniques from databases, artificial intelligence, machine learning, and statistics. Classification is among the most active and mature topics in data mining because it has been applied successfully in business analysis. At present there are many classification techniques, such as decision tree induction, association rules, Bayesian networks, neural networks, rough sets, and statistical models. Among these, the decision tree method has been widely researched and applied for its high predictive accuracy and execution speed; moreover, a decision tree can easily be transformed into "If-Then" classification rules.

This thesis mainly demonstrates how to construct a decision tree by learning from a training set and how to solve the common problems that arise in doing so. The basic theory and the learning process are discussed, covering ID3 and its improved versions C4.5, SLIQ, and SPRINT, and the algorithms are compared analytically in terms of their respective advantages and disadvantages. Combined with data warehousing and other data mining techniques, the application of decision trees can be extended to a wide range of domains.

On the other hand, learning a decision tree from a given training set may not yield the tree with the best generalization performance. Noise in the training set can cause the decision tree to over-fit it and reduce classification accuracy. Moreover, the algorithm may make decisions near the leaves based on very little data, which may not reflect reliable trends in the training data. Generally, pre-pruning and post-pruning methods are exploited to avoid over-fitting. There are five methods for post-pruning: Reduced Error Pruning (REP), Pessimistic Error Pruning (PEP), Minimum Error Pruning (MEP), Cost-Complexity Pruning (CCP), and Rule Post-Pruning.
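The core of ID3-style induction mentioned above is choosing, at each node, the attribute with the highest information gain (the reduction in label entropy after splitting). The following is a minimal sketch of that criterion; the toy weather-style dataset and function names are illustrative, not taken from the thesis:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting `rows` on the attribute at `attr_index`."""
    total = len(labels)
    splits = {}  # attribute value -> labels of the rows falling in that branch
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(sub) / total * entropy(sub) for sub in splits.values())
    return entropy(labels) - weighted

# Toy data: each row is (outlook, windy); labels say whether to play.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]

# Splitting on outlook separates the classes perfectly (gain 1.0 bit),
# while splitting on windy tells us nothing (gain 0.0), so ID3 picks outlook.
gain_outlook = information_gain(rows, labels, 0)
gain_windy = information_gain(rows, labels, 1)
```

ID3 applies this greedily: it splits on the best-scoring attribute, then recurses on each branch, which is exactly why the resulting trees can grow large and over-fit, motivating the pruning methods discussed next.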
To select a suitable pruning method, these pruning methods are compared in terms of computational complexity, error estimation, and theoretical principle. Because the tree induction algorithm is greedy, decision trees tend to over-fit the training data, grow to large sizes, and induce long classification rules. To improve predictive accuracy, the thesis briefly introduces some methods for optimizing decision trees.

Finally, the thesis analyzes questions around decision tree pruning using See5.0, including node counts, classification error rates, and predictive accuracy. The empirical data indicate that the pruned tree partially eliminates over-fitting and is generally easier to understand; moreover, the size of the tree decreases markedly and predictive accuracy on test sets increases after pruning. When the training set is small, the PEP method shows higher accuracy than the other methods. As the amount of training data increases, the classification performance of the tree produced by CCP is no worse than that produced by REP. In practice, REP is preferable when training data is abundant, while PEP is a good choice when training data is limited but high accuracy is expected.
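Of the post-pruning methods compared above, REP is the simplest to state: walk the tree bottom-up and replace a subtree with a leaf (labeled with its majority class) whenever accuracy on a held-out validation set does not drop. A minimal sketch, assuming a simple hand-built `Node` structure rather than the thesis's See5.0 trees:

```python
class Node:
    """One decision-tree node; a node with no children acts as a leaf."""
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                  # attribute index tested at this node
        self.children = children or {}    # attribute value -> child Node
        self.label = label                # majority class seen at this node

    def predict(self, row):
        if not self.children:
            return self.label
        child = self.children.get(row[self.attr])
        return child.predict(row) if child else self.label

def accuracy(tree, rows, labels):
    return sum(tree.predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def reduced_error_prune(node, tree, rows, labels):
    """Bottom-up REP: collapse a subtree to a leaf if validation accuracy does not drop."""
    for child in node.children.values():
        reduced_error_prune(child, tree, rows, labels)
    if node.children:
        before = accuracy(tree, rows, labels)
        saved, node.children = node.children, {}     # tentatively prune
        if accuracy(tree, rows, labels) < before:
            node.children = saved                    # pruning hurt; restore subtree

# Tiny example: the "b" branch fits noise, so validation data favors pruning.
root = Node(attr=0,
            children={"a": Node(label="yes"), "b": Node(label="no")},
            label="yes")
val_rows, val_labels = [("a",), ("b",)], ["yes", "yes"]
reduced_error_prune(root, root, val_rows, val_labels)   # collapses root to a leaf
```

This also shows why REP needs abundant data, as the thesis concludes: part of the training data must be withheld for validation, which PEP avoids by estimating error pessimistically from the training set alone.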
Keywords/Search Tags: Classification, Decision tree, ID3 algorithm, Over-fitting, Post-pruning