
A System Of Decision Trees Construction

Posted on: 2004-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Q Zhang
Full Text: PDF
GTID: 2168360092496992
Subject: Computer applications
Abstract/Summary:
Classification is an important problem in data mining. In classification, we are given a set of example records, called a training set, where each record consists of several fields or attributes. One of the attributes, called the class label, indicates the class to which each example belongs. The objective of classification is to build a model of the classifying attribute based upon the other attributes. Once a model is built, it can be used to determine the class of future unclassified records. The decision tree is one of the most popular classification tools. A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. To classify a record with a decision tree model, we follow a path from the root to a leaf by evaluating the attribute tests along the way; the class label on that leaf indicates the class to which the record belongs.

The present study intends to set up an integrated system for decision tree construction. The system comprises five phases: the data-preprocessing phase, the tree-growing phase, the tree-pruning phase, the tree analysis and evaluation phase, and the rule-extraction phase. In the data-preprocessing phase, the work mainly involves data cleaning to reduce noise and handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing data to higher-level concepts or normalizing the data. In the tree-growing phase, we recursively evaluate each attribute with attribute-selection measures, choose the best split attribute and splitting value, and obtain a fully grown decision tree. We have applied three measures: information gain, the Gini index, and the Relief method, and compared their performance in the experiments. In the tree-pruning phase, to prevent over-fitting and improve accuracy, we must prune the tree grown fully in the tree-growing phase. We choose the Minimum Description Length (MDL) algorithm, because MDL pruning achieves good accuracy, small trees, and fast execution times. In the tree analysis and evaluation phase, we estimate the performance of the generated decision tree in several ways; the system applies the holdout method and 10-fold cross-validation, which mainly estimate the accuracy of the model. In the rule-extraction phase, the model can easily be converted into classification IF-THEN rules.

The experimental results show that the information gain measure and the Gini index are similar in both running speed and the number of generated nodes, and both are faster than the Relief algorithm; however, the Relief algorithm performs better than the other two on some special data. We have applied the system to a revenue information system and obtained satisfactory results. The aim of this study is to construct a decision tree construction system that is compact, comprehensible, scalable, and has a low error rate. The system still has limitations in running efficiency, integration with data warehouses, and analysis of complex data; we will keep improving its performance in the future.
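As an illustration of the two standard attribute-selection measures named above, the following minimal Python sketch (not taken from the thesis; the function names and the small example split are hypothetical) computes information gain and the Gini index for a candidate split of a labelled data set:

```python
# Illustrative sketch, not the thesis implementation: information gain and
# Gini index for one candidate split of a set of class labels.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """Entropy reduction achieved by splitting parent_labels into partitions."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

def gini_index(parent_labels, partitions):
    """Weighted Gini impurity of the partitions (lower is better)."""
    n = len(parent_labels)
    return sum(len(p) / n * gini(p) for p in partitions)

# Example: splitting 10 records on a binary attribute.
parent = ["yes"] * 6 + ["no"] * 4
left = ["yes"] * 5 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3
print(information_gain(parent, [left, right]))  # ~0.256
print(gini_index(parent, [left, right]))        # ~0.317
```

The tree-growing phase would evaluate such scores for every candidate attribute (and splitting value) and pick the split with the highest gain or lowest weighted impurity.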
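The holdout and 10-fold cross-validation estimates mentioned in the evaluation phase can be sketched as follows; this is an assumption-laden illustration, with a hypothetical train(records, labels) -> model interface where the model exposes classify(record):

```python
# Illustrative sketch, not the thesis implementation: accuracy estimation by
# the holdout method and k-fold cross-validation. `train` and `classify`
# are hypothetical placeholders for the tree-construction and lookup steps.
import random

def holdout_accuracy(records, labels, train, test_fraction=1/3, seed=0):
    """Reserve a fraction of the data for testing; report accuracy on it."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:cut], idx[cut:]
    model = train([records[i] for i in train_idx], [labels[i] for i in train_idx])
    hits = sum(model.classify(records[i]) == labels[i] for i in test_idx)
    return hits / len(test_idx)

def cross_validation_accuracy(records, labels, train, k=10, seed=0):
    """Average accuracy over k folds, each fold used once as the test set."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        model = train([records[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(model.classify(records[i]) == labels[i] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / k
```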
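For the rule-extraction phase, converting the tree to IF-THEN rules amounts to collecting every root-to-leaf path. The sketch below uses a deliberately simplified, hypothetical Node structure and is not drawn from the thesis:

```python
# Illustrative sketch, not the thesis implementation: one IF-THEN rule per
# root-to-leaf path of a simplified decision tree.
from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str = None                           # test attribute (internal node)
    children: dict = field(default_factory=dict)    # outcome -> child Node
    label: str = None                               # class label (leaf node)

def extract_rules(node, conditions=()):
    """Return one 'IF cond AND ... THEN class' rule per leaf."""
    if node.label is not None:                      # leaf: emit a rule
        body = " AND ".join(conditions) or "TRUE"
        return [f"IF {body} THEN class = {node.label}"]
    rules = []
    for outcome, child in node.children.items():    # internal node: follow each branch
        cond = f"{node.attribute} = {outcome}"
        rules.extend(extract_rules(child, conditions + (cond,)))
    return rules

# Example: a tiny tree splitting on 'income', then on 'student'.
tree = Node(attribute="income", children={
    "high": Node(label="no"),
    "low": Node(attribute="student", children={
        "yes": Node(label="yes"),
        "no": Node(label="no"),
    }),
})
for rule in extract_rules(tree):
    print(rule)
```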
Keywords/Search Tags:classification, decision tree, feature selection, tree pruning