Font Size: a A A

Research Of Decision Tree Algorithm Based On Rough Set Theory

Posted on:2009-10-02Degree:MasterType:Thesis
Country:ChinaCandidate:M L LiuFull Text:PDF
GTID:2178360242481573Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data Mining, also known as the"Knowledge Discovery"from database, is always used to extract hidden, unknown, and potential valuable knowledge from database or data warehouses using the technology of database, machine learning, artificial intelligence or statistics. It serves for the people's decision-making and provides strong foundation and support for the development of science. In the field of data mining, classification is an important technology, it carries out analysis and study on large amount of data and build fields related models. At present, there have been all kinds of classification models. Decision tree algorithm has been widely focused and applied because it is efficient, high-speed, easy-to-understand. This paper is also to focus on decision tree algorithm.Rough set theory is an intelligent data analysis tool for processing inaccurate, uncertain and incomplete information. It treats knowledge as the partition of data and every partition of set as concept. It approximately depicted the uncertain or inaccurate knowledge by the known. The research or rough set theory is one of active fields in information science. This paper combine rough set theory and decision tree algorithm design a standard of attribute selection and pre-prune the decision tree through adjusting the value of threshold. For the currency of data mining result model, we do some research on data mining standards, design and develop a plug-in based data mining platform. Depending on this platform, we test and compare the performance of algorithm and get the data mining models.This paper first reviews the background of data mining in chapter one and two, summarizes the development history of decision tree algorithm, analyses its future. Then details the related theory of decision tree algorithm: attribute selection standards, pruning strategies and evaluating criteria of classification models. At last, we list some classical decision tree algorithms, analyze and evaluate them.In chapter three we bring forward a new decision tree constructing method. First, we introduce the related concepts on rough set theory, leading to the definition of attribute approximate precision. Approximate precision is the ratio of up approximate set and down approximate set and describes the certain degree of condition attribute to decision attribute, the larger, the more. Because of the number of every equivalent category in decision attribute is different, we take the proportion of the equivalent category in sample data as coefficient of approximate precision. In order to eliminate the tendency of more value attribute, taking into account the subsequent classification factor, divide the logarithm of the condition attribute categories count. All these construct the metrics of attribute choice standards.Because of rough set theory does not contain a mechanism for dealing with noise data, it may not fit the practical problems simply by the approximate precision. For weakening the impact of noise data and isolating point for the bound, we extend the approximate precision to variable approximate precision using variable precision rough set theory. We set a threshold to regulate their impact. We also give the concept of variable precision positive region. When all the data of splitting node belong to its variable precision positive region, stop splitting the current node in order to avoid excessive data simulation and reduce the impact of noisy data.We implement the VAPDT plug-in based on the data mining platform discussed in chapter four. In order to test the performance of algorithm, we compare our method with information gain ratio method using the data from UCI. The experiments result show that our method have similar or better classification correct result in most cases. When the count of sample data is large or the sample data contains noisy data, we can adjust the threshold to weaken their effect. In the experiments, the threshold is set 0, 0.05, 0.10, 0.15, 0.20. Although different data sets have different threshold to make them have the best classification correct rate, in most case the threshold of 0.05 has a good experiment result.In chapter four we mainly introduce the knowledge about data mining standards, design and implement the PMML standard interface plug-in and some other function plug-ins. The standardization of data mining tools is the developing trend of future. At present, there are mainly four types of data mining standards: process standards, interface standards, language standards and network standard. PMML standard, as the definition language of data mining result model, solves the problem that exchanges models between heterogeneous data mining tools and deploys result models. It has been widely used now. We design and implement data mining platform based on Eclipse plug-in technique according to the PMML standard. I mainly participate in designing the framework and implementing the PMML standard interface plug-in, database operation interface plug-in, data source import plug-in and the decision tree algorithm plug-in.Decision tree algorithm based on rough set still has some issues that require further study. In the last, we give a summary of our work and the direction of our next study step:First, our algorithm is based on upper and lower approximate set, so it can only deal with the discrete sample data, how to extend our algorithm to adapt it to continuous data or how to discrete continuous data is the focus of our future research.Second, we pre-prune the decision tree in the process of constructing using the variable precision positive region. Prune the tree through adjusting the threshold. The effect of threshold for the tree needed to further study, at the same time we can consider combining the pre-pruning algorithm with after-pruning algorithm to control the size and correct rate of the tree.Third, we have applied the PMML standard to the preservation of our data mining result models in this paper. We should continue to apply other standards (such as interface standard, process standard) to our platform in order to achieve the openness of our system.
Keywords/Search Tags:Algorithm
PDF Full Text Request
Related items