
The Study On KDD Technologies Based On Rough Set Theory

Posted on: 2004-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhao
Full Text: PDF
GTID: 1118360095456605
Subject: Computer software and theory
Abstract/Summary:
Although research on technologies for Knowledge Discovery in Databases (KDD) has already produced rich results, developing new KDD technologies remains necessary in practice. Rough set theory is among the most promising theories and tools that have been successfully applied to specific problems in the KDD process. In principle, intelligent data analysis based on the rough set model can be carried out without any extra parameters or external knowledge, which gives rough set theory a distinct advantage over other theories and tools. Further developing rough-set-based KDD technologies therefore promises more effective KDD solutions.

In the data integration stage of KDD, data discretization is one of the most important tasks. Effective discretization can markedly improve a system's ability to cluster instances and also makes the system more robust to data noise. Rough set theory has been applied successfully to data discretization. Building on the typical heuristic framework of rough-set-based discretization, this dissertation makes several further contributions. First, a new method for computing the candidate cut set of a learning system is proposed: compared with analogous traditional algorithms, it produces candidate cut sets of much smaller cardinality while still preserving the discernibility relation of the system. Next, metrics for heuristically estimating the relative importance of candidate cuts are studied on the basis of the "cut discernibility matrix". When measuring the importance of a candidate cut, the characteristics of both the columns and the rows of this matrix should be taken into account; notably, their contributions to cut importance are far from balanced, with the row characteristics contributing much less.
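The candidate-cut step can be illustrated with a standard boundary-cut construction, a common baseline in rough-set discretization: a midpoint between two adjacent attribute values is kept only if it can separate examples of different decision classes, which already shrinks the candidate set without losing discernibility. This is a generic sketch under that assumption, not the dissertation's exact algorithm; the function name and data layout are illustrative.

```python
# Candidate cut generation for one numeric attribute: a midpoint
# between two adjacent attribute values becomes a candidate cut only
# if the examples around it can carry different decision classes,
# i.e. only if the cut can contribute to discernibility.

from collections import defaultdict

def candidate_cuts(values, decisions):
    """values: numeric attribute values; decisions: class labels."""
    # Collect the set of decision classes observed at each value.
    classes_at = defaultdict(set)
    for v, d in zip(values, decisions):
        classes_at[v].add(d)
    cuts = []
    sorted_vals = sorted(classes_at)
    for lo, hi in zip(sorted_vals, sorted_vals[1:]):
        # Skip the midpoint when both neighbouring value groups are
        # labelled with one and the same single class.
        if classes_at[lo] != classes_at[hi] or len(classes_at[lo]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts

print(candidate_cuts([1, 2, 3, 4], ['a', 'a', 'b', 'b']))  # [2.5]
```

Here only the cut between values 2 and 3 survives, since the cuts at 1.5 and 3.5 cannot separate any pair of differently labelled examples.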
A new concept, "Cut Selection Possibility", is then defined to measure the importance of candidate cuts effectively. Cut Selection Possibility is not only physically meaningful; it also fully accounts for the difference between the column and row characteristics of the matrix and combines them in a balanced, reasonable way. Finally, an approach based on Cut Selection Possibility is proposed for deriving the result cut set from the candidate set. For real-life databases, theoretical analyses and simulation experiments show that the proposed approach solves the data discretization problem efficiently and effectively.

In the data integration stage of KDD, feature subset selection is another of the most important tasks. It not only reduces the data scale of a system but also removes redundant information, thereby exposing and strengthening the potential data relations in the system; consequently, it contributes greatly to the application performance of the data mining results. Existing feature selection technologies are first examined in depth. The notion of "System Entropy" is then defined, and the influence of a feature on System Entropy is used to heuristically measure relative feature importance. System Entropy effectively overcomes the limitations of "Conditional Entropy", another notion based on information theory and rough set theory: it can measure the relative importance not only of useful features but also of redundant ones, and its computation is much simpler. Some algebraic characteristics of System Entropy are disclosed, and its intrinsic value biases are studied. After these biases are effectively counteracted, the concept of "Feature Significance" is defined on the basis of System Entropy.
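The abstract does not reproduce the System Entropy formula, so as a reference point the following sketch computes the Conditional Entropy H(D | C) that System Entropy is contrasted with, using equivalence classes of the rough-set indiscernibility relation IND(C). The function names and the dictionary row representation are assumptions for illustration.

```python
import math
from collections import defaultdict

def partition(rows, attrs):
    """Equivalence classes of the indiscernibility relation IND(attrs)."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def conditional_entropy(rows, cond_attrs, dec_attr):
    """H(D | C): the information-theoretic measure used in
    rough-set feature selection, computed over equivalence classes."""
    n = len(rows)
    h = 0.0
    for block in partition(rows, cond_attrs):
        # Distribution of decision values inside one C-equivalence class.
        counts = defaultdict(int)
        for i in block:
            counts[rows[i][dec_attr]] += 1
        for c in counts.values():
            p_joint = c / n          # P(C-block and decision value)
            p_cond = c / len(block)  # P(decision value | C-block)
            h -= p_joint * math.log2(p_cond)
    return h

rows = [{'a': 0, 'd': 0}, {'a': 0, 'd': 1},
        {'a': 1, 'd': 1}, {'a': 1, 'd': 1}]
print(conditional_entropy(rows, ['a'], 'd'))  # 0.5
```

A feature's importance can then be estimated by how much H(D | C) rises when the feature is removed from C; the abstract's point is that this measure assigns zero importance to all redundant features, which System Entropy is designed to avoid.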
Feature Significance serves as the heuristic measure of feature importance in a newly proposed feature selection algorithm that selects feature subsets in the typical "backward elimination" manner. Algorithm analyses and simulati...
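The backward-elimination scheme mentioned above can be sketched generically: start from the full feature set and repeatedly drop the feature whose removal hurts subset quality least, as long as quality does not fall below the full-set baseline. The `quality` callback is a placeholder for a measure such as the dissertation's Feature-Significance-based criterion; this is a minimal sketch, not the proposed algorithm itself.

```python
# Generic backward-elimination feature selection: greedily remove the
# least harmful feature while the subset's quality stays within `tol`
# of the quality of the full feature set.

def backward_eliminate(features, quality, tol=0.0):
    """features: list of feature names;
    quality(subset) -> float, higher is better."""
    selected = list(features)
    base = quality(selected)  # quality of the full feature set
    improved = True
    while improved and len(selected) > 1:
        improved = False
        # Find the feature whose absence hurts quality least.
        best_f, best_q = None, None
        for f in selected:
            q = quality([g for g in selected if g != f])
            if best_q is None or q > best_q:
                best_f, best_q = f, q
        if best_q is not None and base - best_q <= tol:
            selected.remove(best_f)
            improved = True
    return selected

# Toy quality: fraction of the truly relevant features {'a', 'b'} kept.
result = backward_eliminate(['a', 'b', 'c', 'd'],
                            lambda s: len(set(s) & {'a', 'b'}) / 2)
print(result)  # ['a', 'b']
```

The redundant features 'c' and 'd' are eliminated because dropping them leaves the quality unchanged, while dropping 'a' or 'b' would reduce it below the baseline.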
Keywords/Search Tags: Rough Set Theory, Data Discretization, Feature Subset Selection, System Uncertainty Measure, Decision Rule Induction