A Study On Rough Set Theory And Discretization Of Real Value Attributes

Posted on: 2009-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Sang
Full Text: PDF
GTID: 2178360275961086
Subject: Computer software and theory
Abstract/Summary:
Traditional rough set theory can only handle discrete attributes, so continuous attributes must be discretized when they are present in a database. Discretization of continuous feature values is an effective technique for handling continuous attributes in machine learning and data mining, and it contributes substantially to the subsequent learning or mining process. The reasonableness of a discretization process is determined by how accurately it expresses and extracts information. The Chi2 family of algorithms are well-known discretization algorithms based on probability and statistics, and the related algorithms based on Class-Attribute Interdependency are well-known discretization algorithms based on information theory. Discretization of real value attributes has important uses in many areas, such as artificial intelligence and machine learning.

First, by analyzing the Chi2 family of algorithms, we propose a new algorithm, Imp-Chi2, which is based on attribute significance. The algorithm adjusts the order in which attributes are discretized according to their significance and discretizes the real value attributes accurately. For the experiments, we present a method of selecting the training set according to class proportion, which largely overcomes the poorly distributed samples that random selection of a training set can produce.

Second, algorithms related to the Chi2 algorithm are analyzed and their drawbacks are pointed out. Based on this analysis, a new modified algorithm called the Rectified Chi2 algorithm is proposed. The new algorithm uses a new merging standard as the basis for interval merging and discretizes the real value attributes accurately and reasonably.
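As background for the merging standards discussed here, the following is a minimal sketch of the conventional chi-square statistic that ChiMerge-style algorithms compute for two adjacent intervals. It is not the Imp-Chi2 or Rectified Chi2 criterion, whose definitions the abstract does not give. The function names and the `zero_expected` parameter are illustrative assumptions; replacing a zero expected frequency with 0.1 follows a commonly cited convention for this family of algorithms.

```python
from typing import List, Sequence


def chi2_adjacent(a: Sequence[int], b: Sequence[int],
                  zero_expected: float = 0.1) -> float:
    """Chi-square statistic for two adjacent intervals.

    a[j] and b[j] are the numbers of samples of class j that fall in the
    first and second interval. A small value means the class distributions
    of the two intervals are similar, so they are candidates for merging.
    """
    if len(a) != len(b):
        raise ValueError("both intervals need one count per class")
    rows = [a, b]
    row_totals = [sum(a), sum(b)]
    col_totals = [x + y for x, y in zip(a, b)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n if n else 0.0
            if expected == 0:
                # Assumed convention: avoid division by zero.
                expected = zero_expected
            stat += (observed - expected) ** 2 / expected
    return stat


def most_similar_pair(intervals: List[Sequence[int]]) -> int:
    """Index i of the adjacent pair (i, i + 1) with the smallest statistic."""
    scores = [chi2_adjacent(intervals[i], intervals[i + 1])
              for i in range(len(intervals) - 1)]
    return min(range(len(scores)), key=scores.__getitem__)
```

For example, `most_similar_pair([[10, 0], [9, 1], [0, 10]])` selects the first pair, because the first two intervals have nearly identical class distributions.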
To address the problem that all algorithms in the Chi2 family adopt only the maximal difference as the standard for interval merging, a difference sequence method is proposed, which experiments show performs better than the Chi2 family of algorithms.

Third, the Chi2 family of algorithms (including the Modified Chi2 algorithm and the Extended Chi2 algorithm, the most recent algorithm in this domain) are studied and their drawbacks are pointed out. Based on this analysis, a new modified algorithm based on interval similarity is proposed. The new algorithm defines an interval similarity function, which serves as a new merging standard in the discretization process. Two important parameters appear in the function: the condition parameter α and the tiny move parameter c, which respectively reflect balance in the discretization process and the extent of the discrepancy in number between two adjacent intervals. In addition, two important prescriptions are given in the algorithm. The new algorithm applies a fair standard and discretizes the real value attributes accurately and reasonably; it not only inherits the logical aspects of the χ² statistic but also resolves the problems found in the algorithms related to the Chi2 algorithm.

Finally, the approaches taken by the related algorithms based on Class-Attribute Interdependency are studied systematically, and a new Class-Attribute Interdependency (CAI) discretization algorithm for real value attributes, based on rough set theory and mutual information, is proposed. The new algorithm redefines the discretization criterion as NCAIC (New Class-Attribute Interdependency Criterion), which considers the data distribution and the interdependency between all classes and attributes, and treats the upper approximation concept of rough set theory as an important part of the criterion.
The Class-Attribute Mutual Information automatically controls and adjusts the extent of discretization of continuous feature values, so that real value attributes are discretized accurately and reasonably.
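To make the class-attribute mutual information concrete, here is a minimal sketch of the standard mutual information between the class variable and a discretized attribute, computed from a table of counts (classes by intervals). It is not the NCAIC criterion, which the abstract does not define; the function name and the table layout are illustrative assumptions.

```python
import math
from typing import Sequence


def class_attribute_mutual_information(
        quanta: Sequence[Sequence[int]]) -> float:
    """Mutual information (in bits) between class and interval.

    quanta[i][r] is the number of samples of class i whose attribute value
    falls in interval r.
    """
    total = sum(sum(row) for row in quanta)
    if total == 0:
        return 0.0
    class_totals = [sum(row) for row in quanta]
    interval_totals = [sum(col) for col in zip(*quanta)]
    info = 0.0
    for i, row in enumerate(quanta):
        for r, count in enumerate(row):
            if count == 0:
                continue  # an empty cell contributes nothing
            p_joint = count / total
            p_independent = class_totals[i] * interval_totals[r] / total ** 2
            info += p_joint * math.log2(p_joint / p_independent)
    return info
```

A discretization whose intervals separate two equally frequent classes perfectly, such as `[[5, 0], [0, 5]]`, yields 1 bit, while intervals that carry no class information, such as `[[5, 5], [5, 5]]`, yield 0.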
Keywords/Search Tags: Discretization of real value attributes, Rough sets, Chi2 algorithm, Attribute significance, Selection of training set according to class proportion, Difference sequence, Interval similarity, Mutual Information, Class-Attribute Interdependency (CAI)