
An Improved Attribute Discretization Method

Posted on: 2008-04-19 | Degree: Master | Type: Thesis
Country: China | Candidate: H Z Wang | Full Text: PDF
GTID: 2178360242460244 | Subject: Software engineering
Abstract/Summary:
In the field of machine learning, the discretization of continuous attributes was long regarded as a marginal, auxiliary task. With the rapid development of knowledge discovery and data mining, however, it has attracted increasing attention in recent years. Many practical applications involve continuous attributes, yet many machine learning algorithms can only process discrete attributes: algorithms such as decision trees and association-rule mining are designed for discrete data, so continuous data must be discretized before these algorithms can be applied. The problem of discretization has therefore been studied widely and deeply, and many discretization methods have been proposed. Ordinary discretization methods simply partition the continuous data into intervals for the learning algorithm, regardless of the effect on the learning process. Such methods can be a hidden hazard for subsequent data mining tasks, because the discretization may discard key information. In short, the discretization of continuous attributes serves three purposes: to reduce the volume of the data, to make the data fit the requirements of the algorithms, and to improve learning ability, since discretization improves the learning ability of some machine learning algorithms to a certain extent.

In rough set theory, attribute discretization is one of the key issues. This paper first reviews rough set theory, proposed by the Polish professor Z. Pawlak, a mathematical tool for describing incomplete and imprecise information. The theory can effectively analyze all kinds of imperfect information that is imprecise, inconsistent, or incomplete, and can also analyze and reason about data, uncovering implicit knowledge and revealing latent rules. Its main idea is to describe imprecise and uncertain knowledge in terms of the known knowledge in an existing knowledge base. The paper introduces the rough set concepts relevant to the algorithm, namely the universe, knowledge, partition, knowledge base, equivalence relation, equivalence class, upper approximation, lower approximation, and the consistency of a decision table; these concepts form the theoretical basis of the later cut-point computation algorithm.

Second, the paper recounts the origin and development of information entropy and introduces its elementary concepts, namely self-information and information entropy, together with the basic properties of the entropy function: symmetry, non-negativity, the extremum property, additivity, and expansibility. This part is the foundation of our algorithm.
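To make these preliminaries concrete, the following minimal Python sketch (our own illustration, not code from the thesis; the toy decision table and function names are hypothetical) computes the equivalence classes induced by a set of condition attributes, the lower and upper approximations of a decision class, and the Shannon entropy of a label distribution:

```python
from collections import defaultdict
from math import log2

def equivalence_classes(objects, attrs):
    """Partition objects into equivalence classes of the
    indiscernibility relation induced by the given attributes."""
    classes = defaultdict(list)
    for name, row in objects.items():
        classes[tuple(row[a] for a in attrs)].append(name)
    return [set(c) for c in classes.values()]

def lower_upper_approximation(classes, target):
    """Lower approximation: union of classes fully inside the target set.
    Upper approximation: union of classes that intersect the target set."""
    lower, upper = set(), set()
    for c in classes:
        if c <= target:
            lower |= c
        if c & target:
            upper |= c
    return lower, upper

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2 p_i) of a label sequence;
    the self-information of an outcome with probability p is -log2 p."""
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in counts.values())

# Toy decision table: one condition attribute "a" and a decision "d".
table = {
    "x1": {"a": "low",  "d": "yes"},
    "x2": {"a": "low",  "d": "yes"},
    "x3": {"a": "high", "d": "no"},
    "x4": {"a": "high", "d": "yes"},
}
classes = equivalence_classes(table, ["a"])
target = {x for x, row in table.items() if row["d"] == "yes"}
print(lower_upper_approximation(classes, target))  # lower {'x1','x2'}, upper = all four objects
print(entropy([row["d"] for row in table.values()]))  # ~0.811 bits for {3 yes, 1 no}
```

The lower approximation collects the objects that certainly belong to the decision class and the upper approximation those that possibly belong to it; the gap between the two is exactly the imprecision that rough set theory describes.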
Finally, the paper describes the classes of discretization methods for continuous attributes within rough set theory. One class simply adopts discretization methods from other fields. These methods do not take the particular character of rough set theory into account, and using them to discretize a decision table can easily degrade the consistency of the table and the generalization ability of the extracted rules. The other class proposes new discretization methods designed around the characteristics of rough sets. These fall into two kinds: one kind preserves the consistency of the decision table during discretization, while the other considers only the regularities in the data and does not treat preserving the consistency of the decision table as a criterion, which yields smaller cut-point sets. The essential ideas of several typical discretization algorithms are then reviewed: equal-width partition, equal-frequency partition, the NaiveScaler and SemiNaiveScaler algorithms, and discretization algorithms based on Boolean logic and rough logic.

Building on this analysis of existing attribute discretization methods, the paper puts forward an algorithm for computing the cut-point set. The algorithm preserves the discernibility of the decision table while reducing the size of the initial cut-point set. The cut information entropy is then defined and used as the measure of a cut point's importance. On this basis, an attribute discretization algorithm grounded in rough set theory and information entropy is proposed. The algorithm keeps the compatibility of the decision table unchanged and also handles mixed decision tables that contain both continuous and discrete attributes. Experiments test the effectiveness of the method and compare it with other methods; the results show that the method remains effective when the number of cut points grows very large.

The research in this paper naturally has limitations, for example in handling imperfect data and in scaling to very large data sets. Future work includes: (1) improving the algorithm so that it can handle incomplete data and is practical to apply; and (2) combining genetic algorithms with rough set theory to improve efficiency when dealing with large-scale data.
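The abstract does not reproduce the proposed algorithm itself, but the flavor of entropy-guided cut-point selection can be sketched. The Python sketch below is an illustration under standard assumptions (boundary-midpoint candidates and a greedy weighted-entropy criterion), not the thesis's exact cut information entropy or cut-point reduction:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence (0.0 for an empty sequence)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def candidate_cuts(values, labels):
    """Candidate cut points: midpoints between adjacent distinct values
    whose decision labels differ (boundary points)."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if v1 != v2 and y1 != y2]

def cut_entropy(values, labels, cut):
    """Weighted class entropy of the two intervals induced by a cut."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_cut(values, labels):
    """Greedily select the single cut with minimum weighted entropy."""
    cuts = candidate_cuts(values, labels)
    return min(cuts, key=lambda c: cut_entropy(values, labels, c)) if cuts else None

values = [1.0, 1.2, 2.8, 3.0, 3.1, 4.5]
labels = ["yes", "yes", "yes", "no", "no", "yes"]
print(best_cut(values, labels))  # 2.9, separating the initial run of "yes" from the rest
```

A complete discretizer would select cuts repeatedly until the discretized table retains the consistency of the original decision table; the thesis's contribution lies in the reduced initial cut-point set and in the cut-information-entropy measure used to rank the candidates.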
Keywords/Search Tags: Discretization