Research Of Data Mining Based On Rough Set

Posted on: 2011-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2178360305955228
Subject: Computer software and theory
Abstract/Summary:
In recent years, the rapid development of information technology (IT) has caused information resources to grow quickly, leaving large volumes of data that must be converted into useful information and knowledge. Data mining (DM) has therefore become an active research area. DM is the process of extracting implicit, useful, and ultimately understandable patterns or knowledge from large amounts of data. Rough set theory, first proposed by the Polish mathematician Zdzisław Pawlak in the early 1980s, is a mathematical tool for describing uncertainty and incompleteness in problems. Because of its distinctive advantages, it has been applied successfully in areas such as artificial intelligence, information systems analysis, knowledge discovery, and pattern recognition.
As a common DM method, rough sets play an important role in classification algorithms. Classification rules rest on the dependency between the condition attributes and the decision attribute. When data fall in the boundary region, the condition attributes and the decision attribute are only weakly dependent, so the decision attribute can be determined only with a certain probability. In that case we need to determine the minimal dependency and remove the redundant attributes. The core idea of rough sets is to derive classification rules through knowledge reduction while keeping the classification capability unchanged, so that a satisfactory approximation of the classification is obtained. For decision systems, however, finding the minimal reduct is an NP-hard problem, and the reduct obtained by starting from the core is not guaranteed to be minimal. The data may therefore still contain attributes that are unrelated to the decision attribute and that lower the precision of classification.
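To make the notions of boundary (border) region, core, and reduction concrete, the following Python sketch computes them on a small decision table. It is only an illustration under stated assumptions: the table, the attribute names `a`, `b`, `c`, `d`, and all function names are hypothetical, and the core is computed from the positive region rather than from the decision matrix that the thesis uses.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Split row indices into equivalence classes of rows that agree on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(rows, cond, dec):
    """Rows whose equivalence class under cond has a single decision value."""
    pos = set()
    for block in partition(rows, cond):
        if len({rows[i][dec] for i in block}) == 1:
            pos.update(block)
    return pos

def dependency(rows, cond, dec):
    """Degree of dependency: share of rows that can be classified with certainty."""
    return len(positive_region(rows, cond, dec)) / len(rows)

def core(rows, cond, dec):
    """Condition attributes whose removal shrinks the positive region."""
    full = positive_region(rows, cond, dec)
    return [a for a in cond
            if positive_region(rows, [c for c in cond if c != a], dec) != full]

# Hypothetical toy decision table. Rows 0 and 5 agree on a, b, c but differ
# on the decision d, so they lie in the boundary region.
TABLE = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
    {"a": 0, "b": 0, "c": 0, "d": "yes"},
]
COND = ["a", "b", "c"]

print(sorted(positive_region(TABLE, COND, "d")))  # [1, 2, 3, 4]
print(core(TABLE, COND, "d"))                     # ['b']
```

In this toy table the core is `['b']`, yet `b` alone does not preserve the positive region: either `a` or `c` has to be added back. Attributes outside the core are thus not automatically removable, and different ways of growing the core can yield different reducts.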
To address this problem, this thesis presents a method that removes further redundant attributes within an entropy-based rough-set classification algorithm, thereby improving the precision of classification. The further reduction relies on mutual information, which measures how much information one variable conveys about another. By setting a threshold on this value, we can remove the attributes that contribute little information about the decision attribute. Building on rough sets and mutual information, we propose two improved rough-set classification algorithms: I-RS (a rough-set classification algorithm based on mutual information) and BI-RS (a rough-set classification algorithm based on generalized mutual information).
In the I-RS algorithm, we derive the core of the decision table corresponding to the training set by constructing a decision matrix, and starting from the core we obtain a reduct of the training set by deleting redundant attributes. Then, using mutual information to measure how important each condition attribute in the decision table is to the decision attribute, we reduce the condition attributes further, and obtain the decision rules after attribute-value reduction. I-RS thus derives its reduct from the training set alone; when the test set is classified later, the same reduction is applied to it, and the rules are then used to predict the decision value of each test record.
BI-RS is a classification algorithm that works on the test data record by record. Like I-RS, it first reduces the training set through the core.
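The mutual-information filtering step of I-RS can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes the plain Shannon mutual information I(X;D) = H(X) + H(D) - H(X,D) estimated from frequencies in the training set, and the data, the attribute names `x`, `n`, `d`, the function names, and the threshold value 0.1 are all hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def filter_by_information(rows, cond, dec, threshold):
    """Keep the condition attributes whose I(attribute; decision) >= threshold."""
    ys = [r[dec] for r in rows]
    return [a for a in cond
            if mutual_information([r[a] for r in rows], ys) >= threshold]

# Hypothetical training set: x determines the decision d, n is independent of d.
ROWS = [
    {"x": 0, "n": 0, "d": 0},
    {"x": 0, "n": 1, "d": 0},
    {"x": 1, "n": 0, "d": 1},
    {"x": 1, "n": 1, "d": 1},
]
THRESHOLD = 0.1  # assumed value; the abstract does not state the threshold used

kept = filter_by_information(ROWS, ["x", "n"], "d", THRESHOLD)
print(kept)  # ['x']
```

Here `x` carries one full bit about the decision and is kept, while `n` carries none and is dropped. The same list of kept attributes would then be applied to the test set before the rules are matched.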
Unlike I-RS, BI-RS postpones the second reduction until an individual record is being classified. For each such record, we use generalized mutual information and the training data to determine how much information each of its attribute values carries about the decision attribute, remove the attributes that carry little information, apply the same reduction to the training set, and derive the rule. In short, I-RS reduces the attribute set as a whole from a macro perspective, while BI-RS performs a separate reduction for each record.
The first chapter of this thesis introduces the background and significance of the study and reviews the functions, methods, and classification algorithms of DM.
Chapter 2 explains the concepts and applications of rough sets, which are very important in DM, and covers knowledge classification, knowledge reduction, and decision-table reduction in rough set theory.
Chapter 3 introduces some concepts from information theory. Because rough-set classification algorithms cannot guarantee a minimal reduct of the decision table, we present the I-RS and BI-RS algorithms. Both perform a second reduction of the decision table and remove further redundant attributes, which reduces noise and interference in classification and improves its precision. The experimental results demonstrate the effectiveness of both algorithms. Across several tests, BI-RS achieves higher precision than I-RS, but because BI-RS must compute decision rules for every record, it is less efficient than I-RS. BI-RS is therefore better suited to data sets that are not large and that demand high precision.
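The per-record reduction that distinguishes BI-RS from I-RS can be illustrated with a short Python sketch. The abstract does not define generalized mutual information, so the sketch substitutes a stand-in measure: the Kullback-Leibler divergence between the decision distribution given one attribute value and the overall decision distribution. That choice, the majority-vote rule, the data, the names, and the threshold 0.2 are all assumptions made for illustration.

```python
import math
from collections import Counter

def value_information(rows, attr, value, dec):
    """Stand-in measure: KL divergence in bits between P(dec | attr == value) and P(dec)."""
    matching = [r[dec] for r in rows if r[attr] == value]
    if not matching:
        return 0.0  # value never seen in training: treated as uninformative
    prior = Counter(r[dec] for r in rows)
    posterior = Counter(matching)
    n, m = len(rows), len(matching)
    return sum((c / m) * math.log2((c / m) / (prior[d] / n))
               for d, c in posterior.items())

def reduce_for_record(rows, record, attrs, dec, threshold):
    """Attributes whose value in this particular record is informative enough."""
    return [a for a in attrs
            if value_information(rows, a, record[a], dec) >= threshold]

def classify(rows, record, attrs, dec, threshold):
    """Majority decision among training rows that match the record on kept attributes."""
    kept = reduce_for_record(rows, record, attrs, dec, threshold)
    votes = Counter(r[dec] for r in rows if all(r[a] == record[a] for a in kept))
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical training set.
TRAIN = [
    {"colour": "red",  "size": "small", "shape": "square", "cls": "yes"},
    {"colour": "blue", "size": "small", "shape": "square", "cls": "yes"},
    {"colour": "red",  "size": "large", "shape": "square", "cls": "yes"},
    {"colour": "blue", "size": "large", "shape": "round",  "cls": "yes"},
    {"colour": "red",  "size": "large", "shape": "round",  "cls": "no"},
    {"colour": "blue", "size": "large", "shape": "round",  "cls": "no"},
]
ATTRS = ["colour", "size", "shape"]
THRESHOLD = 0.2  # assumed value

rec_a = {"colour": "red", "size": "small", "shape": "square"}
rec_b = {"colour": "blue", "size": "large", "shape": "round"}
print(reduce_for_record(TRAIN, rec_a, ATTRS, "cls", THRESHOLD))  # ['size', 'shape']
print(reduce_for_record(TRAIN, rec_b, ATTRS, "cls", THRESHOLD))  # ['shape']
```

The attribute `size` is kept for the first record, where the value `small` is informative, and dropped for the second, where `large` is not. A global filter of the I-RS kind would have to make one decision about `size` for all records. The price is that the reduction and rule lookup are repeated for every record, which matches the efficiency trade-off reported above.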
Chapter 3 closes by discussing the importance of the choice of threshold in the algorithms.
Chapter 4 introduces an entropy-based method for discretizing continuous attributes, since BI-RS cannot handle continuous data directly. Because this method has little effect on the degree of compatibility (consistency) of the decision table, it works well in rough-set applications. Experimental analysis shows that continuous attributes discretized by information entropy perform better with BI-RS than with other algorithms.
The final chapter discusses the shortcomings of the algorithms and directions for future research.
In summary, building on the rough-set classification algorithm and the concept of mutual information, this thesis proposes two classification algorithms, I-RS and BI-RS. Tests of both algorithms gave good results: redundant attributes are removed effectively and classification rules are obtained, which makes data classification easier. This study provides a solid foundation for follow-up analysis.
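The entropy-based discretization described for Chapter 4 can be illustrated by its basic step, choosing the cut point on a continuous attribute that minimizes the weighted class entropy of the two resulting intervals. The abstract does not specify the exact procedure (for example, how many cuts are made or when splitting stops), so this Python sketch shows a single binary cut only; the data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut, weighted_entropy) for the midpoint cut with the lowest class entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

def discretize(values, cut):
    """Map each continuous value to interval 0 (<= cut) or 1 (> cut)."""
    return [0 if v <= cut else 1 for v in values]

# Hypothetical continuous attribute with a clean class boundary between 3 and 7.
VALUES = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
LABELS = ["low", "low", "low", "high", "high", "high"]

cut, score = best_cut(VALUES, LABELS)
print(cut, discretize(VALUES, cut))
```

Applying the same step recursively to each interval, with a stopping rule, would yield a multi-interval discretization; which stopping rule the thesis uses is not stated in the abstract.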
Keywords/Search Tags: Data mining, Rough set, Information entropy, Attribute reduction, Mutual information, Generalized mutual information