Research On Data Mining Algorithm Based On Rough Sets

Posted on: 2010-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: X J Wang
GTID: 2178360278497089
Subject: Computer application technology
Abstract/Summary:
With the widespread adoption of database technology, the volume of stored data is growing rapidly. Knowledge discovery and data mining were proposed to uncover the regularities and models that help people use these data for decision-making. Data mining is the most critical step in knowledge discovery and also its main technical difficulty, and it is a very active research area. Rough set theory, introduced by the Polish mathematician Z. Pawlak, is a powerful mathematical tool for analyzing uncertain and vague knowledge. As a new focus of research in artificial intelligence, rough sets can effectively handle the representation and inference of incomplete and uncertain knowledge. These features make rough set theory especially well suited to data mining, and its validity has been confirmed by successful applications in many scientific and engineering domains in recent years.

The decision tree is the most widely used model for classification. A univariate decision tree tests only a single attribute at each node, which causes the following problems: relations among attributes are ignored; some subtrees appear repeatedly in the tree; and some attributes are tested many times along a single path. To overcome these defects, multivariate learning methods have been proposed, which can test several attributes simultaneously at one node. Such methods construct new, more relevant attributes and revise or remove independent attributes. The key problem is the criterion for selecting the test attributes at a node. Preprocessing of massive data is also a critical technique.

This dissertation studies the theory of multivariate decision trees. The main research results are as follows:

1. A new concept, the similarity degree of attribute importance, is presented. Attribute importance is integrated as a weight into the traditional similarity formula. This overcomes the limitation of considering only the quantitative change in distance while ignoring attribute importance. Moreover, the measure agrees with practice and is simple to compute.

2. The data are preprocessed to make data mining more effective. Attributes are reduced with the classical discernibility-matrix reduction algorithm to lower the dimensionality. The similarity degree between every pair of data objects is calculated, and objects whose similarity degree exceeds a threshold are placed in one group. One object is selected from each group to form a new data sample, which reduces redundancy.

3. An attribute selection criterion based on the importance of attribute sets is proposed, with at most two attributes tested at each node. It overcomes the bias that traditional decision tree algorithms show when selecting test attributes. Computing time is reduced, the height of the decision tree is compressed, and the rules are more comprehensive.

4. The concept of relative generalization of one equivalence relation with respect to another is introduced and used to construct multivariate tests, which avoids overfitting the data.

Based on this work, a rough-set-based algorithm for constructing multivariate decision trees is put forward. The multivariate decision tree is compared with a univariate one through an example, and several multivariate decision trees are compared with each other. Examples and experiments verify that the algorithm is advantageous.
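
The abstract does not give the formulas behind results 1 and 2, so the following Python sketch is only one plausible reading of them, not the thesis's algorithm. It assumes categorical attributes, takes attribute importance to be the standard rough-set significance sig(a) = gamma_C(D) - gamma_{C-{a}}(D), uses a weighted matching coefficient as the similarity degree, and groups objects greedily against one representative per group. All function names, the toy table, and the threshold 0.9 are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Split row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, attrs, decision):
    """gamma_B(D): share of rows lying in a B-class with a single decision value."""
    consistent = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            consistent += len(block)
    return consistent / len(rows)


def importance(rows, attrs, decision):
    """Assumed importance: drop in dependency when one attribute is removed."""
    full = dependency(rows, attrs, decision)
    return {
        a: full - dependency(rows, [b for b in attrs if b != a], decision)
        for a in attrs
    }


def weighted_similarity(x, y, weights):
    """Assumed similarity: importance-weighted share of attributes on which x and y agree."""
    total = sum(weights.values())
    if total == 0:
        # No attribute is individually significant: fall back to plain matching.
        return sum(x[a] == y[a] for a in weights) / len(weights)
    return sum(w for a, w in weights.items() if x[a] == y[a]) / total


def condense(rows, weights, threshold):
    """Keep one representative per group of objects more similar than threshold."""
    reps = []
    for row in rows:
        if not any(weighted_similarity(row, r, weights) > threshold for r in reps):
            reps.append(row)
    return reps


# Toy decision table (hypothetical): the decision d depends only on attribute a.
rows = [
    {"a": 0, "b": 0, "d": "n"},
    {"a": 0, "b": 1, "d": "n"},
    {"a": 1, "b": 0, "d": "y"},
    {"a": 1, "b": 1, "d": "y"},
]
weights = importance(rows, ["a", "b"], "d")  # {'a': 1.0, 'b': 0.0}
reps = condense(rows, weights, threshold=0.9)  # keeps rows[0] and rows[2]
print(weights, reps)
```

In this toy table attribute b carries no weight, so rows that differ only in b count as fully similar and collapse into one representative. An unweighted matching coefficient would score them 0.5 and keep all four rows at the same threshold.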
Keywords/Search Tags: data mining, rough set theory, multivariate decision tree, similarity degree of attribute importance, relative generalization