
Research Of Reduct Algorithm Based On Rough Set Theory

Posted on: 2006-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Chen
Full Text: PDF
GTID: 2168360155952949
Subject: Computer software and theory
Abstract/Summary:
Knowledge Discovery in Databases (KDD) emerged in the late 1980s, driven by the need to handle the steadily growing volume of data and to overcome the bottleneck of extracting knowledge from it in the information society. KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large volumes of data. Data Mining (DM) is the core step of KDD. Among the methods of KDD and DM, Rough Set theory is an effective one. It was founded by the Polish mathematician Z. Pawlak in 1982 as a new mathematical tool for dealing with vagueness and uncertainty. Within Rough Set theory, the reduction of an information system and the search for a minimal attribute subset are the most pivotal and the toughest problems. A reduct is a minimal attribute subset that preserves the classification power of the original data set and thus reflects the essence of the information system. Unfortunately, it has been shown that finding all reducts, or the minimum reduct, is NP-hard. As data sets grow in both dimension and volume, the basic reduct algorithms become extremely impractical. We therefore expect to find a heuristic algorithm that obtains a reduct as short and effective as possible in a short time; this is exactly a heuristic search problem. This thesis discusses that problem and tries to find a more efficient heuristic reduct algorithm. An effective reduct algorithm is the foundation of applying Rough Set theory in KDD and DM. Finding a fast reduct algorithm remains a main research issue of Rough Set theory and an important issue in applied computer science. At present, Rough Set theory has been successfully applied in artificial intelligence, knowledge acquisition, and other fields. In this paper, we briefly discuss the reduct and related issues on the basis of Rough Set theory.
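The notions above — a reduct as a minimal attribute subset preserving classification power, and the exponential cost that makes exhaustive search impractical — can be illustrated with a small sketch. This is in Python rather than the thesis's C++, and the decision table, attribute names, and function names are invented for illustration only:

```python
from itertools import combinations

def partition(objects, attrs):
    """Group objects into equivalence classes of the
    indiscernibility relation IND(attrs)."""
    classes = {}
    for obj in objects:
        key = tuple(obj[a] for a in attrs)
        classes.setdefault(key, []).append(obj)
    return list(classes.values())

def preserves_classification(objects, attrs, decision):
    """True if every IND(attrs)-class is pure with respect to the
    decision attribute, i.e. attrs keeps the classification power
    of the full condition set (for a consistent table)."""
    return all(len({o[decision] for o in cls}) == 1
               for cls in partition(objects, attrs))

def all_reducts(objects, cond_attrs, decision):
    """Brute-force search for minimal preserving subsets.
    Exponential in the number of attributes -- exactly why the
    problem calls for heuristics on real data sets."""
    reducts = []
    for k in range(1, len(cond_attrs) + 1):
        for subset in combinations(cond_attrs, k):
            if preserves_classification(objects, subset, decision):
                # keep only minimal subsets: skip supersets of a
                # reduct already found
                if not any(set(r) <= set(subset) for r in reducts):
                    reducts.append(subset)
    return reducts
```

For a toy table where the decision depends only on attribute `b`, `all_reducts` returns `[('b',)]`; with tens of attributes the `combinations` loop already becomes infeasible, matching the NP-hardness result cited above.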
The paper is organized as follows. It first presents the basic concepts of Data Mining and Rough Set theory, together with the classical reduct algorithms based on Rough Set theory. On that basis, we survey books and papers on Rough Set theory and summarize the latest achievements and research areas. We also introduce Petri net theory and the simulator CPN Tools. Building on Rough Set theory, we define the "Discriminable relation" and the "Discriminable Matrix", and propose several reduct algorithms, such as the "Heuristic Reduct Algorithm Based on an Evaluation Index". This heuristic algorithm follows a bottom-up design: it takes the core as the computational basis of the reduct and searches for a reduct guided by heuristic information, the Evaluation Index. According to this heuristic information, the algorithm adds attributes one by one into the attribute set until the set is a reduct. We then discuss the performance of these algorithms. The algorithm proposed here and the algorithm proposed by Hu, a heuristic reduct algorithm using the discernibility matrix and attribute frequency, are both implemented in standard C++. We test these algorithms on 53 data sets for feasibility and accuracy, and compare the results with those of the "Discernibility Matrix" method in Rose. The heuristic reduct algorithm proposed by Hu stores the whole discernibility matrix in memory; it works well when the data set is very small but becomes impractical as the data set grows, and its result, carrying much redundant information, is generally a superset of the reduct. The algorithm proposed here achieves a higher accuracy rate of 83%. It cannot guarantee to find the exact reduct every time, but it usually obtains a reduct with fewer attributes. Its space complexity has been reduced to O(m·n) or O(n²), and it takes little time to produce the reduct when the data set is small.
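The bottom-up, core-first strategy described above can be sketched as follows. This is an illustrative Python sketch, not the thesis's C++ implementation; since the exact definition of the thesis's Evaluation Index is not reproduced here, the sketch substitutes a simple placeholder score (how many still-undiscerned object pairs an attribute covers), in the spirit of attribute-frequency heuristics:

```python
def discernibility_pairs(objects, cond_attrs, decision):
    """For each pair of objects with different decision values,
    collect the set of condition attributes that discern them
    (assumes a consistent table, so the set is never empty)."""
    pairs = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if objects[i][decision] != objects[j][decision]:
                diff = {a for a in cond_attrs
                        if objects[i][a] != objects[j][a]}
                if diff:  # an empty diff would mean inconsistency
                    pairs.append(diff)
    return pairs

def core(pairs):
    """Attributes that are the only discerning attribute for some
    pair -- they must appear in every reduct."""
    return {next(iter(p)) for p in pairs if len(p) == 1}

def heuristic_reduct(objects, cond_attrs, decision):
    """Bottom-up search: start from the core, then repeatedly add
    the highest-scoring attribute (placeholder score: number of
    still-uncovered pairs it discerns) until every pair of
    differently classified objects is discerned."""
    pairs = discernibility_pairs(objects, cond_attrs, decision)
    red = set(core(pairs))
    uncovered = [p for p in pairs if not (p & red)]
    while uncovered:
        scores = {a: sum(a in p for p in uncovered)
                  for a in cond_attrs if a not in red}
        best = max(scores, key=scores.get)
        red.add(best)
        uncovered = [p for p in uncovered if best not in p]
    return red
```

Like the thesis's algorithm, this sketch does not guarantee a minimal reduct, but the greedy addition typically keeps the attribute set short; unlike Hu's approach, it need not hold the full discernibility matrix if the pair sets are generated on the fly.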
The algorithm is also effective on very large data sets. We implement the parallel algorithm corresponding to the algorithm proposed by Jun Li in standard C++ with the popular parallel programming tool MPI, and test it on the data sets mentioned above. The results show that the parallel algorithm is efficient and practical. Furthermore, the paper analyzes its load balance with CPN Tools and reaches satisfactory conclusions. On the whole, the reduct algorithm proposed here is feasible and performs well. Owing to its particular advantages, Rough Set theory has been successfully applied in many fields, and more and more researchers are paying attention to this attractive area; as a new direction, it has become one of the most active research fields. Issues worth pursuing in future work include: (1) the heuristic reduct algorithm proposed here can only handle a consistent system with one table, but in real life there exist many...
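The idea behind such a parallelization can be illustrated serially: partition the object-pair index space into chunks as MPI ranks would receive them, let each "rank" compute partial attribute-discernibility counts independently, and merge the partial results as a reduction step would. The following Python sketch is an assumption about the general scheme, not the thesis's actual C++/MPI code; the chunking strategy and function names are invented:

```python
from collections import Counter

def pair_chunks(n_objects, n_workers):
    """Split the (i, j) object-pair index space into roughly equal
    chunks, one per worker (standing in for MPI ranks)."""
    pairs = [(i, j) for i in range(n_objects)
                    for j in range(i + 1, n_objects)]
    size = -(-len(pairs) // n_workers)  # ceiling division
    return [pairs[k:k + size] for k in range(0, len(pairs), size)]

def partial_frequencies(objects, chunk, cond_attrs, decision):
    """Work done independently by one worker: count how often each
    attribute discerns a decision-relevant pair in its chunk."""
    counts = Counter()
    for i, j in chunk:
        if objects[i][decision] != objects[j][decision]:
            counts.update(a for a in cond_attrs
                          if objects[i][a] != objects[j][a])
    return counts

def merged_frequencies(objects, cond_attrs, decision, n_workers=4):
    """Serial stand-in for the final reduction: sum the partial
    counters from all workers into global attribute frequencies."""
    total = Counter()
    for chunk in pair_chunks(len(objects), n_workers):
        total += partial_frequencies(objects, chunk,
                                     cond_attrs, decision)
    return total
```

Because each chunk is processed independently, the even split of the pair index space is what determines load balance — the property the thesis examines with CPN Tools.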
Keywords/Search Tags: Algorithm