
Research on Data Discretization Methods Based on Rough Set Theory

Posted on: 2010-07-28    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Cheng    Full Text: PDF
GTID: 2178360272996622    Subject: Software engineering
Abstract/Summary:
The rapid development of computer networks and information technology has brought great changes to human society, and data has become an extremely important strategic resource. In recent years, large-scale databases have grown quickly with the spread of e-commerce and e-government, the emergence of open resources, and the invention of high-speed data-acquisition tools. As the size, scope, and depth of database applications constantly expand, the amount of data grows exponentially. It was reported that the global volume of information doubled every 20 months during the 1980s. In the 1990s, with the development and popularization of the Internet and the adoption of corporate intranets, extranets, and virtual private networks, the world became a global village: through the Internet, people can exchange information and work together across time and space. Today the databases available to people are no longer confined to a single sector, unit, or local industry, but span the whole world.

Faced with this fast-growing scale of information, data mining has become a focus of researchers' attention. As the most important part of knowledge discovery, data mining extracts the knowledge that interests users from huge amounts of data. However, data collected in the real world are often unsuitable for direct use in the data-mining process, because such large-scale data frequently contain incomplete and inconsistent information and generally have continuous feature spaces.

Because data collected from the real world often contain noise and may be inaccurate or incomplete, this thesis develops a discretization algorithm for continuous attributes based on rough set theory, the decision table, and information entropy. Rough set theory is a mathematical tool for characterizing incomplete and non-deterministic information: it can effectively analyze incomplete information and data, reason about and discover hidden knowledge, and reveal potential laws. The theory is built on a classification mechanism; it treats a classification as the partition of a given space induced by an equivalence relation. The main idea of rough set theory is to use already known knowledge to describe imprecise or uncertain knowledge. The most prominent difference between rough set theory and other theories that deal with uncertainty and imprecision is that rough set theory needs no prior information beyond the data set to be processed, so its description and handling of a problem's uncertainty can be said to be more objective.

In essence, discretization uses a set of breakpoints to partition the attribute space of a decision table. To improve the clustering capability of the system and enhance its robustness to data noise, as few breakpoints as possible should be used. From this point of view, under the unified discernibility relation, the discretization that uses the fewest breakpoints on the system is optimal.
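To make the rough set machinery above concrete, the following is a minimal sketch, on a hypothetical toy decision table not taken from the thesis, of how the indiscernibility relation over condition attributes induces equivalence classes, and how a decision class is then described from above and below by its approximations:

```python
# Minimal sketch of rough set approximations over a toy decision table.
# Objects, attribute values, and decisions here are hypothetical.
from collections import defaultdict

# Decision table: object -> (condition attribute values, decision value)
table = {
    "x1": (("low",  "yes"), "flu"),
    "x2": (("low",  "yes"), "flu"),
    "x3": (("low",  "yes"), "healthy"),  # indiscernible from x1/x2, different decision
    "x4": (("high", "no"),  "healthy"),
    "x5": (("high", "yes"), "flu"),
}

# Equivalence classes: objects with identical condition values are indiscernible.
blocks = defaultdict(set)
for obj, (cond, _) in table.items():
    blocks[cond].add(obj)

target = {o for o, (_, d) in table.items() if d == "flu"}  # the imprecise set to describe

# Lower approximation: equivalence classes certainly contained in the target.
lower = set().union(*(b for b in blocks.values() if b <= target))
# Upper approximation: equivalence classes that possibly intersect the target.
upper = set().union(*(b for b in blocks.values() if b & target))

print(sorted(lower))  # ['x5']                      -- certainly flu
print(sorted(upper))  # ['x1', 'x2', 'x3', 'x5']    -- possibly flu
```

The gap between the two approximations (here x1, x2, x3) is exactly the region the known knowledge cannot decide, which is how rough set theory describes an imprecise concept using only the data set itself.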
Consequently, the number of breakpoints that are truly meaningful (the effective candidate breakpoints) affects the computation of the solution subset, and thus the efficiency, capability, and time and space overhead of the whole discretization algorithm. In this paper, a new improved algorithm for the discretization of continuous attributes is put forward on the basis of the decision table and rough set theory, introducing the new notions of "weight of condition attribute" and "projection of equivalence class". By determining the importance of each condition attribute to the decision table and comparing the values of condition attributes with the projections of the equivalence classes, the algorithm quickly rules out unnecessary candidate breakpoints and thereby optimizes the candidate breakpoint set, improving efficiency and saving time.

To measure the importance of candidate breakpoints and further improve the quality of the resulting breakpoints, after the candidates are selected this paper uses an information-entropy algorithm to choose the result breakpoints, building on the work of Xie Hong, Cheng Hao-Zhong, et al. [21], which applies the information entropy of rough sets to the discretization of continuous attributes.

Information entropy expresses the capability of an information system to describe information. The smaller the information entropy, the more strongly a single decision value dominates a set of examples, and therefore the smaller the degree of confusion. In particular, the information entropy equals zero if and only if all examples in the set have the same decision value. This property guarantees that the discretization algorithm of this thesis does not change the compatibility of the decision table. Assume L = {Y1, Y2, ..., Ym} is the set of equivalence classes into which the decision table has been divided by the breakpoint set P. After adding a new breakpoint c that does not belong to P, the new information entropy is

H(c, L) = H_{Y1}(c) + H_{Y2}(c) + ... + H_{Ym}(c).

The smaller H(c, L) is, the purer the new equivalence-class subsets produced by the division at the breakpoint are, so H(c, L) reflects the importance of the breakpoint c.

In the calculation of information entropy, the relationship between the breakpoints and the sets determines the way the entropy is computed. To ensure both the efficiency of the algorithm and the accuracy of the calculation, defining a reasonable threshold is very important; depending on the application area and the discretization method, the size of the threshold can vary within reasonable limits. During discretization, once the computed information entropy falls below the threshold, the division of the sets can stop.

Comparison with other algorithms in simulation experiments shows that the algorithm of this paper performs discretization correctly and effectively and is not sensitive to the total number of attributes of the decision table; when the number of samples and attributes increases, the algorithm still maintains high efficiency.
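As an illustration of the entropy-driven selection just described, the following is a minimal sketch in which H_Y(c) is taken to be the size-weighted Shannon entropy of the decision values in the two sub-blocks that the cut c creates inside the equivalence class Y, and selection stops once the entropy of the partition drops below the threshold. This is one plausible reading of the measure built on [21], not a verbatim reproduction of the thesis's algorithm; all data are hypothetical.

```python
# Sketch of greedy, entropy-driven breakpoint selection in the spirit of
# H(c, L) = H_{Y1}(c) + ... + H_{Ym}(c) described above. The precise
# definitions in [21] may differ; the samples below are toy data.
import math
from collections import Counter

def shannon_entropy(decisions):
    """Shannon entropy of the decision-value distribution; zero iff pure."""
    n = len(decisions)
    return -sum((k / n) * math.log2(k / n) for k in Counter(decisions).values()) if n else 0.0

def H(c, block):
    """Entropy contribution H_Y(c): split the equivalence class at cut c
    and weight each side's decision entropy by its relative size."""
    left  = [d for v, d in block if v <= c]
    right = [d for v, d in block if v > c]
    n = len(block)
    return (len(left) / n) * shannon_entropy(left) + (len(right) / n) * shannon_entropy(right)

def select_breakpoints(samples, threshold=0.1):
    """Greedily add the cut minimising H(c, L) over the current partition L,
    stopping when the partition entropy falls below the threshold."""
    values = sorted({v for v, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # midpoint cuts
    blocks = [list(samples)]                                        # L starts as one block
    chosen = []
    while candidates:
        best = min(candidates, key=lambda c: sum(H(c, b) for b in blocks))
        chosen.append(best)
        candidates.remove(best)
        blocks = [part                                              # re-split at the new cut
                  for b in blocks
                  for part in ([p for p in b if p[0] <= best],
                               [p for p in b if p[0] > best])
                  if part]
        if sum(shannon_entropy([d for _, d in b]) for b in blocks) < threshold:
            break
    return chosen

if __name__ == "__main__":
    # (continuous attribute value, decision value) pairs -- toy data
    samples = [(0.8, "no"), (1.0, "no"), (1.3, "yes"), (1.4, "yes"), (1.6, "no")]
    print(select_breakpoints(samples))  # [1.15, 1.5]
```

The candidate cuts here are simply the midpoints between adjacent distinct attribute values, a standard choice; the thesis's own pruning of this candidate set via the "weight of condition attribute" and "projection of equivalence class" is not reproduced in this sketch.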
The algorithm of this paper is suitable for mixed knowledge-based systems that contain both continuous and discrete attributes. Owing to the complexity of practical problems, it is very difficult for any single discretization algorithm to guarantee that every problem can be solved and that the knowledge system finally obtained is optimal; moreover, the computational efficiency of an algorithm also decides whether it can serve as a viable basis for solving large-scale problems. Exploring new discretization algorithms therefore remains very necessary.
Keywords/Search Tags: Information Entropy, Rough Set, Continuous Attributes, Discretization