Font Size: a A A

Research And Application On Frequent Substructure Mining Algorithms

Posted on:2012-10-26Degree:DoctorType:Dissertation
Country:ChinaCandidate:H B LiFull Text:PDF
GTID:1118330362955191Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In the field of chemical information, biloloy information, web analysis, XML data emerging recently, structural data types, such as trees and graphs, are needed to reprent infomation. Data mining in these structural data types, is helpful to gain information and knowledge. Mining frequent itemset is one most primary method of data mining. How to efficiently mine out the frequent substructure patterns in the data set of structural type data, is a challenging problem.By analyzing of existing efficient algorithm for mining frequent substructures, we found that the core idea of these algorithms can be divided into the joining method based the Apriori principle and the extension method based on pattern growth. By analyzing and absorbing the two methods, we proposed a new hybrid PJE method. Applied PJE to different types of structured data for mining frequent patterns, we propose new efficient algorithms for mining frequent substructure. Finally we proposed a new method for substructure indexing, and implemented a chemical structure database prototype system in a relational DBMS.During minig frequent rooted unordered trees, minimum depth first sequences are used as canonical forms of pattern trees. By extending based on prefix nodes of trees, new candidated pattern trees can be gained in const time. Enumerating candidated pattern trees by depth extending and width joining, the Apriori rule can be utilized to reduce the number of candidated pattern trees. For the pattern trees been enumerated, the Apriori rule can still be used to prune, so the number of candidated pattern trees can be cut down further. Using canonical embedded occurring list to record the occurrence of pattern trees in graph databases, and from it to count the frequency of occurring of pattern trees, will not only avoid complete subgrah isomorphism testing, but also save large memory. Putting all discussed above together, a frequent rooted unorderd trees mining algorithm, called PJE, is proposed. By performance testing in both synthetic dataset and real data set, it is proved that the efficiency of substructure data minig has been improved in some degree.In the minig of frequent free trees, centre nodes and bicentre nodes of free trees are defined first. Free trees can be transformed into rooted unordered trees by using the centre of bicentre nodes as root nodes. By finding out backbone path and backbone string in free trees, minimum backbone-first depth-first sequences of free trees are defined. This kind of sequences are used as canonical forms of free trees. Based on this type of canonical forms, depth extending and width joining using prefex nodes can be done on the pattern free trees. New candidated pattern free trees can be acquired in const time by the mthod. For the pattern free trees been enumerated, the Apriori rule can still be used to prune, and the frequency of occurring of pattern free trees can counted by using canonical embedded occurring list. Base on all methods discussed above, a frequent free trees mining algorithm, Free-PJE, is proposed. By performance testing in both synthetic dataset and real data set, it is proved that the efficiency of substructure data minig has been improved in great degree.In the mining of frequent subgraphs, a graph is devided into a graph core which not containing leaf nodes, and a branch forest which not containing ring structures. The connction vector specifys the nodes where the branch subtrees are conneted on the graph core. The minimum"core-branch-conection vector"3-tuples are used as the canonical forms of graphs. Frequent pattern graph core are gained by patten extending , and all minimum connection vectors are enumerated at the same time. By a specified graph core and it's minimum connection vectors, graphs can be considered as a virtual rooted unordered trees. depth extending and width joining using prefex nodes can be done on the virtual trees, and new candidated pattern graphs can be gained in constant time. Prune based on the Apriori rule and and counting of frequency of occurring using canonical embedded occurring list can be done. A frequent subgraph mining algorithm, Graph-PJE, is proposed. By performance testing on both synthetic dataset and real data set, it is proved that the efficiency of substructure data minig has been improved greatly.To improve the response time of graph query, it's needful to build graph index in graph database. Feature subgraphs in the graph database and their transaction occurring list can be used to build graph index. Graph query acquired candidated query result set by graph index and query graph, then each of the candidated result graphs is verified whether contains the query graph completely. The result of frequent subgraphs mining can be used as graph index, and the number of the candidated query result set can be preserved less than the minimum support in the frequent subgraphs mining. Common prefix index tree and effective transaction occurring list can be used to reduce the size of the frequent subgraph index. In real mocular structure databases, 6-rings and 5-rings can be regarded as virtual atoms. After the mocular structure database been reconstructed, the size of the frequent subgraph index can be reduced further greatly. By experiments on real data sets, it's proved that the frequent subgraph indexes have high performance.Using the new frequent substructure indexing and querying methods, we designed and implemented a chemical database prototype system in DM DBMS. In DM relational database, the chemical structure data and chemical structure index data has been stored by using relational tables; the functions to store, index, query chemical structure of data has been implemented by using external stored procedures.
Keywords/Search Tags:frequent substructure mining, PJE Algorithm, Root-PJE Algorithm, Free-PJE Algorithm, Graph-PJE Algorithm, 3-tuple of core-branch-connction-vector, frequent subgraph index
PDF Full Text Request
Related items