Font Size: a A A

Research On The Frequent Substructure Mining Algorithm For Graph Classification

Posted on:2012-08-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:X H ZouFull Text:PDF
GTID:1118330338991322Subject:Circuits and Systems
Abstract/Summary:PDF Full Text Request
With the development of computer science and information technology,data mining technology has been widely applied to artificial intelligence,pattern recognition,bioinformatics and many other fields.The demand of mining on complex data is rising now. Experts have paid attention to these fields and tried to solve the problems by virtue of the experience of structured data mining.In computer science, the graph is one of the most complicated data structures. Its'rich semantic increases more complexity of the data structure and more difficulty for mining the interesting graph structures. The graph theory is often used together with various technologies in graph data mining. Graph can be intuitively presented and has a wide variety of applications both in research and business, e.g. CAD circuit analysis, molecular model analysis, finding user's interest in Web browsing and data compression. About graph mining, frequent subgraph mining is the foundation of graph classification and graph clustering. Accordingly, how to derive the frequent subgraph patterns from the great volume of graph-structured data became one of the hottest issues in data mining field. This is also the focus of this paper.Firstly, in this paper the basic concept of graph, the type of graph mining approaches and the classic frequent subgraph mining algorithm are introduced. Then the existing classic graph mining algorithms are analysed and pointed out the excellence and the inferior aspects. The problems of graph mining are also introduced in this section, the direction for future research is given.Secondly, to resolve the problem that gSpan algorithm will produce a lot of redundant subgraphs while extending frequent subgraphs, an improved algorithm CSGM is proposed, which using ADI++ storage structure instead of the adjacency list storage structure in gSpan, and meanwhile an approache of deleting the non-DFS code is also present. It can handle larger graph dataset and guarantee that a canonical code can be obtained at each extension. It not only avoids calculating the support of the non-canonical code graphs but also avoids the calculation of whether an input code is a canonical code or not, the new algorithm reduces the computation. Hash table is used to store graph's hash address and calculate the support of frequent subgraphs during mining, it can avoid scanning the databases repeatly and reduce the counts of subgraph isomorphism judgment. The experiments have been done on the actual and simulate date sets to verify the correctness, the efficiency and the ability to handle large graph databases .Afterwards, the result of frequent subgraphs mining is used as a feature candidate set, to select a small set of frequent and discriminative patterns for classification from it, an algorithm of selecting feature patterns is provided in this article. The objective function is given by introducing the discriminative measure of information entropy and information gain. The method for classification feature selection is explained. Meawhile, in order to reduce the searching space of graph patterns and improve the efficiency, both vertical pruning and horizontal pruning strategies are proposed. The qualityof mining results and running efficiency are verified by experiment.Finally, an algorithm of graph classification based on frequent patterns is provided. By analyzing the relationship between pattern frequency and its predictive power, such a analysis suggests a strategy for setting min_sup. Furthermore,the discriminative frequent subgraphs are used as feature patterns to construct the classification rules, the classifier is trained by associative classification, and the new graph is distinguished. The performance of the classifier is verified by experiment.
Keywords/Search Tags:Graph mining, Subgraph isomorphism, Canonical code, Feature graph pattern, Information entropy, Associative classification
PDF Full Text Request
Related items