Font Size: a A A

Research On The Algorithm For Mining Structured Data

Posted on:2010-07-05Degree:MasterType:Thesis
Country:ChinaCandidate:J WuFull Text:PDF
GTID:2178360275996304Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of computer science and information technology, data mining technology has been widely applied to data warehousing, artificial intelligence, pattern recognition, bioinformatics and many other fields. During the course of research, more and more questions are presented. The rapid appearance of biological information technology, social network analysis and XML query path research has made the relationship of things become more and more complex, and people began to realize that trees and graphs could describe these data structures more clearly, and therefore more useful information could be mined. In a tree or graph, a vertex represents an object and relation between two objects is denoted as an edge. However, traditional data mining algorithms on unstructured data such as sequence, transaction, text etc. have been out of work on structured data. Therefore research on structured data mainly on tree and graph, combined with experience and method on unstructured data to achieve high-performance mining operation has become an important field of research.In this paper, the main issues pay on the hot problems of data mining, including induced tree and embedded frequent tree mining on ordered tree and unordered tree, subgraph isomorphism judgement and frequent subgraph mining, as well as the graph data index and query and graph edit distance measurement, detailed introduction as follows:(1) Knowledge and key technologies on data mining and frequent pattern mining have been studied, especially introduced denotion and standardization method in tree and graph data, the growth way of substructures and key factors on tree mining and graph mining, and then analysed present algorithms to find the contribution that they made and areas that still need be improved.(2) Considering a tree could be denoted by a data structure equivalently the existing tree mining algorithms, yet denotions like this could not describe a tree intuitively, and it will make an excellent tree mining method could not be carried, therefore the growth of frequent tree will become complicated and the whole efficiency of tree mining would thereby affected. In this paper we creatively put out the concept of father-son relationship code to denote the tree intuitively, and base on this as well as pattern growth method, we brought out the frequent tree mining algorithm ITMA and ETMA, which effectively improved tree mining efficiency.Besides, based on the characteristic of preorder code, we put out the standardization method for unodered tree preorder equal code and therefore the frequent tree algorithm for unordered tree UITMA and UETMA.(3) The key problem as well as most difficult problem in graph mining is subgraph isomorphism. Present algorithms brought out the isomorphism method respectively, so that each graph could only be denoted in one code. However the performance of them is not so good. During graph mining process, each time when a k-subgraph grows to k+1-subgraph, there will take subgraph isomorphism operation, as a result the efficiency of judgement operation will directly affect the whole performance of graph mining. This paper creatively put out the judgement algorithm based on incidence matrix, so that reduced the complexity of isomorphism judgement. Based on this as well as depth first pattern growth strategy, the efficiency of graph mining is significantly improved.(4) In graph mining research, graph data query is as important as subgraph mining because there are lots of applications. In this paper, we proposed the method based on edge mapping and classified frequent patterns to establish graph index and reduced the scale of candidate results, which could effectively improved the speed of graph query. Simultaneously, to an issue that emerged as a powerful role in graph mathing and pattern recognition, graph similarity and edit distance calculation, we propose a way based on GA and cost matrix. Experiments showed that this method could be used as the rule to numerate the distance between two labeled graphs.
Keywords/Search Tags:data mining, structured data, frequent subtree, subgraph isomorphism, frequent subgraph, graph index, graph inquiry, similar graph, graph edit distance
PDF Full Text Request
Related items