Font Size: a A A

Research On Frequent Pattern Of Tree Data

Posted on:2012-05-24Degree:MasterType:Thesis
Country:ChinaCandidate:C W QuFull Text:PDF
GTID:2178330332499355Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With appearing and accumulating more and more semi-structure data with tree feature, data mining technology is facing new struggle right now. The research on data mining is turning to semi-structure. Frequent pattern mining for semi-structure is an important field in data mining. Frequent pattern mining helps the other section of data mining. Frequent pattern of semi-structure is often used in cluster, classify and association rules etc.This paper mainly discusses the frequent tree pattern mining, including the following aspect of the field:(1) In the first two chapters, we discuss the basic process, concept, current status and problems of data mining firstly. And then, we introduce research and develop directions. Finally, we introduced the basic concept of frequent subtree mining, tree data categories, subtree categories and applications.(2) In the third chapter, the article discusses the standardization of the tree representation. Then, we give the standard representation for our algorithm. Edge-Counted Unordered Tree. The main algorithm in the paper compresses trees with this representation. The innovation of this representation is that indicating the actual relationship of two nodes, for both merged nodes and original nodes. Then, we give the Chain Structure for tree structure. Chain Structure and tree structure are isomorphic. Chain data is created and increased during data compressing. We have proved that it can be correctly transformed between tree and chain. After that, we give that Link_Ctre Algorithm for the transformation. Finally, the experiment has been done proves that Link_Ctre can correctly transform between unordered tree and chain structure.(3) In the fourth chapter, the article firstly introduces two kinds of strategies and analyses their advantages and disadvantages. On this basis, we give a new induced frequent subtree mining algorithm. PRTM algorithm. This algorithm compresses the Edge-Counted Unordered Tree, and add frequent subtrees accompany with compression. This algorithm avoids the disadvantage of Apriori strategy which needs testing a great number of candidates. Each subtree mined by PRTM is frequent. PRTM avoids the disadvantage of growth strategy which grows slowly. PRTM grows multiple nodes in each traversing. PRTM uses chain representation for tree as the middle result of the frequent subtrees. The chains grow when trees compressed. After travel, we transform the chains to subtrees. And by merging nodes, it compresses the tree as maximum as possible. In the experiment, we use PCITMiner Algorithm as the compare algorithm. We do the experiments with different scale datasets on both synthetic dataset and real dataset. The experiments compare the two algorithms on time efficiency performance and the number of frequent subtree results. According to the experiment, we can get the conclusion:PRTM algorithm has better performance on large-scale dataset. so it is much more suitable than PCITMiner for mining frequent subtree in large-scale dataset.
Keywords/Search Tags:Data Mining, Frequent Pattern Mining, Frequent Induced Subtree, PRTM Algorithm, Data Compression, Link Structure, Edge-Counted Unordered Tree
PDF Full Text Request
Related items