Font Size: a A A

Study On Frequent Subtree Mining And Its Application In XML Mining

Posted on:2009-07-27Degree:MasterType:Thesis
Country:ChinaCandidate:Y W ZhuFull Text:PDF
GTID:2178360245475961Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Frequent patterns mining is an important problem in data mining domain. It involves mining transactions, sequences, trees and graphs. Especially frequent subtree mining is widely used in domains such as bioinformatics, web-mining, chemical data structure mining, and so on. It has attracted tremendous interest among researchers. With rapid development of Web technology and application, XML has become the standards for data representation and exchange over the Internet. It is a great challenge topic of mining the useful knowledge from semi-structured XML data. This paper studies on frequent subtree mining and frequent subtree mining applied in XML mining with maximal frequent embedded subtree mining, mining maximal frequent induced or embedded subtrees using length-decreasing support constraint, XML frequent pattern mining and XML documents clustering by structure. The main contributions of the paper are listed as follows.(1) Discuss the actualities of the recent researches on frequent subtree mining. Moreover the related algorithms using both synthetics and real datasets are implemented, and the virtue and deficiency of these algorithms are compared.(2) Propose a novel algorithm CMPETreeMiner for discovering maximal frequent embedded subtrees from ordered labeled trees. The attribute of node's scope is added in the preorder sequence to store tree. Pseudo-projection technique is used during projection and each node is encoded. Mining the encoded frequent sub-sequence can be directly correspond to an embedded frequent subtree. Several techniques are proposed to prune the frequent sub-sequence that do not correspond to closed or maximal frequent embedded subtree using code and scope information. Experiment results show that CMPETreeMiner is efficient and effective.(3) Propose an algorithm called SCCMTreeMiner, that finds all closed and maximal frequent induced subtrees using length-decreasing support constraint. SCCMTreeMiner uses rightmost expansion technique to systematiclly enumerate all candidate subtrees. Two new pruning methods are proposed to improve the performance of SCCMTreeMiner. During the mining process support decreases as the length of the tree increase. Experimental results show that SCCMTreeMiner not only generates more concise result set, but also runs much faster.(4) Present algorithm SCCMETreeMiner which can find all closed and maximal frequent embedded subtrees using length-decreasing support constraint. SCCMETreeMiner combines the rightmost path expansion scheme and projection technique to construct pattern growth space, and uses several techniques to prune the branches of the enumeration trees that do not correspond to closed or maximal frequent subtrees under constraint. Experimental results show that SCCMETreeMiner is effective and efficient, and it generates more concise result set, which is irredundant and interesting to users.(5) Present algorithm GCFS for clustering XML document structure based on frequent subtree patterns. It firstly mines all maximal and closed frequent induced subtrees from XML documents; then chooses some subtree patterns to form the clustering features, weights these features according by the length of subtree pattern, computes the similarity of two XML documents by cosine fuction, uses K-Means algorithm to cluster documents by clustering features. Experiment results show that GCFS is effective and efficient, its performance is superior to other XML clustering algorithms.(6) Present an impoved algorithm GCFS~* for clustering XML document structure based on frequent subtree patterns. During the maximal frequent subtree mining process it can get the best minimum support automatically; then choose some subtree pattern to form the optimistic clustering features; uses CLOPE algorithm to cluster documents by clustering features, without giving the number of cluster. Experiment results show that GCFS~* is more effective and efficient then GCFS.
Keywords/Search Tags:Frequent subtree mining, XML mining, Length-decreasing support constraint, Maximal frequent subtree, XML frequent pattern, XML clustering by structure
PDF Full Text Request
Related items