Font Size: a A A

Research On Techniques Of Mining Frequent XML Patterns

Posted on:2009-11-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y J BeiFull Text:PDF
GTID:1118360242483030Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Due to its simplicity, scalability, interoperability, openness, and flexibility, XML data are widely used in the areas such as data exchange, data integration, data distribution, data storing, data management, knowledge management, and information retrieval, etc. Discovering patterns in XML data has recently become an interesting topic. However, the traditional database mining approaches mainly handle the structured data and may not adequately address the problem of mining semi-structured patterns.Recently, the research community has seen growing interests in extracting useful information or patterns from XML database. Researchers have proposed a few algorithms to mine various frequent patterns from XML data. However, due to the irregular, complex nature of XML data, there still remain many kinds of frequent patterns to be mined from the XML data. Moreoever, there is still no uniform and abstract model to describe the frequent patterns mining process of XML data.In this thesis, we first study the model and representation of XML data, and then propose a uniform and abstract framework of mining frequent patterns from various XML data. Based on this framework, we present our study on the following mining problems: (1) mining frequent XML tag sequences, (2) the offline and online mining approaches of frequent XML query patterns, and (3) mining frequently changing structures from historical XML versions:Research on mining frequent XML tag sequences for XML document clusteringUsing the idea of "divide and conquer", we propose a frequent XML tag sequence miner based on the lattice theory. We partition the database into equivalence classes based on common prefix sequences and mine frequent patterns within each equivalence class. An XML clustering algorithm is proposed using the mined tag sequence. When clustering, we not only take the containment between the documents and the sequences into consideration, but also handle the sequence characteristics, such as the length, the frequency, and the continuity of the sequence in the original documents. In this way, we improve both the evaluation results of the similarity between XML documents and the clustering quality.Research on mining frequent XML query patterns for XML cachingWe analyze the structure characteristics of XML queries, and propose a tree mining algorithm named BUXMiner for finding XML query patterns. BUXMiner employs an efficient bottom-up approach, which enumerates all candidate trees over a compact global tree guide and computes the support of candidate tress on the mentioned tree guide. In addition, we also propose a mining approach called BUMXMiner, based on BUXMiner, to discover the maximal frequent XML query patterns. We apply our XML query pattern mining algorithm to a caching prototype system to evaluate the query performance improvement. We evaluate the performance of our proposed approaches and show that our algorithm outperforms previous ones in terms of efficiency. Furthermore, we illustrate that the caching scheme utilizing the proposed frequent query pattern mining results is more efficient compared to traditional caching policies such as LRU and MRU.Research on online mining of frequent XML query patterns in streamsWe propose an online algorithm to mine frequent XML query patterns from XML query stream. In this problem, we employ the sliding window model to control the memory which stores the sample XML queries. To better maintain the query stream in the pool we introduce a data structure named global trie. And to generate sub-queries, we enumerate the query from bottom to top based on the concept of prefix equivalence class. Experiments show that our approach can deal with XML query stream effectively and stably.Research on frequently changing structure mining from XML historical versions using two-bitmap structure B-DOMWe first study the dynamic XML structure mining problem and describe the mining framework of XML changing structures, as well as the measured criteria of frequently changing patterns. We then introduce the concepts of frequent insertion structures and frequently deletion structures. Based on these concepts, we present the mining approach of frequently changing structure from different versions of XML documents. We propose a two-bitmap structure called B-DOM as a maintainer to store and manage the changing information of the XML data, and extract frequently changing structure from the B-DOM. Experimental results show that our algorithm is both efficient and effective.
Keywords/Search Tags:XML, data mining, frequent patterns, tag sequence mining, query patterns, changing structure, clustering, query caching
PDF Full Text Request
Related items