Strategy Of Optimized Query In Frequent Subtree

Posted on: 2012-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhao
Full Text: PDF
GTID: 2178330332999782
Subject: Computer software and theory
Abstract/Summary:
In the information age, information of every kind is closely tied to people's everyday needs. With rapid social and technological progress, data and information are growing explosively and entering people's study, work, and daily life, which puts data storage and management to a severe test. XML, with its semi-structured characteristics, emerged in response, and how to manage XML data effectively in a database has become an active research topic.
In short, data mining is the non-trivial process of extracting useful, latent, noise-free, and novel knowledge from large volumes of vague, incomplete, and random collected data, and ultimately producing an understandable data model. By processing and analyzing such datasets, we can obtain the knowledge models needed for information processing and process control. The emergence of XML data, which is semi-structured and self-describing, confronts traditional relational database technology with a new revolution. Researchers have studied XML data in depth, from its form to its essence, including XML data mining tasks such as frequent pattern mining, classification, clustering, and association rules.
Mining frequent patterns from XML databases is an important research area, with applications in computer networks, information retrieval, medical information analysis, Web mining, bioinformatics, and other fields. Frequent subtree mining is an important part of frequent pattern mining. With the development of data mining, frequent subtree mining has become a new field of study and has been used in many practical applications. How to find frequent patterns quickly, and how to query and update their information, has become an urgent requirement.
In this thesis, we propose converting XML frequent subtrees into a relational database, so that queries on the XML document are translated into queries on relational tables. This approach can shorten access time and improve update efficiency. Query techniques rely on indexes, and indexes rely on encodings. Three encodings are in common use: region encoding, prefix encoding, and κ-ary tree encoding. Building on the Dietz encoding, we propose a new encoding, PLDC (Parent Level Dietz Coding), which extends the original code with the parent node's preorder number and the node's level number. The preorder traversal number serves as the primary key. Using this information, we index the nodes of the XML document and store their information. For any node in the XML document, the PLDC encoding yields the unique path from the root to that node. This process queries only the relational tables and does not need to traverse the whole XML document tree. In the rest of the thesis, we apply this idea to mine frequent subtrees and store them in relational tables, and thereby achieve query optimization.
The thesis is structured as follows. The first chapter briefly introduces XML data mining, frequent subtree mining, and encoding, along with the significance of this work, the state of research at home and abroad, the main research content, and the structure of the thesis.
The second chapter describes the relevant XML technology. A simple example XML document is used to introduce XML, analyze the structural information of XML documents, and describe their characteristics. This part focuses on the related technical background, including the document structure standards (DTD and Schema) and a comparison between them.
The third chapter describes the concepts of graph, tree, and subtree, and the classification of trees: free trees, rooted unordered trees, and rooted ordered trees.
It then derives the concept of a subtree and its classification into bottom-up, induced, and embedded subtrees. From a mathematical point of view, we define support and frequent subtree. Another focus of this chapter is encoding. There are currently three kinds of encoding: region encoding, prefix encoding, and κ-ary tree encoding. Building on the Dietz encoding, which is a region encoding, we propose a new encoding for indexing the XML document. The XML document is then converted into relational tables to facilitate querying.
The fourth chapter describes and analyzes the algorithm: parse the XML document into a DOM tree, index the DOM tree, mine the frequent subtrees, replace each original frequent subtree with a new tag that differs from the tags of the nodes in the XML document tree, store the subtrees in relational tables, and finally perform query, access, and update operations. The chapter ends with an analysis of the experimental results.
The fifth and final chapter summarizes the above work and proposes suggestions and directions for future work.
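The abstract does not give the PLDC table layout, so the following is only a minimal sketch of the idea under stated assumptions: the Dietz code is taken to be the usual (preorder, postorder) pair, PLDC adds the parent's preorder number and the level, and the preorder number is the primary key. The table name `node`, its column names, and the functions `pldc_rows`, `build_table`, `path_to`, and `is_ancestor` are hypothetical and chosen for illustration; they are not taken from the thesis.

```python
import sqlite3
import xml.etree.ElementTree as ET


def pldc_rows(root):
    """Return (pre, post, parent_pre, level, tag) for every element.

    Assumed layout: Dietz (pre, post) numbers plus the parent's
    preorder number and the node's level, as the abstract describes.
    """
    rows = []
    pre_counter = 0
    post_counter = 0

    def visit(elem, parent_pre, level):
        nonlocal pre_counter, post_counter
        pre_counter += 1
        pre = pre_counter              # number assigned on entry (preorder)
        for child in elem:
            visit(child, pre, level + 1)
        post_counter += 1              # number assigned on exit (postorder)
        rows.append((pre, post_counter, parent_pre, level, elem.tag))

    visit(root, None, 0)
    return sorted(rows)                # pre is unique, so this orders by pre


def build_table(xml_text):
    """Parse the document and store one row per node in an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (pre INTEGER PRIMARY KEY, post INTEGER, "
        "parent_pre INTEGER, level INTEGER, tag TEXT)"
    )
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
        pldc_rows(ET.fromstring(xml_text)),
    )
    return conn


def path_to(conn, pre):
    """Follow parent_pre links in the table to rebuild the root-to-node path."""
    tags = []
    while pre is not None:
        tag, pre = conn.execute(
            "SELECT tag, parent_pre FROM node WHERE pre = ?", (pre,)
        ).fetchone()
        tags.append(tag)
    return "/" + "/".join(reversed(tags))


def is_ancestor(conn, a, d):
    """Dietz test: a is an ancestor of d iff pre(a) < pre(d) and post(a) > post(d)."""
    row = conn.execute(
        "SELECT 1 FROM node x, node y WHERE x.pre = ? AND y.pre = ? "
        "AND x.pre < y.pre AND x.post > y.post",
        (a, d),
    ).fetchone()
    return row is not None


doc = "<library><book><title/><author/></book><journal><title/></journal></library>"
conn = build_table(doc)
print(path_to(conn, 4))  # node 4 is the <author> element
```

Once the table is built, both the path lookup and the ancestor test read only table rows; the XML tree itself is not traversed again, which is the property the abstract attributes to PLDC. The path lookup needs one primary-key lookup per level of the node.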
Keywords/Search Tags: XML Document, Data Mining, Frequent Subtree, Encoding, Optimization Query