Font Size: a A A

Research On The Index Technology Of Semi-structured Data

Posted on:2012-07-21Degree:MasterType:Thesis
Country:ChinaCandidate:Z Q FanFull Text:PDF
GTID:2178330332499574Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Semi-structured data is the data between structural data and completely non-structured form of data. XML as a special semi-structured data provides a standardized, flexible and powerful service, considered as an open standard, it can be exchanged in different platforms and applications, and XML is the best selection to meet the network functions. With the rapid development of Internet, XML data filling the entire Internet, it is no exaggeration to say that XML is taking over the online world, and is the basis for the web services and SOA. In case of the vast amounts of XML data, efficient query becomes urgent needs, which require good indexing mechanisms. Right now, the hotspot and the focus of the Semi-structured data technology is index technology of XML dataThis research focus on semi-structured data indexing technology, particularly XML data indexing technology, and proposes a new index structure.This paper first describes the details of the characteristics of XML documents, as well as its compliance with specifications, and introduces the basic features and basic knowledge of XML, including the XML schema information, DTD, XML Schema, XML document processing and the main query mode. At the end of the second chapter describes the macro-analysis of XML Indexing. The third chapter introduces the mainstream of index technology and study the index technology deeply which plays a vital role in querying data, mainly analyzes the structure of XML indexes, XML structure tree coding scheme, including the prefix code, area code and the k-tree coding scheme, and analysis the index structure of SUPEX mode and stack-based merge join algorithm.This article proposes a hybrid combination of the index structure, including the structure and content part, not only supports the structure path expression query, but also quickly support the content based on keyword search, just like the internet search engine.In case of XML data, choosing a good encoding is the top priority of dealing with XML data. Scholars often use the Dewey encoding, but this way of coding has a flaw. If a document tree structure is very complicated, then the coding associated with Dewey will result in a longer length when the encoding the back of node, which is not conducive to storage and processing, and the Dewey encoding can not support the XML documents changing frequently.If choosing the regional coding that bases on Li-Moon, all nodes code length is fixed, and also reserved node interval, which can be give convenience for inserting new nodes in future, giving new node codes, rather than re-coding of all nodes.In order to support the proposed hybrid index structure better, we utilize a more excellent PL encoding mechanism. On the basis of regional coding, we add a path identification to the traditional Li-Moon node identification to record the path information from root node, when using the path expression queries, it can match the path of rapid identification, get a immediately response, improve query efficiency, by adding a level identification to each node,it can help quickly determine whether or not the two nodes are parent-child relationship, further improve the efficiency of processing XML documents, and reduces the space occupied by the code.In this paper, we use B+ trees to build the structure index to support structure query. B+ tree only stores the records in leaf nodes, and the tree of intermediate nodes stored the key code, which is used to guide the index to retrieve the appropriate leaf node, intermediate node, the node stores a pointer pointing to his son, and then the bottom of the leaf node stores, the real record of the XML tree structure. In this paper, the B+ tree leaf node is used to store the index of the actual data object, and by defining two pointers pointing to the root of the tree and key values of the smallest leaf node. Structure established by this index, not only can find any node by the order of leaf nodes in, but also can be used to find nodes from the top down to the leaf nodes by random.We learn from search engine inverted list ideas, put forward the contents of two-dimensional inverted list index structure to support the keyword query for the content. General structure of a single inverted list, it is easy to produce the document information redundancy, to increase the space occupied by the index, and reduce the efficiency of search queries, especially the inverted list structure obviously can not be extended, inflexible, if a document changes, all the documents need to update the inverted records. Two-dimensional inverted index table to provide detailed information on each node, and a corresponding pointer, can overcome the shortcomings of general inverted list, saves space, reduce redundancy of data, flexibility and improve content retrieval efficiency.Through experiments, for several different data sources to establish the index, compared with other major index, this hybrid index structure in the index space, both the structure of the index or text index, have a great advantage, also fully demonstrates the encoding used to improve the PL has a huge advantage, semi-structured data in the different areas to promote the use; Terms in the query, using a different path query expressions, including the simple path query, and complex path queries, by using a unique link algorithm to get a good query time efficiency; Full text keyword-based queries, can target specific keywords are given to quickly check out the location of the corresponding node, has a good query efficiency, compared to other indexes, this index structure in the query time has some advantages; Maintenance updates in the index, for which the data set of data to do a simple document into action, the use of the area code, has been set aside for the new node inserted a space, so when you insert a new node, the corresponding index The time required to update a very small, while others need to re-index all the updated index structure of this article has some real-time update capability. The hybrid index structure in this paper, mainly for the simple query, but does not fully support the structure of complex queries, and deals mainly with static XML documents, XML data for the frequent changes in the fully dynamic updates still need further study.All of this research project from the Technology Development Program of Jilin Province "Key Technology Study of Semi-structured Database" (20090704)...
Keywords/Search Tags:XML, Index, Encoding Method, B+ Tree, Two-dimensional inverted list
PDF Full Text Request
Related items