Research On The Index Technology Of Semi-structured Data

Posted on:2011-02-03

Degree:Master

Type:Thesis

Country:China

Candidate:G M Li

Full Text:PDF

GTID:2178360305955064

Subject:Computer software and theory

Abstract/Summary:

Since its introduction, XML has shown great promise. With the rapid development of Web Services in recent years, XML appears more and more frequently in data exchange and storage, and it has effectively become the data exchange standard and a cornerstone of SOA. As XML is applied ever more widely on today's Internet, the volume of data in XML format is growing exponentially, which brings problems of storing and managing XML data. How to store and manage massive amounts of XML data efficiently has become a pressing problem. Query processing is an important part of XML data storage and management, and creating an appropriate index for XML data has become the key to XML database technology.

Because XML data is semi-structured, it differs from the relational data of the relational data model. Although some mature techniques from traditional databases can be applied to XML data, the inherent characteristics of XML data mean that XML indexing faces many new challenges. The indexing technology for semi-structured data, as represented by XML, therefore differs from that of traditional relational databases.

Addressing the characteristics of XML data, this thesis describes XML-related technologies and analyzes the storage modes of XML data. For the organization of XML indexes, the research focuses on indexing strategies and index features of XML data. Among XML data encoding schemes, the analysis emphasizes bit-vector encoding, prefix encoding, region encoding and binary encoding.
Among XML query processing algorithms, the research emphasizes the SUPEX query processing algorithms and the structural join algorithm based on region partitioning.

On the question of creating indexes for XML data, and building on an in-depth study of XML data storage and management together with the technical characteristics of current Internet search engines, this thesis proposes combining structural queries with the inverted index technology of search engines, and creates a joint index that combines a structural index with a full-text index. An encoding mechanism suited to a B+ tree based index is put forward, and a B+ tree is used to build the structural index over XML data nodes that have undergone a special encoding process. For the text content of XML elements, the traditional inverted index method of search engines is used to build a full-text index over the content of XML data nodes. Corresponding improvements are made to node encoding, index structure and query processing for the XML document tree, which improve the efficiency of both structural queries over XML data and keyword-based full-text search. For query processing, typical cases of simple queries, content queries and complex queries are analyzed, and the corresponding join algorithms are studied.

To support keyword-based full-text search, this thesis redesigns the encoding scheme of XML documents.
Since prefix encoding of the XML document tree effectively supports both the computation of containment relationships and the computation of the position of any node in the document tree, a new encoding method based on prefix encoding is proposed: a path identifier is added to the traditional Dewey node identifier to record the path information of each node. The content and structural relationships of elements can then be obtained from the node identifiers, while order information is recorded by the Dewey code. The path identifier and the Dewey identifier together are called a Path Dewey Pair, or PD code for short. In the PD coding system, the path value is the serial number assigned to distinct nodes of the XML document tree during a traversal from the root. During this traversal, if the current element has the same name as its sibling element, the two share the same path value and the suffix of the Dewey value increases; otherwise their path values increase while the Dewey value remains the same. This makes it easier to determine whether two nodes are in an ancestor-descendant relationship or a sibling relationship.

For the full-text index, the inverted file method of search engines is adopted to create a keyword-based index over the content of each node element of the XML document tree. For the XML document as a whole, the index is built with a B+ tree over the XML document after it has undergone the special encoding process. The keyword-based full-text index over content consists of ordered pairs, each made up of a term from the text content of a node element and the PD code of that text element, so the index supports queries that use both structural path expressions and content keywords. The key of the structural index consists of the attributes of the node element and the serial number of its PD code.
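
The PD encoding described above can be illustrated with a minimal Python sketch. It assumes, as one possible reading of the abstract, that the path value is a serial number given to each distinct root-to-node tag path in document order, and that the Dewey component is the tuple of child positions starting from the root. The function names (pd_encode, is_ancestor, is_parent, is_sibling) and the sample document are hypothetical and are not taken from the thesis.

```python
import xml.etree.ElementTree as ET


def pd_encode(root):
    """Return (tag_path, path_id, dewey) triples in document order.

    Assumption: path_id numbers each distinct tag path as it is first seen;
    dewey is the tuple of 1-based child positions from the root.
    """
    path_ids = {}  # distinct tag path -> serial number
    codes = []

    def visit(elem, tag_path, dewey):
        # setdefault keeps the existing number for a path seen before
        path_id = path_ids.setdefault(tag_path, len(path_ids) + 1)
        codes.append((tag_path, path_id, dewey))
        for pos, child in enumerate(elem, start=1):
            visit(child, tag_path + "/" + child.tag, dewey + (pos,))

    visit(root, "/" + root.tag, (1,))
    return codes


def is_ancestor(a, d):
    """True if Dewey label a is a proper prefix of Dewey label d."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(a, d):
    """True if d extends a by exactly one component."""
    return len(d) == len(a) + 1 and d[:len(a)] == a


def is_sibling(x, y):
    """True if x and y are different labels with the same parent prefix."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]


doc = ET.fromstring(
    "<lib><book><title>XML</title></book>"
    "<book><title>Index</title></book></lib>"
)
codes = pd_encode(doc)
for tag_path, path_id, dewey in codes:
    print(path_id, ".".join(map(str, dewey)), tag_path)
```

In this sketch the two book elements receive the same path value and Dewey labels 1.1 and 1.2, so the sibling test reduces to comparing label prefixes, and the ancestor test to a prefix check.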
Entries in the keyword query index point to the actual physical address of the node.

When querying an entire XML document, XPath is usually used for the query path expressions, and XPath contains many axes. For the horizontal axes, the PD encoding allows a quick judgment, while for the vertical axes, ancestor-descendant and parent-child relationships can be determined directly. A simple path expression can be answered by traversing the XML structural index tree, while keyword-based full-text search over element text content is answered through the node content index using the inverted method. For complex path expressions, the strategy used in this thesis is as follows: decompose the complex query path expression into simple elements, then combine the query results of these simple elements with join operations to obtain the final result that satisfies the query conditions.

In the experiments, the effect of using the index to query XML data is validated. The structural index and the full-text index are tested on different data sources; simple path expression queries are run on specific XML data, and complex queries and keyword-based full-text search are tested separately. The experimental data show that the index achieves good results for queries using XML path expressions and for keyword-based queries. The design of the XML index considers only static XML data; frequent changes to the XML data source are not taken into account. How to synchronize and update when the data source changes frequently is therefore the focus of future research. It is believed that, with continued research, XML indexing technology will see major breakthroughs.
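
The decompose-then-join strategy for queries that mix a path expression with a keyword can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's implementation: a plain dictionary stands in for the B+ tree structural index, nodes are identified by Dewey labels only, and the join is a naive nested loop rather than the structural join algorithm studied in the thesis. The names build_indexes and path_with_keyword and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(root):
    """Build a structural index and an inverted keyword index."""
    structure = defaultdict(list)  # tag path -> Dewey labels (stand-in for a B+ tree)
    inverted = defaultdict(list)   # keyword -> Dewey labels of elements containing it

    def visit(elem, tag_path, dewey):
        structure[tag_path].append(dewey)
        for word in (elem.text or "").lower().split():
            if dewey not in inverted[word]:
                inverted[word].append(dewey)
        for pos, child in enumerate(elem, start=1):
            visit(child, tag_path + "/" + child.tag, dewey + (pos,))

    visit(root, "/" + root.tag, (1,))
    return structure, inverted


def is_ancestor_or_self(a, d):
    """True if Dewey label a is a prefix of (or equal to) Dewey label d."""
    return d[:len(a)] == a


def path_with_keyword(structure, inverted, tag_path, keyword):
    """Answer 'nodes on tag_path whose subtree text contains keyword'.

    Step 1: simple path lookup in the structural index.
    Step 2: keyword lookup in the inverted index.
    Step 3: join the two result lists on the ancestor-or-self relationship.
    """
    candidates = structure.get(tag_path, [])
    hits = inverted.get(keyword.lower(), [])
    return [c for c in candidates
            if any(is_ancestor_or_self(c, h) for h in hits)]


doc = ET.fromstring(
    "<lib><book><title>XML index</title></book>"
    "<book><title>Query processing</title></book>"
    "<book><title>Inverted index</title></book></lib>"
)
structure, inverted = build_indexes(doc)
print(path_with_keyword(structure, inverted, "/lib/book", "index"))
# [(1, 1), (1, 3)]
```

Because both partial results are lists of Dewey labels, the join step only needs prefix comparisons and never has to revisit the document itself.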
The research work described in this thesis is supported by the "Key Technology Study of Semi-structured Database" project of the Jilin Province Technology Development Plan (20090704).

Keywords/Search Tags:

Encoding Method, Index Structure of XML, B+ Tree, Inverted List, Query Processing

Related items

1	Catalogue Research Of XML Database
2	Research On The Index Technology Of Semi-structured Data
3	Research On Inverted List Parallel Query Method Based On Dataspaces
4	A Index Technology And Query Method For XML Document Based On Textnode
5	Based On The Index Technology Of Xml Query Optimization Research
6	Research On F&B Index Structure Supporting XML Query
7	Index Compression And Query Processing In Search Engines
8	Research On Math Query Language And Index In Web-based Math Search
9	Research On Skyline Query Algorithm Based On New Data Index Structure
10	Some Research On On-Line Index For Dynamic Text