Research On The System Based On XML Integrating Data Query And IR

Posted on: 2007-08-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Han
GTID: 1118360215477610
Subject: Control theory and control engineering
Abstract/Summary:
XML has become the standard for data representation and data exchange on the Internet. As the volume of XML documents and data grows rapidly, many questions arise, the most important of which is how to query these documents effectively and efficiently. Here, querying a document means structural query, which involves the document storage mechanism, the index structure, and related issues. These questions have attracted a great deal of attention. Another important question, which has emerged recently, is XML-based information retrieval. Faced with a huge amount of network information, a user must be able to retrieve the information that is truly useful.

Most search engines are keyword-based and rank results by page links or by content. If the pages involved are written in XML, searching them calls for reasonable use of XML's structural characteristics, its semantic characteristics, and other related properties. XML-based information retrieval is therefore an extremely significant question, and it has recently become a focus of the academic community.

Information retrieval and structural query for XML are related research questions, yet each has its own characteristics. How to integrate the two technologies more closely deserves further research. This thesis focuses on improving both query technologies and on integrating them better so that user queries can be optimized.

To reach this goal, a substantial amount of research work is carried out in this thesis. First, a full-text retrieval predicate is introduced into XQuery, and the resulting retrieval language is named XQuery+ (XQuery Plus). The main principle in defining XQuery+ is to use simple, mature technology to integrate information retrieval and structural query for XML.
The XQuery+ language has the following characteristics: (1) it is based on XQuery, the most popular query language for XML; (2) a retrieval predicate is introduced to extend XQuery's retrieval capability; (3) Boolean operations are also supported.

Handling structural queries over XML documents efficiently is another task of this thesis. The foundation of structural query is the node numbering scheme and the index structure. At present, most index structures and query algorithms are built on the region-based numbering scheme, which suffers from some drawbacks. This thesis presents a novel and efficient numbering scheme that combines label path information with data path information and efficiently supports all kinds of queries. The properties of this numbering scheme are discussed in detail. Based on the node numbering scheme, a compact index structure named HiD is also proposed. The HiD index comprises a structural index and a value index; it can answer queries efficiently and supports information retrieval. Finally, comprehensive experiments are designed to assess all of the presented techniques.

XML-based information retrieval is the third research topic of the thesis. Relevance scoring is a core problem in XML information retrieval, and this thesis addresses it. A new and effective relevance scoring algorithm is presented that takes both structural and semantic information into account. The concept of distance is used to evaluate structural relevance, and a semantic weight is adopted for semantic relevance. To assess the quality of a relevance algorithm for top-k computation, a new measure is proposed. Experiments show that the algorithm can significantly improve precision and recall for XML information retrieval.

Building on this research on structural query and information retrieval, query algorithms and mechanisms for structural query and information retrieval are proposed.
These algorithms can effectively handle XQuery and XQuery+ queries. Although they are all based on the HiD index structure, their characteristics differ, as do the objects they process. For XQuery, two algorithms are proposed to handle path pattern queries and tree pattern queries respectively; algorithms for more complex queries can be built on these two. For XQuery+ queries, there are two retrieval algorithms: XQuery+G-1 scores on the fly during query processing, while XQuery+G-2 simply calculates relevance after query processing. Finally, the performance and efficiency of each algorithm are investigated in detail. To evaluate performance, several existing query algorithms from the literature are selected for comparison with ours. The experimental results indicate that both the structural query algorithm and the information retrieval algorithm proposed in this thesis perform better, improving query efficiency and reducing query time.

The final piece of work in this thesis is the development of a prototype system based on these technologies. The architecture is introduced first, as part of the analysis and design process. After the design of the prototype's modules is discussed, the function and implementation of each module are described. The prototype system is developed in Java. As its architecture and module functions show, the prototype supports both structural query and information retrieval for XML. Its features include: (1) an open, hierarchical structure, under which new functions and algorithms can be flexibly added; (2) two filter mechanisms and two result presentation methods, which extend the prototype's performance and usability.
All these features lay a foundation for a future commercial version.
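
For readers unfamiliar with the region-based numbering scheme that the abstract names as the common baseline, the following is a minimal sketch of that baseline only. It is not the numbering scheme proposed in the thesis, whose details the abstract does not give. The function names `number_regions`, `is_ancestor`, and `is_parent` and the sample document are illustrative assumptions. Each element receives a (start, end, level) triple from a depth-first walk, so that ancestor-descendant tests reduce to interval containment.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# (start, end, level): a hypothetical layout for a region-based label.
Region = Tuple[int, int, int]


def number_regions(root: ET.Element) -> Dict[ET.Element, Region]:
    """Label every element with (start, end, level) via a depth-first walk."""
    regions: Dict[ET.Element, Region] = {}
    counter = 0

    def visit(node: ET.Element, level: int) -> None:
        nonlocal counter
        counter += 1              # number assigned when the element opens
        start = counter
        for child in node:
            visit(child, level + 1)
        counter += 1              # number assigned when the element closes
        regions[node] = (start, counter, level)

    visit(root, 0)
    return regions


def is_ancestor(a: Region, d: Region) -> bool:
    """a is an ancestor of d if a's interval strictly contains d's."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(a: Region, d: Region) -> bool:
    """Parent test: containment plus a level difference of exactly one."""
    return is_ancestor(a, d) and d[2] == a[2] + 1


# Illustrative document (an assumption, not taken from the thesis).
doc = ET.fromstring("<book><title/><chapter><section/></chapter></book>")
regions = number_regions(doc)
book = regions[doc]
title = regions[doc.find("title")]
chapter = regions[doc.find("chapter")]
section = regions[doc.find("chapter/section")]
```

In this sketch, structural relationships are decided from the labels alone, without walking the tree. One commonly cited cost of such labels is that inserting a node can force later nodes to be renumbered; the abstract itself does not say which drawbacks it has in mind.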
Keywords/Search Tags: XML, Data Query, Information Retrieval, Node Numbering Scheme, Index, Relevance Score, Query Algorithm, Prototype System