Research On Chinese XML Information Retrieval System

Posted on: 2005-05-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W M Qu
GTID: 1118360122493287
Subject: Computer software and theory
Abstract/Summary:
An XML information retrieval system differs greatly from a traditional information retrieval system in three respects: the construction of both the inverted text index and the structural index, the evaluation of both keyword-based and structure-based queries, and the effect of structural information on result relevance ranking. To manage large-scale XML documents with complicated structure, this dissertation focuses on an efficient structural indexing algorithm for XML data, the result size estimation problem in structure-based XML query optimization, a result relevance ranking algorithm, and an infrastructure for XML query processing over both text-rich and data-rich XML documents. To address these issues, this dissertation makes the following contributions.
First, it investigates the drawbacks of existing indexing algorithms for XML data and proposes DifX, a dynamic indexing algorithm for XML data based on D-bisimilarity. DifX dynamically determines which structural information needs to be indexed according to the actual query load and index optimization.
Second, to account for the effect of structural information on result relevance ranking, this dissertation proposes a ranking algorithm that considers both the frequency distribution and the structural distribution of keywords in the result, together with a dynamic element-oriented method for computing keyword weights. Experimental results demonstrate the effectiveness of this solution.
Third, this dissertation analyzes why result size estimation for structure-based XML query optimization is more complex than its counterpart in traditional relational databases, and proposes SXM, a full-featured result size estimation algorithm for XML queries. For simple path expression queries, it proposes XMap, a dynamic synopsis model for XML data based on the concepts of F-stability and B-stability. For complicated path expression queries, it adopts an improved bifocal sampling method. For value predicates in XML queries, it proposes a wavelet-based multi-dimensional histogram. Finally, SXM integrates these three estimation algorithms through the XMap scheme to provide an estimate for the whole XPath query.
Fourth, this dissertation presents W2X (Way to XML), a prototype Chinese XML document retrieval system developed by the author. W2X has several merits: it can retrieve Chinese XML documents; it can process both text-rich and data-rich XML data; and it adopts the efficient indexing and query processing algorithms introduced in this dissertation, which enable it to manage large-scale XML data.
To summarize, this work makes XML document retrieval systems more efficient, accurate, and practical.
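As a rough illustration of why a structural synopsis helps with result size estimation for simple path expressions, the following minimal sketch counts elements per root-to-element label path. It is not the dissertation's XMap or DifX; the function names `build_path_synopsis` and `estimate_result_size` and the sample document are assumptions made for this example. For tree-shaped XML, grouping elements by their incoming label path gives exact counts for absolute child-axis paths.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_path_synopsis(xml_text):
    """Count how many elements share each root-to-element label path."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # Iterative depth-first traversal carrying the label path of each node.
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        counts[path] += 1
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return counts


def estimate_result_size(synopsis, simple_path):
    """Result size of an absolute child-axis path such as /lib/book/title."""
    return synopsis.get(simple_path, 0)


# Hypothetical sample document, used only for illustration.
SAMPLE = """<lib>
  <book><title>XML</title><author>A</author><author>B</author></book>
  <book><title>IR</title><author>C</author></book>
  <paper><title>Index</title></paper>
</lib>"""

synopsis = build_path_synopsis(SAMPLE)
print(estimate_result_size(synopsis, "/lib/book/author"))
```

A synopsis like this answers simple paths without touching the data, but it says nothing about branching paths or value predicates, which is where the abstract's sampling and histogram techniques come in.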
Keywords/Search Tags: XML information retrieval system, indexing algorithm, relevance ranking algorithm, result size estimation