Research On Web-Oriented XML Retrieval

Posted on: 2006-12-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z P Liang
Full Text: PDF
GTID: 1118360212482091
Subject: Computer application technology
Abstract/Summary:
As the global information matrix, the WWW contains all kinds of data, including structured, semi-structured and unstructured data. Much research has focused on Web information retrieval, but its current state is still far from satisfying Web users. XML has become the de facto standard for representing data on the WWW, and it provides a uniform data model for Web data. It is reasonable to expect that most of the data on the Web will be in XML; as a result, research on XML plays a crucial role in Web information retrieval. The dissertation focuses on the key techniques for XML information retrieval. Its main contributions are the following.

1. XML allows both the content and the structure of documents to be represented. An information retrieval model for XML called X2VSM is proposed, which extends the well-known vector space model (VSM). A term in VSM is augmented with a path specification under which it appears and is called an XTerm. We extend the definitions of term frequency (tf) and inverse document frequency (idf) to the needs of XML search. Only terms under the specified paths are involved in the calculation of the tf and idf values, and both are calculated dynamically with respect to the query. We define the weight of an XTerm and represent both XML data and queries as weight vectors. We also define the similarity between XML data and queries. With XTerm queries, both the structural and the content information of XML can be queried. Returned results can be whole documents or elements, each with a relevance ranking score.

2. We investigate the problem of XML document clustering and present a novel approach called path-based clustering (PBSC). Instead of comparing the structure of XML documents and clustering them directly, we cluster the paths contained in these documents. For each path, we form a cluster containing the documents that have that path. After that, we combine clusters that contain similar sets of documents. The resulting clusters contain documents that share a similar set of paths. Compared to approaches based on edit distance, PBSC is much more efficient.

3. The dissertation proposes an index structure based on a generalized suffix tree (PIGST) and presents a query evaluation algorithm. The distinct paths in an XML collection are mapped into strings. The PIGST construction algorithm for these path strings modifies and improves a well-known suffix tree construction algorithm that requires only linear time and space. Query processing needs only m character comparisons for direct containment queries, where m is the length of the query string. An efficient processing method for indirect containment queries, which avoids the inefficient tree traversal operation, is also presented.

4. We propose an index structure that combines a path index and a content index. We use inverted files extended with Dewey encoding to store the content index. Based on this index structure, an algorithm called XRank is proposed to process queries. The similarity between paths is investigated, and the similarity measure is integrated into XRank. Thus, XRank supports not only content-based similarity search but also structure-based (path-based) similarity search.
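
The XTerm weighting in contribution 1 can be illustrated with a minimal Python sketch. The abstract does not give the X2VSM formulas, so everything concrete here is an assumption made for illustration: a path specification is taken to match any element path that ends with it, idf is the textbook log(N/df) plus 1, and similarity is the cosine between weight vectors. The names xterms, matches and rank are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def xterms(xml_text):
    """Return a Counter of (path, term) pairs found in one XML document."""
    counts = Counter()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        # Only text directly inside the element is counted; tail text is ignored.
        for word in (elem.text or "").lower().split():
            counts[(path, word)] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def matches(path, spec):
    # Assumption: a path specification matches any element path ending with it.
    return path.endswith("/" + spec.strip("/"))


def rank(docs, query):
    """Score documents against a list of (path_spec, term) XTerms."""
    parsed = [xterms(d) for d in docs]
    n = len(docs)
    # tf[i][j]: occurrences of query XTerm j in document i, under matching paths only.
    tf = [[sum(c for (p, t), c in counts.items() if t == term and matches(p, spec))
           for spec, term in query]
          for counts in parsed]
    # df and idf are computed per query, not precomputed for the collection.
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(query))]
    idf = [math.log(n / d) + 1.0 if d else 0.0 for d in df]
    q_norm = math.sqrt(sum(w * w for w in idf))
    scores = []
    for i, row in enumerate(tf):
        vec = [f * w for f, w in zip(row, idf)]
        d_norm = math.sqrt(sum(v * v for v in vec))
        dot = sum(v * w for v, w in zip(vec, idf))
        score = dot / (d_norm * q_norm) if d_norm and q_norm else 0.0
        scores.append((score, i))
    # Highest score first; ties broken by document position.
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

Because tf and idf are derived from the path specifications in the query, the same word can carry a different weight under title than under body.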
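
The two steps of contribution 2 (one cluster per path, then merging clusters with similar document sets) can be sketched as follows. The abstract does not say how similarity between document sets is measured or how merging proceeds, so this sketch assumes Jaccard similarity, a threshold of 0.5 and a single greedy pass; cluster_by_paths and its helpers are hypothetical names, not the dissertation's PBSC implementation.

```python
import xml.etree.ElementTree as ET


def distinct_paths(xml_text):
    """Return the set of distinct root-to-element tag paths in one document."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def jaccard(a, b):
    # Assumed similarity measure between two sets of document ids.
    return len(a & b) / len(a | b)


def cluster_by_paths(docs, threshold=0.5):
    # Step 1: for each path, collect the documents that contain it.
    by_path = {}
    for doc_id, text in enumerate(docs):
        for path in sorted(distinct_paths(text)):
            by_path.setdefault(path, set()).add(doc_id)
    # Step 2: greedily merge path clusters whose document sets are similar.
    clusters = []
    for members in by_path.values():
        for cluster in clusters:
            if jaccard(cluster, members) >= threshold:
                cluster |= members
                break
        else:
            clusters.append(set(members))
    return clusters
```

No pairwise comparison of document trees is needed in this sketch; the work is proportional to the number of distinct paths, which is consistent with the efficiency claim over edit-distance approaches.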
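
The direct containment query of contribution 3 can be illustrated with a much simpler structure than PIGST. The sketch below is a naive suffix trie, not a generalized suffix tree: construction is quadratic in the path length rather than linear, and indirect containment queries are not covered. It also assumes that each tag acts as one symbol of the path string, so a query of m tags is answered with m lookups. PathSuffixTrie and its methods are hypothetical names.

```python
class Node:
    def __init__(self):
        self.children = {}  # tag -> Node
        self.paths = set()  # full paths containing the tag sequence ending here


class PathSuffixTrie:
    """Naive suffix trie over tag sequences (a stand-in for a suffix tree)."""

    def __init__(self):
        self.root = Node()

    def add(self, path):
        tags = path.strip("/").split("/")
        # Insert every suffix of the tag sequence, so that every contiguous
        # run of tags in the path is reachable from the root.
        for start in range(len(tags)):
            node = self.root
            for tag in tags[start:]:
                node = node.children.setdefault(tag, Node())
                node.paths.add(path)

    def direct(self, query):
        """Return the paths that contain the query as a parent/child chain."""
        node = self.root
        for tag in query.strip("/").split("/"):
            node = node.children.get(tag)
            if node is None:
                return set()
        return set(node.paths)
```

A query such as book/title is answered by following two edges from the root, regardless of how many paths are indexed.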
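
The content index of contribution 4 stores Dewey-encoded positions in an inverted file. As a minimal sketch of that idea, the code below assigns each element a Dewey ID (the root is 1, its second child is 1.2, and so on), builds a term-to-ID inverted list, and uses the common prefix of two IDs to find the deepest element containing both terms. This is not the dissertation's XRank algorithm, which the abstract does not specify; build_index and deepest_common_ancestors are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(xml_text):
    """Map each term to the Dewey IDs of the elements whose text contains it."""
    index = defaultdict(list)

    def walk(elem, dewey):
        for word in (elem.text or "").lower().split():
            index[word].append(dewey)
        # The k-th child of an element extends the parent's Dewey ID with k.
        for position, child in enumerate(elem, start=1):
            walk(child, dewey + (position,))

    walk(ET.fromstring(xml_text), (1,))
    return index


def common_prefix(a, b):
    # The common prefix of two Dewey IDs is the ID of their lowest common ancestor.
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def deepest_common_ancestors(index, term_a, term_b):
    """Return the deepest elements that contain both terms somewhere below them."""
    prefixes = {common_prefix(a, b)
                for a in index.get(term_a, [])
                for b in index.get(term_b, [])}
    if not prefixes:
        return set()
    depth = max(len(p) for p in prefixes)
    return {p for p in prefixes if len(p) == depth}
```

The ancestor relationship is decided by prefix comparison on the IDs alone, so the documents themselves do not have to be traversed at query time.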
Keywords/Search Tags: Web, XML, Schema, Information Retrieval, Dewey ID, Suffix tree