Research On Rich-text XML Document Retrieval

Posted on: 2007-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: T J Jiang
Full Text: PDF
GTID: 2178360212958663
Subject: Computer application technology
Abstract/Summary:
XML is a self-describing, extensible language that specifies both content and structural information. The number of XML documents in Web pages, commercial text repositories, digital libraries and similar collections has grown exponentially, so efficient information retrieval from these large collections is becoming extremely important.

By content, XML documents fall under two views: the document-centric view and the data-centric view. Querying data-centric XML data is a well-explored topic, with powerful database-style query languages such as XPath and XQuery set to become W3C standards. An equally compelling paradigm for querying document-centric XML documents is IR-style search on textual content.

Information retrieval (IR-style) differs from database search (DB-style) in that it is a process of inexact, vague and partial matching. An XML document is semi-structured data with a hierarchical structure and text content. Traditional IR cannot be extended directly to XML documents, for four reasons: (1) traditional keyword search does not leverage the structural information of XML documents, whereas XML information can be retrieved by structure (path) conditions as well as content conditions, which requires integrating full-text search with structure queries; (2) XML retrieval with structural information returns XML elements (or fragments) within documents, whereas traditional information retrieval returns entire documents; (3) a unified ranking mechanism is needed for vague content and structure (VCAS) retrieval; (4) in XML retrieval, the weight of a node is influenced by several different factors.

In this paper, we analyze the features of XML documents from the viewpoint of information retrieval, and discuss the vagueness of users' natural-language queries and the factors that influence the ranking of VCAS retrieval results. Then, using the logical integrity of the answer node, we analyze how structure and content can be relaxed in vague XML retrieval, and propose a way to find the best query granularity from the query expression extracted from a vague natural-language query and from the search paths in the XML tree. Based on these, a ranking model is designed to handle these new features, and a search engine is implemented in VC. Furthermore, to account for how familiar the user is with the document structure, we propose the new idea of confident structure queries and vague structure queries, and present a new configurable ranking model, which is designed to handle these...
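
As a rough illustration of points (1) and (2) above, the following Python sketch combines a keyword condition with a path condition and returns individual elements rather than whole documents. It is not the thesis's ranking model: the sample document, the function names (`iter_with_paths`, `content_score`, `structure_score`, `search`) and the scoring choices (term density for content, 1.0 for an exact path-suffix match, 0.5 for a relaxed match) are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = """<article>
  <title>XML retrieval</title>
  <section>
    <title>Ranking</title>
    <para>Ranking XML elements by content and structure.</para>
  </section>
  <section>
    <title>Indexing</title>
    <para>Building an index over element paths.</para>
  </section>
</article>"""


def iter_with_paths(elem, prefix=()):
    """Yield (path, element) for elem and all of its descendants."""
    path = prefix + (elem.tag,)
    yield path, elem
    for child in elem:
        yield from iter_with_paths(child, path)


def content_score(elem, terms):
    """Term density: query-term occurrences divided by the element's word count."""
    words = re.findall(r"\w+", "".join(elem.itertext()).lower())
    if not words:
        return 0.0
    hits = sum(words.count(t.lower()) for t in terms)
    return hits / len(words)


def structure_score(path, condition):
    """1.0 if the path ends with the condition, 0.5 if the condition's tags
    only appear in order along the path (relaxed match), else 0.0.
    The weights 1.0 and 0.5 are arbitrary assumptions."""
    condition = tuple(condition)
    if not condition:
        return 1.0
    if path[-1] != condition[-1]:
        return 0.0  # the target element itself must match
    if path[-len(condition):] == condition:
        return 1.0
    remaining = iter(path)
    if all(tag in remaining for tag in condition):
        return 0.5
    return 0.0


def search(xml_text, terms, condition):
    """Return (score, path, text) for matching elements, best first."""
    root = ET.fromstring(xml_text)
    results = []
    for path, elem in iter_with_paths(root):
        score = content_score(elem, terms) * structure_score(path, condition)
        if score > 0:
            text = " ".join("".join(elem.itertext()).split())
            results.append((score, "/".join(path), text))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__":
    for score, path, text in search(DOC, ["ranking", "xml"], ("section", "para")):
        print(f"{score:.3f}  {path}  {text}")
```

Multiplying the two scores means an element must satisfy both the content and the structure condition to be returned at all; a weighted sum would instead let a strong content match compensate for a weak structural match, which may be closer to what vague structure queries call for.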
Keywords/Search Tags: XML Retrieval, Answer Node, Weight, SCAS, Ranking