Research On Interactive Information Retrieval From XML Documents

Posted on: 2011-05-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y M Guo
Full Text: PDF
GTID: 1118330332486368
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Over the last few years, XML has become a de facto standard for information representation and data exchange over the Internet. With the emergence of more and more XML documents, how to retrieve information from them effectively and efficiently has become an important research problem. Because XML documents are semi-structured data lying between structured and unstructured data, they have a logical structure and corresponding semantics. This raises the hope of improving on traditional keyword retrieval systems by using the structure of XML documents for more focused and exact search. However, exploiting that structure effectively and efficiently is a huge challenge because the structure of XML documents is heterogeneous, complicated and extensible.

Although XML query languages from database technology can express more complex queries, it is difficult for users to form an exact query that expresses their information need, because they are unfamiliar with the structure of the documents. Hence, XML information retrieval faces the same ambiguity in expressing the user's information need as traditional information retrieval. In fact, XML information retrieval is an interactive retrieval process: key technologies of interactive information retrieval, such as relevance feedback and clustering of retrieval results, also apply to XML information retrieval and pose further challenges. For example, the first problem is how to use user relevance feedback to reformulate an initial keyword query into a content-and-structure query that better expresses the user's information need; the second is how to cluster the results returned by an XML information retrieval system effectively and quickly to better support user browsing.
Hence, another challenge faced by XML information retrieval is how to use interactive information retrieval technology to improve retrieval effectiveness.

To address these problems, this thesis focuses on interactive information retrieval from XML documents. The research comprises two issues. The first is how to retrieve information from XML documents effectively and efficiently by combining the content and structure of the documents, covering the node numbering scheme, index, retrieval model and query processing algorithm. The second is how to resolve the ambiguity in expressing the user's information need in XML information retrieval by studying key technologies of interactive information retrieval, such as relevance feedback and clustering of retrieval results.

The research work and contributions of this thesis consist of the following four aspects:

1) Research on XML node numbering scheme and index structure. We design a novel node numbering scheme for XML documents that encodes their structural information effectively and efficiently. Based on this numbering scheme, we construct an index structure for XML information retrieval that integrates a text content index with a structure index and supports both keyword retrieval and content-and-structure queries.

2) Research on XML information retrieval model and query processing algorithm. We propose an XML information retrieval model, the extended vector space model with fuzzy structure matching, which extends the traditional vector space model to support fuzzy structure matching and relevance scoring. The extension has two aspects: one extends the concept of a "term" to that of a "structural term"; the other extends exact matching of terms to fuzzy matching of terms.
This novel retrieval model integrates keyword retrieval, label-keyword retrieval and path-keyword retrieval into a unified retrieval model. Furthermore, we design an effective query processing algorithm for XML information retrieval and implement a prototype system.

3) Research on relevance feedback in XML information retrieval. To overcome the deficiencies of current relevance feedback methods for XML information retrieval, we propose a novel method that extends the traditional relevance feedback method. It reformulates the initial keyword query into a content-and-structure query by effectively combining content and structure relevance feedback. The method is composed of three algorithms: content term expansion, term path expansion and retrieval granularity feedback.

4) Research on clustering the results returned by an XML information retrieval system. We develop a novel feature extraction model for XML documents by exploring their characteristics. The model combines content features and structure features to measure the similarity between XML documents. We model the clustering of XML retrieval results as a k-center clustering problem. By improving the greedy algorithm for the k-center problem, we develop a novel fast clustering algorithm for clustering the results of XML information retrieval.

To evaluate our algorithms and methods, we designed and carried out a series of experiments comparing them with previously proposed algorithms. The results show that our algorithms are more efficient and effective than the others.
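
As an illustration of the k-center formulation mentioned in aspect 4, the following is a minimal sketch of the standard greedy (farthest-first) heuristic for k-center clustering. It is not the improved algorithm developed in the thesis, whose details are not given in this abstract. The function name `greedy_k_center`, the `distance` callback and the 2-D sample points are assumptions made for illustration; in an XML retrieval setting the distance would presumably be derived from the combined content and structure features.

```python
from math import dist
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def greedy_k_center(
    items: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
) -> Tuple[List[int], List[int]]:
    """Standard farthest-first heuristic for k-center (illustrative only).

    Returns (indices of chosen centers, cluster index for every item).
    """
    if not items or k <= 0:
        return [], []

    centers = [0]  # assumption: the first item seeds the first cluster
    # nearest[i] is the distance from item i to its closest chosen center
    nearest = [distance(items[0], x) for x in items]
    assign = [0] * len(items)

    while len(centers) < min(k, len(items)):
        # Next center: the item farthest from all centers chosen so far
        far = max(range(len(items)), key=nearest.__getitem__)
        if nearest[far] == 0:
            break  # every remaining item coincides with an existing center
        centers.append(far)
        # Reassign items that are closer to the new center
        for i, x in enumerate(items):
            d = distance(items[far], x)
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1
    return centers, assign


# Hypothetical stand-in for document feature vectors
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (30.0, 0.0)]
centers, assign = greedy_k_center(points, 3, dist)
print(centers, assign)
```

Each round costs one pass over the items, so the sketch needs on the order of n·k distance evaluations, which is why this family of heuristics suits clustering a result list on the fly.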
Keywords/Search Tags: XML Document, Interactive Information Retrieval, Node Numbering Scheme, Retrieval Model, Relevance Feedback, Clustering