Font Size: a A A

Study On Query Processing Techniques In XML Search

Posted on:2011-07-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:X P LiuFull Text:PDF
GTID:1118360332456138Subject:Management Science and Engineering
Abstract/Summary:PDF Full Text Request
XML (extensible markup language) has been widely adopted due to the characteristics of extensibility and self-description, leading to a large number of documents in XML format. With the increase in the volume of XML documents, there is a strong urge to query XML documents effectively. However, the well-developed database-style query processing techinques do not meet the demand of many naive users. The first reason is that XML documents are usually loosely structured, with rich text embeded in nested elements, it is generally difficult for users to understdand the structures of XML documents. Moreover, the formal XML query languages, such as XPath and XQuery, are often too complex for naive users to master. As a consequence, many users turn to the IR-style querying of XML documents, which has a user-friendly query interface, and information retrieval has attracted considerable attentions.The information retrieval techniques are rather mature nowadays, but these techniques are targeting at text documents or HTML documents, the so-called flat documents, which do not contain complex structures. As a result, the present searching techniques do not serving the purpose of searching XML documents. As XML documents accumulate on the web and in other areas, the traditional searching techniques show its limit. XML information retrieval has been one of the developing directions of seach engine in next generation, and it has great potential in areas such as web information retrieval and digital library.When searching XML documents, we are presented with two difficulties:(1) Data in XML documents are encoded with plenty of tags, which makes a tree structure. XML information retrieval can be performed at a finer granularity, i.e., query results are returned at a document fragment or node unit. Effective XML information retrieval requires a truthful understanding of the nodes (tags) present in documents, e.g., which nodes are more important and which fragments should be returned as results.(2) Information retrieval provides a rather simple interface to access XML data. When querying XML documents which usually have complex structures with simple keyword-based queries, ambiguities often arise. To satisfy users' information need, faithful understanding of the query is very important.Motivated by these observations, we take the view that semantics should be at the core of XML information retrieval. As a result, we adopt semantics-oriented strategies in this thesis. Specifically, we select answer nodes on the basis of the semantics underlying XML documents, capture query intentions through query semantics, and find relevant query results by means of the semantical matches of queries.The follwing problmes are addressed in this thesis.(1) We address the problem of inferring meaningful answer nodes given XML query and document structure. We propose criteria of ideal answer nodes, and guidelines for inferring ideal answer nodes. We investigate the impacts of structural information, textual content and users' query on answer nodes, and propose a method to infer answer nodes which is based on XML node categories and users' query information.(2) We initiate the problem of clustering XML search results, and propose a novel clustering methodology. The basic idea is to cluster XML search results according to the way results match the given query. This methodology is effective in that it disambiguate user's keyword query, and help to relieve overload of search results. Two approaches are presented to implement the methodology. The fist approach is a conventional one which do clustering in a post stage; the other one is designed based on some key observations of XML keyword search, it is capable of clustering on-the-fly, thus is much more efficient. We also investigate the ranking and top-k problem in a clustered environment, including the ranking of clusters, ranking of results in each cluster and top-k evaluation of clusters.(3) We also research into the problem of effective XML CAS (Content and Structure) retrieval. We propose a novel approach for XML CAS retrieval. The approach adopts a content-oriented point of view. Specifically, the approach first decomposes a CAS query into several fragments, then retrieves results for each query fragment in a content-centric way, and finally scores each answer node. The approach is adaptive to versatile homogeneous and heterogeneous data environments. To assess the relevance of retrieval results to a query fragment, we present a scoring strategy that measures relevance from both content and structure perspectives. An efficient algorithm is also presented for CAS retrieval.(4) We study on the index structures in XML information retrieval. We analyze the requirements of index structures for XML information retrieval, which is used as the objectives of our design of index structures. We propose a content-aware index mechanism which integrates XML structural index and inverted index. It supports XML search at any granularity, and can be used in structure-oriented or content-oriented XML queries. It also supports reconstructing and extracting XML search results.The contribution of this thesis can be summarized as follows.(1) We propose a novel semantics for answers to XML keyword query, and an effective method to infer answer nodes for XML information retrieval. The method analyzes the categories of nodes and the relationships of nodes with queries. It employs the structural information, content information embeded in XML documents and users' query to aid the inferrence of answer nodes. Experimental results show that, the proposed method yields more useful and meaningful answer nodes compared with the present methods.(2) We propose a new clustering methodology which clusters results based on the way an answer matches the given query. We use matching pattern to capture the relationship between an answer and a query, and formalize the methodology based on the concept of matching pattern. We present two approaches to implement the methodology. The fist approach is a conventional one which does clustering after search results generation, and the second approach is capable of clustering on-the-fly, thus is much more efficient. We study the problem of ranking clusters and propose an effective ranking scheme. Our ranking scheme, together with the proposed clustering approach, is able to generate top-k clusters for XML keyword search efficiently. We conduct an extensive set of experiments. Experimental results verify the proposed clustering methodology and approaches, the ranking methods are also proven to be effective.(3) We propose an effective method for XML CAS retrieval. The method has several desired features:It is flexible and adaptive to homogeneous or heterogeneous XML collections; It guarantees good search quality in terms of precision and recall; It is simple enough and efficient, and no relaxation is required. We propose a scoring strategy, which integrates structural relevance with content relevance effectively and suits CAS retrieval perfectly. We design an algorithm to retrieve results for CAS queries efficiently. Extensive experiments are conducted to investigate the effectiveness of the proposed methods, and experimental results verify our techniques.Through the study in this thesis, we have got some deep insights towards XML information retrieval and have made some important research progress on XML information retrieval. These insights and findings will help promote the research on XML information retrieval and lay a solid foundation for future research on XML.
Keywords/Search Tags:XML, search, structures and contents, answer nodes, search results clustering, index
PDF Full Text Request
Related items