Research On The XML Pseudo Relevance Feedback Technology Based On Clustering Search Results

Posted on: 2013-11-23; Degree: Doctor; Type: Dissertation
Country: China; Candidate: M J Zhong
GTID: 1228330395973044; Subject: Management Science and Engineering
Abstract/Summary:
With the continuous development of XML applications, Extensible Markup Language (XML) has become the de facto standard for data representation and exchange on the World Wide Web. Because XML data have an irregular and variable structure, retrieving information from them effectively is an active research frontier in the database and information retrieval fields.

At present, database-style query processing techniques are well developed and have achieved a great deal, whereas XML information retrieval based on fuzzy queries remains unsatisfactory. The main reason for the low retrieval quality is that users find it difficult to describe their query intention. In particular, because XML documents are structured, expressing a query intention involves not only keywords but also structural information, which is very difficult for ordinary users because it requires understanding the DTD of the XML documents. As a method for improving retrieval quality, feedback technology is introduced to help users overcome the problem of translating an information need into a query and to formulate a better query expression. Pseudo relevance feedback (PRF) has been shown to be, in general, an effective query expansion technique for improving retrieval effectiveness. The assumption underlying PRF is that the top-ranked documents are relevant to the query. The work in this thesis centers on the "query topic drift" problem in PRF.

PRF raises two problems:

(1) How to identify relevant documents without any relevance information. Many studies have shown that the basic assumption of PRF does not always hold and that not all top-ranked documents are actually relevant. The noise introduced by non-relevant documents can cause the expanded query to drift away from the original topic and retrieval performance to decrease.
Therefore, the first problem is to identify the set of relevant documents within the original query results.

(2) How to perform query expansion based on the relevance information obtained. For XML documents, expansion involves not only keywords but also structural information.

The following problems are addressed in this thesis.

(1) We introduce the problem of clustering XML search results. The clustering process has two aspects: how to make full use of the features of XML documents to measure the similarity between them, and which clustering algorithm to adopt to obtain high-quality clusters. In this thesis, similarity measurement is studied at two granularity levels of an XML document. First, at the whole-document granularity, a similarity measure combining content and structure semantics (CASS) is proposed. Then, building on CASS, a semantic similarity measure (LSI-CASS) is proposed that makes full use of the latent semantic indexing (LSI) model at the element-node granularity. We also explore the optimal partitioning of the clustering algorithm and propose a k-medoid clustering algorithm based on optimized initial center points and an evaluation function, in which the optimal number of clusters is determined automatically.

(2) We also study effective document ranking models based on the clustered XML search results. After clustering, the documents related to the query have, to a certain extent, been grouped together; the next key problem is how to select the relevant clusters and then how to rank the documents or fragments in them effectively. For the two granularity levels above, we study a ranking model for candidate clusters and a ranking model for the documents or fragments within the candidate clusters. First, a candidate cluster ranking model based on the cluster center is proposed.
Then, drawing on the structural features of XML documents, a series of ranking metrics is presented for ranking documents or fragments. Finally, a good set of XML pseudo-relevant documents or fragments is formed through the two ranking models.

(3) We study XML query expansion. In this thesis, an XML query expansion method based on PRF is presented. On the one hand, a keyword expansion method is explored: it computes term weights that take structure into account and selects the terms with high weights as expansion terms, which improves retrieval performance. On the other hand, a full-fledged content-and-structure query expression is formalized, using a structure expansion method based on the maximal semantic weight of tags.

The contributions of this thesis can be summarized as follows.

(1) We propose a technical route for XML pseudo-feedback based on clustering search results. At present there are few research findings on XML pseudo-feedback, domestically or internationally, and fewer still on clustering XML search results. By making full use of the features of clustering, this route effectively addresses the low quality of expansion sources in traditional PRF. First, in the candidate cluster ranking model, we use cluster labels based on equalized weight values to select several reasonable clusters that are relevant to the query. Second, in contrast to traditional ranking methods, we take full account of clustering features, such as the similarity between documents and clusters and the rank of the candidate cluster, to rank documents effectively.
The experimental results indicate that the method is effective and show that clustering search results is beneficial for obtaining high-quality XML pseudo-relevant documents.

(2) For clustering XML search results, this thesis proposes an extended vector space model with structure semantics and puts forward a document similarity measure that integrates content and structure semantics.

The measure has the following features. On the one hand, it is mainly content-based, with structural constraints as a supplement, and thus integrates the content and structure of XML documents. This differs from existing XML document similarity measures, which completely separate content features from structure features and perform poorly, especially on homogeneous corpora. On the other hand, the term weight computation considers not only traditional term frequency but also features that reflect structure semantics, such as the semantics and level information of tags. Using these features avoids the limitation of having users specify parameters in advance and hence yields better flexibility and generality.

(3) For the first time, we study XML document clustering at the element-node granularity. We put forward an XML similarity measure, LSI-CASS, that combines content with structure semantics on the basis of lexical semantics. Unlike previous XML similarity measures, the method gives full consideration to lexical semantics and obtains the core concepts of document content by using latent semantic indexing. The experimental results on a homogeneous corpus show that LSI-CASS produces better clustering quality than other methods.
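The overall route described in the abstract (cluster the initial results, select the cluster whose center is closest to the query, expand the query from that cluster) can be illustrated with a minimal Python sketch. It is not the method of the thesis: plain term-frequency vectors and cosine similarity stand in for CASS and LSI-CASS, a basic k-medoid loop with a fixed k and naive initial medoids stands in for the optimized algorithm, and every name here (`tf_vector`, `k_medoids`, `expand_query`, the toy documents) is hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies (assumed stand-in for structure-aware weights)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def assign(vectors, medoids):
    """Attach every document index to its most similar medoid."""
    clusters = {m: [] for m in medoids}
    for i, v in enumerate(vectors):
        clusters[max(medoids, key=lambda m: cosine(v, vectors[m]))].append(i)
    return clusters


def k_medoids(vectors, k, max_iter=20):
    """Basic k-medoid clustering; returns {medoid index: member indices}."""
    medoids = list(range(k))  # naive initialisation: the first k documents
    for _ in range(max_iter):
        clusters = assign(vectors, medoids)
        # New medoid of a cluster: the member with the highest total similarity to the rest.
        new_medoids = [
            max(members, key=lambda c: sum(cosine(vectors[c], vectors[j]) for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(vectors, medoids)


def expand_query(query, docs, k=2, n_terms=2):
    """Expand the query with the heaviest terms of the cluster nearest to it."""
    vectors = [tf_vector(d) for d in docs]
    q = tf_vector(query)
    clusters = k_medoids(vectors, k)
    # Candidate cluster: the one whose medoid (cluster center) best matches the query.
    best = max(clusters, key=lambda m: cosine(q, vectors[m]))
    weights = Counter()
    for i in clusters[best]:
        weights.update(vectors[i])
    # Rank unseen terms by weight, breaking ties alphabetically for determinism.
    candidates = sorted((t for t in weights if t not in q), key=lambda t: (-weights[t], t))
    return query.split() + candidates[:n_terms]


# Toy "initial search results": three on-topic documents and two noisy ones.
docs = [
    "xml retrieval feedback structure",
    "football match score goal",
    "xml retrieval feedback clustering",
    "football league goal season",
    "xml feedback clustering element",
]
print(expand_query("xml retrieval", docs))
```

In this toy run the two off-topic documents fall into their own cluster, so their terms never reach the expanded query; that is the intuition behind clustering before feedback, although the thesis's own similarity measures and ranking models are considerably richer than this sketch.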
Keywords/Search Tags: XML, clustering search results, contents and structures, cluster labels, ranking