
Construction Of Kernels For Text Similarity Detection And Application In Distributed Information Retrieval

Posted on: 2013-04-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X H Wang
GTID: 1228330395453673
Subject: Systems Engineering
Abstract/Summary:
With the rapid growth of the Internet, digital libraries and other information sources, data items with heterogeneous structures are spread across nodes all over the world. The connections among these nodes form distributed information systems. Quickly presenting what a user needs from this "information ocean" at lower cost and with higher precision and recall is a challenging problem. Within the information retrieval community, the problem of retrieving data items from a set of collections or databases (DBs) distributed across different servers is referred to as distributed information retrieval (DIR); it is the branch of information retrieval that focuses on distributed, heterogeneous information systems. Collection selection and result merging are the two main sub-problems in DIR.

Text similarity computation measures or compares the similarity between two given texts. It is an important issue in linguistics, psychology and information theory, and a basic issue in information retrieval, data mining, knowledge management, artificial intelligence and related fields. It is also a basic technology in natural language processing, as well as in copy detection, novelty detection and information filtering. Improving the precision and recall of text similarity computation is therefore a key issue. This work focuses on how to retrieve similar texts in DIR as quickly as possible and with the highest possible precision and recall. The main work includes:

(1) We proposed a resource selection method for DIR based on set covering. Resource selection, also called server selection, collection selection or database selection, is a foundational problem in DIR.
This work introduces a set-covering-based algorithm for resource selection in DIR that takes the extent of overlap between resources into account. Each document is given a weight according to its position in the merged results for a query Q. When later resources are scored, only results that have not already appeared in an earlier selected resource are counted. The score of each resource is the total weight of the merged results it contains, and in each selection step only the resource with the maximum score is selected. The selection order is therefore the actual ranking of the selected resources, which are then used to search a query Q’ that is similar to Q. Because it accounts for overlap between databases, the approach saves considerable search time while also improving recall and precision.

(2) Combined Kernel Function and Application to Result Merging in DIR. An improved latent semantic kernel (LSK) was combined with the analysis of variance (ANOVA) kernel to calculate text similarity. To improve result merging in DIR, a new merging method was proposed that is based on the relevance between each retrieved result and the query, with the combined kernel used to compute that relevance. Experimental results showed that the result merging precision of the combination of LSK and ANOVA kernel (CLA) is 16.79%, 30.73%, 20.37%, 24.17%, 14.25%, 13.50% and 7.53% higher than that of Round-robin, ComMNZ, Bayesian, Borda, SDM, MEM and regression SVM, respectively. The CLA kernel method performs better and is a practical method for result merging in DIR.

(3) New Kernel Function Construction and Application to Result Merging in DIR. To improve the detection of similar texts, a novel kernel function named the S_Wang kernel was constructed.
Reflecting the actual conditions of text similarity computation, the S_Wang kernel takes into account both the Euclidean distance and the product between the vectors that represent the text documents being compared. It was proved, according to Mercer's theorem, that the function is a valid kernel function. The performance of the kernels in text document similarity calculation was verified experimentally. The results show that the S_Wang kernel achieves significantly better precision and F1 performance than other kernels such as the Cauchy kernel, the latent semantic kernel (LSK) and the CLA kernel. The S_Wang kernel is suitable for text similarity detection.

(4) Evaluation Methods on Distributed Information Retrieval. Collection selection and result merging are the two major sub-problems in DIR, and computing cost, retrieval precision and retrieval recall are the three main evaluation indexes. This work developed a multi-variable quantitative partial differential equation (PDE) model, inspired by the Laplace equation, that links the collection selection method and the result merging method with the cost, precision and recall indexes. Experiments were then conducted to assess the empirical and practical performance of the model. Experimental results on 50 TREC topics indicate that the multi-variable PDE evaluation model for DIR performs well and is a practical alternative.
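
The greedy selection loop described in item (1) can be illustrated with a minimal sketch. This is not the author's implementation: the abstract does not state the weighting scheme, so reciprocal-rank weights are assumed here, and the function name `select_resources`, the parameter `k`, and the sample resources are all hypothetical.

```python
from typing import Dict, List, Set


def select_resources(
    resources: Dict[str, Set[str]],
    merged_ranking: List[str],
    k: int,
) -> List[str]:
    """Greedy weighted set-cover selection of up to k resources.

    resources: maps a resource name to the set of document ids it holds.
    merged_ranking: merged result list for query Q, best document first.
    """
    # ASSUMPTION: weight by reciprocal rank; the abstract only says the
    # weight depends on the document's position in the merged results.
    weight = {doc: 1.0 / (rank + 1) for rank, doc in enumerate(merged_ranking)}
    uncovered = set(weight)       # merged results not yet covered
    remaining = dict(resources)   # resources not yet selected
    selected: List[str] = []

    while remaining and uncovered and len(selected) < k:
        def gain(name: str) -> float:
            # Count only results that no earlier selected resource contains.
            return sum(weight[d] for d in remaining[name] & uncovered)

        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        if gain(best) == 0:
            break  # nothing new can be covered
        selected.append(best)
        uncovered -= remaining.pop(best)

    return selected


if __name__ == "__main__":
    demo = {"A": {"d1", "d2", "d3"}, "B": {"d1", "d2"}, "C": {"d4"}}
    print(select_resources(demo, ["d1", "d2", "d3", "d4"], k=3))
```

In this toy input, resource B is never selected because everything it holds is already covered by A, which is the kind of redundant search the overlap-aware scoring is meant to avoid. The returned order is the selection order, matching the statement that the selecting order is the rank of the selected resources.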
Keywords/Search Tags: text similarity, kernels, distributed information retrieval, resource selection, result merging