Construction Of Kernels For Text Similarity Detection And Application In Distributed Information Retrieval

Posted on:2013-04-26

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X H Wang

GTID:1228330395453673

Subject:Systems Engineering

Abstract/Summary:

With the rapid growth of the internet, digital libraries and other information sources, data items with heterogeneous structures are spread across nodes all over the world. The connections among these nodes form distributed information systems. Quickly presenting what a user needs from this "information ocean" at lower cost and with higher precision and recall is a challenging problem. Distributed information retrieval (DIR) is the branch of information retrieval that addresses such distributed, heterogeneous information systems: within the information retrieval community, DIR refers to the problem of retrieving data items from a set of collections or databases (DBs) distributed across different servers. Collection selection and result merging are its two main sub-problems.

Text similarity computation measures or compares the similarity between two given texts. It is an important problem in linguistics, psychology and information theory, and a basic one in information retrieval, data mining, knowledge management and artificial intelligence. It is also a core technology in natural language processing and in applications such as copy detection, novelty detection and information filtering. Improving the precision and recall of text similarity computation is therefore a key issue. This dissertation focuses on retrieving similar texts in DIR as quickly as possible while keeping precision and recall as high as possible. The main contributions are as follows.

(1) A resource selection method for DIR based on set covering. Resource selection, also called server selection, collection selection or database selection, is a foundational problem in DIR.
This dissertation introduces a set-covering-based algorithm for resource selection in DIR that takes the extent of overlap between resources into account. Each document is given a weight according to its position in the merged results for a query Q. When later resources are scored, only results that have not already appeared in an earlier selected resource are counted. The score of a resource is the total weight of the merged results it contains, and at each step only the resource with the highest score is selected. The selection order is therefore the ranking of the selected resources, which are then used to search a query Q’ that is similar to query Q. Because it avoids redundant searches over overlapping databases, the approach saves considerable search time while also improving recall and precision.

(2) A combined kernel function and its application to result merging in DIR. An improved latent semantic kernel (LSK) is combined with the analysis of variance (ANOVA) kernel to calculate text similarity. To improve result merging in DIR, a new merging method is proposed that is based on the relevance between the retrieved results and the query, with the combined kernel used to compute that relevance. Experimental results show that the result merging precision of the combined LSK and ANOVA kernel (CLA) is 16.79%, 30.73%, 20.37%, 24.17%, 14.25%, 13.50% and 7.53% higher than that of Round-robin, CombMNZ, Bayesian, Borda, SDM, MEM and regression SVM, respectively. The CLA kernel method thus performs better at result merging and is a practical method for DIR.

(3) Construction of a new kernel function and its application to result merging in DIR. To improve the detection of similar texts, a novel kernel function named the S_Wang kernel was constructed.
Reflecting the practical setting of text similarity computation, the S_Wang kernel is built from both the Euclidean distance and the product between the vectors that represent the text documents being compared. It is proved, according to Mercer's theorem, that the function is a valid kernel function. The performance of the kernels in text document similarity calculation is verified experimentally. The results show that the S_Wang kernel achieves significantly better precision and F1 than other kernels such as the Cauchy kernel, the latent semantic kernel (LSK) and the CLA kernel, which makes it suitable for text similarity detection.

(4) Evaluation methods for distributed information retrieval. Collection selection and result merging are the two major sub-problems in DIR, and computing cost, retrieval precision and retrieval recall are its three main evaluation indexes. This dissertation develops a multi-variable quantitative partial differential equation (PDE) model, inspired by the Laplace equation, that links the collection selection method and the result merging method to the cost, precision and recall indexes. Experiments were then conducted to assess the model's empirical and practical evaluation performance. Results on 50 TREC topics indicate that the multi-variable PDE model performs well for evaluation in DIR and is a practical alternative.
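
The set-covering selection described in contribution (1) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the dissertation's implementation. The abstract does not give the weighting formula, so the 1/rank weights are an assumption, and the names `rank_weights`, `select_resources`, `merged_results` and `resource_docs`, the alphabetical tie-break and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def rank_weights(merged: List[str]) -> Dict[str, float]:
    """Assumed weighting (not from the source): rank r (1-based) gets 1/r."""
    return {doc: 1.0 / rank for rank, doc in enumerate(merged, start=1)}


def select_resources(merged: List[str],
                     holdings: Dict[str, Set[str]],
                     max_resources: int) -> List[str]:
    """Greedily rank resources by the weight of merged results they newly cover."""
    weights = rank_weights(merged)
    uncovered = set(weights)          # merged results no selected resource holds yet
    candidates = dict(holdings)       # resources still available for selection
    order: List[str] = []
    while candidates and uncovered and len(order) < max_resources:
        # Score = total weight of still-uncovered merged results in the resource.
        scores = {name: sum(weights[d] for d in docs & uncovered)
                  for name, docs in candidates.items()}
        # Highest score wins; ties are broken alphabetically (an assumption).
        best = min(scores, key=lambda name: (-scores[name], name))
        if scores[best] == 0.0:
            break                     # nothing new left to gain
        order.append(best)
        uncovered -= candidates.pop(best)
    return order


# Toy data: merged ranking for query Q and the documents each resource holds.
merged_results = ["d1", "d2", "d3", "d4", "d5"]
resource_docs = {
    "A": {"d1", "d2", "d3"},
    "B": {"d2", "d3", "d4"},
    "C": {"d4", "d5"},
}
print(select_resources(merged_results, resource_docs, max_resources=3))
```

On this toy data the sketch selects A first and then C. B is never selected because, once A is chosen, the only new result B adds is d4, which C also covers along with d5. This is the overlap effect the abstract credits for the saved search time.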

Related items

1	Research On Key Technology Of Full-Text Retrieval Based On Distributed Computing
2	Improving Resource Selection And Result Merging In An Uncooperative Search Environment
3	Research On Learning Algorithms For Resource Selection And Results Merging In Distributed Information Retrieval
4	Research On P2P Search Technology In Uncooperative Environments
5	Research On The Distributed Indexing Platform And Information Filter In Distributed Full-text Retrieval System
6	Chinese Text Similarity Research Based On Semantic And Text Structure
7	Research And Application Of Full Text Retrieval Based On Hadoop
8	Query Expansion And Cluster Based Distributed Information Retrieval
9	Study On Information Retrieval Of Quality Internet Public Opinion Monitoring System
10	Research On Several Problems In Text Retrieval