Research on Key Problems in Distributed Information Retrieval

Posted on: 2013-10-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C He
GTID: 1228330374499558
Subject: Signal and Information Processing
Abstract/Summary:
Distributed information retrieval is an important research field within information retrieval, and it has been drawing more and more attention in practice. In some circumstances, such as aggregated search and cross-language search, it is indispensable. Some papers also report that, under certain conditions, a distributed information retrieval system achieves better search accuracy than an ad hoc retrieval system. Distributed information retrieval is a technique for searching multiple databases at the same time. Usually, a query is submitted to the subset of databases that are most likely to return relevant answers. The results returned by the selected databases are merged into a single list and sent back to the user.

There are three major challenges in distributed information retrieval. First, the system needs to acquire knowledge about the contents of each database (database description). Second, for a given query, the subset of databases most likely to return relevant documents must be chosen from all the databases, which may be numerous (database selection). Third, the results returned from the selected databases must be compared and merged into a single list (results merging). This dissertation discusses and studies these problems in detail. Our main contributions are as follows.

First, for database description, we verified the reliability, stability and necessity of query-based sampling in a Chinese-language environment. The focus is query-based sampling in uncooperative environments. Previous work was designed for English corpora, and no research has reported whether query-based sampling works for Chinese corpora. After examining the algorithm closely, we conducted a series of experiments on a Chinese dataset. Effectiveness and robustness were tested and evaluated in three respects: description accuracy, selection accuracy and retrieval accuracy.
The experimental results show that it works effectively and stably in a Chinese-language environment, particularly in the description experiments.

Second, for database selection, we proposed discriminative-model-based selection methods and topic-clustering-based selection methods, and tested their effectiveness. Many papers address the database selection problem, and their methods can be divided into several classes: term-frequency-based methods, document-based methods, classification/clustering-based methods, and others. Distinguishing between discriminative models and generative models, our work covers two kinds of methods. The discriminative-model-based selection methods consider the information between databases, while the topic-clustering-based selection methods consider the semantics of databases. We discuss the former only theoretically and focus on the latter. The topic-clustering-based methods not only consider the documents in databases but also introduce and model the contents of databases, so the resulting model can be interpreted well. We also provide a unified formulation in terms of probabilistic graphical models. The experiments show that the performance of our new methods is competitive on the standard datasets.

Lastly, for results merging, we proposed a weighted curve fitting algorithm and demonstrated that its improvement over existing algorithms is remarkable and stable. The CORI merging algorithm, SSL (Semi-Supervised Learning) and SAFE (Sample-Agglomerate Fitting Estimate) are the classic algorithms for the results merging problem. SSL addressed the instability of the CORI merging algorithm in uncooperative environments, while SAFE addressed the lack of document samples in SSL. However, SAFE has two shortcomings when it uses document samples for regression: it overlooks the importance of top-ranked documents, and it treats documents with estimated ranks as being as important as documents with true ranks.
To overcome these disadvantages of SAFE, we propose the Weighted Curve Fitting (WCF) method. A range of experiments shows that, compared with SAFE, the improvement of WCF is significant and robust. We also give the optimal combinations of parameters under given conditions.
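
The idea of weighting the regression samples can be illustrated with a minimal sketch. It is not the dissertation's actual WCF formulation, which the abstract does not specify. The sketch assumes a curve of the form score = a + b * ln(rank), a weight that decays with rank, and a fixed discount for documents whose ranks were estimated rather than observed. The names `SamplePoint`, `point_weight`, `estimated_discount`, `fit_weighted_log_curve` and `merge_results` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SamplePoint:
    rank: int        # rank of the sampled document in the database's list (1 = top)
    score: float     # score of the same document in the centralized sample index
    estimated: bool  # True if the rank was estimated rather than observed


def point_weight(p, estimated_discount=0.5):
    """Assumed weighting: top ranks count more, estimated ranks count less."""
    w = 1.0 / math.log2(p.rank + 1)
    return w * estimated_discount if p.estimated else w


def fit_weighted_log_curve(points, estimated_discount=0.5):
    """Weighted least-squares fit of score = a + b * ln(rank)."""
    ws = [point_weight(p, estimated_discount) for p in points]
    xs = [math.log(p.rank) for p in points]
    ys = [p.score for p in points]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of ln(rank)
    my = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of score
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        raise ValueError("need sample points with at least two distinct ranks")
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def merge_results(results_by_db, curves):
    """Score every returned document with its database's curve, then sort.

    results_by_db maps a database name to its document ids in rank order;
    curves maps a database name to the fitted (a, b) pair.
    """
    merged = []
    for db, docs in results_by_db.items():
        a, b = curves[db]
        for rank, doc in enumerate(docs, start=1):
            merged.append((a + b * math.log(rank), db, doc))
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged
```

In this sketch, one curve would be fitted per selected database from its sampled documents, and `merge_results` then places all returned documents on a common score scale. Setting `estimated_discount` to 1.0 and replacing the rank-based weight with a constant would treat every sample equally, which is the behavior the abstract criticizes.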
Keywords/Search Tags:distributed information retrieval, information retrieval, database description, database selection, results merging