
Improving Resource Selection And Result Merging In An Uncooperative Search Environment

Posted on: 2016-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: BENJAMIN GHANSAH
GTID: 1108330482459883
Subject: Computer Science and Applications
Abstract/Summary:
Most general-purpose search engines, such as Google, Bing and Yahoo!, serve as a medium through which information seekers search the large amounts of digital information distributed over the web. The process involves acquiring, storing and processing all available information locally (the centralized retrieval model). This technique is suboptimal in terms of cost, time, space and coverage. Moreover, a large amount of textual information cannot be freely accessed by general-purpose search engines. This concealed information, which is very valuable, can only be accessed via search models other than the centralized retrieval model implemented by general-purpose search engines.
Distributed Information Retrieval (DIR), also known as Federated Search, is a powerful way of searching distributed data concurrently and bringing multiple searchable sources of information together within a single interface. DIR also provides access to the concealed information, because that single interface connects to multiple information sources. There are three main research areas in DIR. First, information about the contents of each available information source must be obtained (resource representation). Second, given a user query, a subset of the available sources must be selected for searching (resource selection). Third, the results retrieved from the selected sources may be combined into a unified merged list and presented to the end user (results merging).
This dissertation deals with two of the main issues in designing and implementing an efficient DIR system: source selection and result merging. The proposed algorithms are designed to function effectively both in environments where information sources are cooperative and in environments where they are not.
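The three DIR phases named above can be illustrated with a toy broker. This is only a minimal sketch under simplifying assumptions and is not the method proposed in the dissertation: the function names are hypothetical, sources are represented by plain term counts, selection counts query-term occurrences, and merging assumes (unrealistically) that scores from different sources are directly comparable.

```python
from collections import Counter


def represent(sample_docs):
    """Resource representation: term counts over documents sampled from a source."""
    counts = Counter()
    for doc in sample_docs:
        counts.update(doc.lower().split())
    return counts


def select_sources(query, representations, k):
    """Resource selection: keep the k sources whose samples mention the query terms most."""
    terms = query.lower().split()
    scores = {
        name: sum(rep[term] for term in terms)
        for name, rep in representations.items()
    }
    # Sort by descending score; break ties by name so the output is deterministic.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


def merge_results(result_lists):
    """Results merging: one list ordered by score (assumes comparable scores)."""
    merged = [
        (score, source, doc)
        for source, results in result_lists.items()
        for doc, score in results
    ]
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(source, doc) for _, source, doc in merged]
```

In an uncooperative environment the sampled documents would typically have to be obtained by querying each source, and the raw scores would usually not be comparable, which is why selection and merging are research problems in their own right.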
The novel source selection algorithm provides a more effective way of selecting information resources: rather than selecting resources that produce only relevant results, as in prior studies, it selects resources that produce results that are both relevant and diversified, to meet a user's information needs. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art approaches that consider only relevance.
The second novel algorithm in the area of resource selection aims to improve the resource selection process by identifying and removing a duplicate collection pair in a DIR environment, so that a novel collection that would otherwise have no place among the selected resources during a typical resource selection phase can be included; diversification is achieved as a side effect. In this regard, a duplicate collection pair with a minimum size is removed. This is achieved by combining a fingerprint method with a cosine similarity algorithm.
The novel result merging algorithm is based on the supposition that evidence available to the broker, other than the rank evidence (from the centralized sample database) used by conventional results merging techniques, could better inform the merging decision. The approach combines multiple sources of evidence to inform that decision. We use the Boosting Tree-Based Method to learn a function that merges results based on readily available information, i.e. the ranks, titles, summaries, URLs and click-through data found in the results pages. We combine this evidence by treating result merging as a multiclass machine learning problem. Experimental results demonstrate that the proposed technique achieves significant performance gains over state-of-the-art approaches.
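The duplicate-collection step described above combines a fingerprint method with cosine similarity. The sketch below shows one way such a combination could look; it is an illustration under stated assumptions, not the dissertation's algorithm. The function names, the use of SHA-256 document hashes as fingerprints, both thresholds, and the reading of "minimum size" as "drop the smaller collection of a duplicate pair" are all assumptions.

```python
import hashlib
import math
from collections import Counter


def fingerprint(docs):
    """Set of SHA-256 hashes of whitespace-normalised, lower-cased documents."""
    return {
        hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        for doc in docs
    }


def term_vector(docs):
    """Bag-of-words term counts over all documents in a collection sample."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drop_duplicates(collections, overlap_threshold=0.5, cosine_threshold=0.8):
    """Remove the smaller collection of every pair judged to be duplicates.

    A pair counts as duplicate (an assumption of this sketch) when both the
    fingerprint overlap and the cosine similarity reach their thresholds.
    """
    kept = dict(collections)
    names = sorted(collections)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first not in kept or second not in kept:
                continue  # one side was already removed as a duplicate
            fp_a, fp_b = fingerprint(kept[first]), fingerprint(kept[second])
            overlap = len(fp_a & fp_b) / max(1, min(len(fp_a), len(fp_b)))
            similarity = cosine(term_vector(kept[first]), term_vector(kept[second]))
            if overlap >= overlap_threshold and similarity >= cosine_threshold:
                smaller = min((first, second), key=lambda n: (len(kept[n]), n))
                del kept[smaller]
    return kept
```

Removing one member of a duplicate pair frees a slot in the selected set, which is how a previously excluded collection can enter the selection.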
Keywords/Search Tags: Information Retrieval, Distributed Information Retrieval, Mean Variance analysis, Resource Selection, Results Merging, Search result diversification