Research On Key Technology Of Full-Text Retrieval Based On Distributed Computing

Posted on: 2015-11-29 | Degree: Master | Type: Thesis
Country: China | Candidate: J R Guo | Full Text: PDF
GTID: 2298330467463779 | Subject: Computer technology
Abstract/Summary:
The rapid spread of information networks, and especially the arrival of the big data era, has led to explosive growth of heterogeneous, unstructured data on the Internet. Search engine technology gives people a good way to retrieve useful information from this mass of data quickly and effectively. Full-text retrieval is the most critical technology in a search engine application, and it consists of two main processes: index creation and index querying. Traditional centralized full-text retrieval falls far short of the demands of the fast-growing volume of Web pages. With the rapid advance of cloud storage platforms and distributed computing technology, distributed full-text retrieval is becoming an important technology in modern information retrieval, while addressing various problems of centralized retrieval.

This paper first introduces the background and development status of distributed full-text retrieval. It then describes and analyzes the key technologies involved in the construction, partitioning and organization, and querying of a distributed full-text index. On this basis, it proposes solutions to several key issues in distributed full-text retrieval and verifies the validity of the methods through experiments.

The content of this paper falls into the following three aspects:

1. Full-text indexing is a key factor in the overall construction of a full-text retrieval system. After analyzing the shortcomings of the single, centralized approach to index building, this paper proposes a parallel index construction method based on the MapReduce distributed computing framework and implements it with Lucene indexes. We built a Hadoop cluster of four machines and built the index with doubled efficiency, verifying the effectiveness of the method.

2. The partitioning and organization of a distributed index determine the load balance across the distributed system. This paper analyzes and compares the two current mainstream partitioning methods, term-based partitioning and document-based partitioning. Because both organize the index without regard to topical correlation, we propose, after reviewing the relevant literature, an index partitioning method based on text clustering. This method lays a good foundation for document collection selection and distributed retrieval in the later stages.

3. Because the number of distributed index databases is large, the retrieval process needs to select a subset of highly relevant document collections. Classic collection selection strategies such as CORI and CRCS exist, but they generally lack semantic support. This paper presents a collection selection strategy for distributed information retrieval based on similarity calculation over terms, and verifies its effectiveness by a high recall rate.
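
To make the first aspect more concrete, the following is a minimal single-process sketch of MapReduce-style inverted index construction. It is an illustration only, not the thesis's Hadoop and Lucene implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`), the whitespace tokenizer, and the sample documents are all assumptions introduced here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token in a document.

    Hypothetical tokenizer: lowercase and split on whitespace.
    """
    for term in text.lower().split():
        yield term, doc_id


def shuffle(pairs):
    """Group emitted pairs by term, as a MapReduce framework does
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: merge the postings of one term into a sorted,
    de-duplicated list of document ids."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of
    {doc_id: text} and return {term: [doc_id, ...]}."""
    pairs = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in docs.items()
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample documents, for illustration only.
docs = {
    1: "distributed full text retrieval",
    2: "full text index",
    3: "distributed index",
}
index = build_index(docs)
```

In a real cluster the map calls would run in parallel on different machines, each over its own share of the documents, and the reduce calls would run in parallel per term. The sketch keeps only that data flow and omits scheduling, storage, and the Lucene index format.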
Keywords/Search Tags: distributed index, text clustering, term similarity, resource selection