Design And Implementation Of Distributed Index And Search System Based On Cloud Platform

Posted on:2012-10-06

Degree:Master

Type:Thesis

Country:China

Candidate:J D Yang

Full Text:PDF

GTID:2268330425491586

Subject:Computer software and theory

Abstract/Summary:

With the development of computer technology and the arrival of the Internet era, the amount of information on the Internet is growing explosively. Faced with such volumes of data, indexing time grows linearly with the number of files to be indexed, and under high traffic or with large amounts of index data, search servers cannot process requests within a limited time. Consequently, how to create indexes quickly and how to search them efficiently have become crucial issues. In addition, the results returned by current search engines (such as Google and Baidu) contain only Web page data and do not include structured data. Users must therefore open a Web page to find the structured information they need, because the results cannot show detailed information directly, which makes for a poor user experience. Solving these two kinds of problems is extremely important for obtaining information from the Internet.

To solve the above problems, we designed and implemented a distributed index and search system with a layered architecture on a cloud computing platform. First, for the massive volume of data to be indexed, we propose a parallel method that uses Lucene and runs on multiple nodes of a Hadoop cluster to create inverted indexes. Because multiple machines index data simultaneously, this method greatly accelerates indexing. Second, we propose a distributed retrieval method based on Katta that resolves the problem of high traffic and large-scale index files slowing down search. On the one hand, the system caches previous search results at different levels; on a cache hit it returns the results directly, and otherwise it executes the search. On the other hand, the system distributes index files across many nodes of a Katta cluster and stores multiple copies of each index file, and multiple nodes search the index files at the same time, which improves the retrieval speed, reliability, and scalability of the system. Third, we present a way of displaying search results that shows structured data in the form of a tree and shows Web data in the manner of Baidu and Google, to improve the user's query experience.

Finally, through an analysis of Web data, we chose Web pages containing mobile and company information to test the system comprehensively. The experiments and practical application show that the system can quickly create indexes over massive data and respond quickly to queries. Moreover, the query results display structured data in an intuitive way. On the whole, the system also has good scalability and fault tolerance.
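
The two core ideas in the abstract, building inverted indexes in parallel over partitions of the data and answering queries by checking a cache before fanning out to all index shards, can be illustrated with a minimal single-process sketch. It does not use the Lucene, Hadoop, or Katta APIs and is not the thesis's implementation: threads stand in for cluster nodes, index replication is not modeled, and every name in it (`build_partial_index`, `build_shards`, `ShardedSearcher`) is hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def build_partial_index(docs):
    """Build an inverted index (term -> set of doc ids) for one partition."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def build_shards(docs, num_shards):
    """Index the partitions in parallel; each partial index is one shard.

    Assumes num_shards >= 1. Threads here only stand in for cluster nodes.
    """
    partitions = [docs[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(build_partial_index, partitions))


class ShardedSearcher:
    """Answer single-term queries from a cache, else by querying all shards."""

    def __init__(self, shards, cache_size=128):
        self.shards = shards
        # A single in-memory LRU cache; the abstract mentions several
        # cache levels, which this sketch does not attempt to model.
        self._cached = lru_cache(maxsize=cache_size)(self._search_all)

    def _search_all(self, term):
        # Query every shard concurrently and merge the per-shard hits.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            hits = pool.map(lambda shard: shard.get(term, set()), self.shards)
            return frozenset().union(*hits)

    def search(self, term):
        return self._cached(term.lower())

    def cache_hits(self):
        return self._cached.cache_info().hits
```

As a usage example, `ShardedSearcher(build_shards(docs, 2)).search("index")` returns the ids of all documents containing that term, and repeating the same query is served from the cache without touching the shards.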

Keywords/Search Tags:

cloud computing, distributed search, parallel index, Hadoop, Lucene, inverted index

Related items

1	Parallel Search On Ciphertext Based On Index In Cloud Computing
2	Based On Research And Optimization Lucene Inverted Index Performance
3	Design And Implementation Of WEB Of Things Search Engine Based On Hadoop
4	Research On Key Technologies Of Full-text Index Compression In Cloud Environment
5	A Study On Compression Algorithm Performance Based Inverted Index
6	A Research Of Image Retrieval Based On Lucene On The Cloud Computing Platform
7	Research And Implementation Of Inverted Index For Large-scale Visual Search
8	Study On Hadoop-based Inverted Index
9	Research And Application Of Sorting Algorithm Based On Lucene
10	Research And Implementation Of Index Technology In Domain-specific Search Engine