Design And Implementation Of A Full-Text Retrieval System Based On MapReduce

Posted on: 2015-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Tao
Full Text: PDF
GTID: 2308330482460185
Subject: Software engineering
Abstract/Summary:
In the 1990s, computer networks, represented by the Internet, were an advanced technology used by a small number of researchers. Soon, however, together with computer hardware, they entered the lives of ordinary people with astonishing speed. Meanwhile, the volume of data generated by computers kept growing, and finding useful information in large amounts of data became a task for researchers. A number of general-purpose search engines emerged in this process, such as Google, Yahoo, and Baidu. As search engines developed, they came to be controlled by a handful of companies, and general-purpose search engines often cannot meet the needs of users inside an institution.
In this thesis, the distributed framework Apache Hadoop and the full-text search library Lucene are used to design a distributed search engine for a relatively large-scale LAN. The project is divided into three parts. The first part indexes the source files and stores the index files in HDFS. A classification strategy is proposed, and files are indexed and stored separately according to it, which solves the storage problem of a stand-alone environment while preserving the category information. In the second part, based on the keywords submitted by the user, the search engine searches the index files created in the first part and returns the results to the user. This thesis proposes using remote procedure calls to obtain, in a distributed environment, the relevance factors used by Lucene's scoring formula. In the third part, users' historical results are used to provide a quick retrieval service.
Drawing on the idea of the operating system cache and taking advantage of users' retrieval history, the thesis proposes a quick retrieval scheme. The scheme focuses on frequently accessed files and indexes them separately from the others. When a user selects quick retrieval, the index files of the frequently accessed files are searched.
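The quick retrieval idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `select_hot_files` and `partition_for_indexing`, the flat list used as a retrieval history, and the fixed `top_n` cutoff are all assumptions made for illustration. The sketch counts how often each file appears in the history and splits the file set into a frequently accessed group and the rest, so that each group could be indexed separately.

```python
from collections import Counter


def select_hot_files(access_log, top_n):
    """Return the top_n most frequently accessed file paths.

    access_log is a hypothetical retrieval history: one entry per time
    a file appeared in a result the user opened.
    """
    counts = Counter(access_log)
    # most_common sorts by count, highest first.
    return [path for path, _ in counts.most_common(top_n)]


def partition_for_indexing(all_files, hot_files):
    """Split files into (quick, rest) so each group can get its own index."""
    hot = set(hot_files)
    quick = [f for f in all_files if f in hot]
    rest = [f for f in all_files if f not in hot]
    return quick, rest


history = ["a.txt", "b.txt", "a.txt", "c.txt", "a.txt", "b.txt"]
hot = select_hot_files(history, top_n=2)
quick, rest = partition_for_indexing(["a.txt", "b.txt", "c.txt", "d.txt"], hot)
```

A fixed `top_n` is only one possible policy; a frequency threshold or a time-decayed count would fit the same cache analogy.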
The system supports both global full-text retrieval and quick retrieval, and thus provides the basic functions of a search engine in a distributed environment. The system has been put into practical use.
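The distributed scoring problem mentioned in the abstract can be sketched as follows. The abstract does not name the relevance factors it fetches by remote procedure call; this sketch assumes they are the per-term document frequency and the document count, and it assumes an inverse document frequency of the form 1 + ln(N / (df + 1)), as in Lucene's older default TF-IDF similarity. The class `ShardStats` and its methods are hypothetical stand-ins: plain method calls take the place of real RPCs.

```python
import math


class ShardStats:
    """Hypothetical stand-in for one index shard.

    In a real deployment the two getters would be reached through a
    remote procedure call; here they are ordinary method calls.
    """

    def __init__(self, doc_count, doc_freqs):
        self.doc_count = doc_count
        self.doc_freqs = doc_freqs

    def get_doc_count(self):
        return self.doc_count

    def get_doc_freq(self, term):
        return self.doc_freqs.get(term, 0)


def global_idf(term, shards):
    """Combine per-shard statistics into one collection-wide idf.

    Assumes the shards together hold at least one document.
    """
    total_docs = sum(s.get_doc_count() for s in shards)
    total_df = sum(s.get_doc_freq(term) for s in shards)
    return 1.0 + math.log(total_docs / (total_df + 1))


shards = [ShardStats(100, {"hadoop": 4}), ShardStats(300, {"hadoop": 35})]
idf = global_idf("hadoop", shards)
```

Summing the statistics before computing the idf makes scores from different shards comparable; scoring each shard with only its local counts would rank the same term differently depending on where a document happens to be stored.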
Keywords/Search Tags: MapReduce, Full Text Retrieval, Category Based Index, Lucene, Retrieval Efficiency