Distributed Retrieval System With Webpage Ranking Improvement Based On Lucene

Posted on: 2015-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: D F Zhang
Full Text: PDF
GTID: 2308330464464626
Subject: Computer technology
Abstract/Summary:
In recent years, the amount of information on the Internet has grown rapidly. Retrieving the target information from this sea of data quickly and accurately poses a great challenge to search engines. To address this problem, a large-scale search engine cluster that implements distributed parallel search for the target information should be constructed, and related technologies such as Hadoop and Spark have emerged. At the same time, accurate screening and filtering of information is very important, which requires a sound information evaluation mechanism.

Against this background, this paper analyzes in depth the relevant mechanisms and implementation details of search engines, and introduces the related Hadoop technologies and the open-source toolkit Lucene, with the aim of building a high-performance search engine. First, to filter and deduplicate URLs efficiently during information collection, this paper adopts the embedded database Berkeley DB to record the queues of links that have already been processed. Because Berkeley DB shares the same memory space as the calling program, access to the database is fast. Second, this paper updates the index in append mode: newly added data does not require rebuilding the entire index but instead generates a separate index file. These separate index files can be accessed independently. A merge operation is triggered when the number of additional index files reaches a threshold. This paper selects the optimal threshold value through experiments, which improves the efficiency of index construction. Third, this paper analyzes the advantages and disadvantages of several existing webpage scoring algorithms. A new algorithm called the "term frequency position weighting and document fresh scoring algorithm" is proposed based on Lucene's built-in algorithm.
The new algorithm considers not only the frequency of the query keywords in the content of a web page but also the keywords' positions within the page and the freshness of the page, which gives a more comprehensive evaluation of webpages.

Finally, this paper builds single-node search subsystems based on the open-source Java toolkit Lucene. Each subsystem contains all the components of a complete search engine and can provide search service independently. On this basis, this paper uses Hadoop to establish a small cluster service system composed of three single-node subsystems, achieving distributed redundant storage of data and efficient parallel index construction. The improved algorithm is applied to the distributed system. Experiments show that the new algorithm performs better than Lucene's built-in one.
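
The abstract does not give the formula of the proposed scoring algorithm, so the following is only a minimal sketch of the general idea it describes: term frequency weighted by where the keyword appears, combined with a freshness factor. Every name and constant here is an assumption made for illustration (`FIELD_WEIGHTS`, the 180-day half-life, the `freshness_share` mix, and the square-root damping of term frequency); none of them is taken from the thesis.

```python
import math
from datetime import datetime, timedelta

# Hypothetical position weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def position_weighted_tf(term, fields):
    """Sum sqrt(term frequency) over the page's fields, scaled by each field's weight.

    Tokenization is a plain whitespace split, so punctuation is not handled.
    """
    total = 0.0
    for name, text in fields.items():
        tf = text.lower().split().count(term.lower())
        total += FIELD_WEIGHTS.get(name, 1.0) * math.sqrt(tf)
    return total


def freshness(last_modified, now, half_life_days=180.0):
    """Exponential decay: 1.0 for a page modified now, 0.5 after one half-life."""
    age_days = max((now - last_modified).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def score(query_terms, fields, last_modified, now, freshness_share=0.3):
    """Relevance from position-weighted term frequency, scaled by a freshness mix."""
    relevance = sum(position_weighted_tf(t, fields) for t in query_terms)
    mix = (1.0 - freshness_share) + freshness_share * freshness(last_modified, now)
    return relevance * mix


now = datetime(2015, 12, 17)
in_title = {"title": "Lucene search", "body": "an introduction"}
in_body = {"title": "Search basics", "body": "lucene is a toolkit"}
old = now - timedelta(days=180)

title_score = score(["lucene"], in_title, now, now)
body_score = score(["lucene"], in_body, now, now)
stale_score = score(["lucene"], in_title, old, now)
```

Under these assumed weights, a page that matches the keyword in its title scores higher than one that matches only in the body, and of two otherwise identical pages the more recently modified one scores higher.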
Keywords/Search Tags: Index Optimization, Lucene, Hadoop, Webpage Scoring