Distributed Retrieval System With Webpage Ranking Improvement Based On Lucene

Posted on:2015-12-17

Degree:Master

Type:Thesis

Country:China

Candidate:D F Zhang

Full Text:PDF

GTID:2308330464464626

Subject:Computer technology

Abstract/Summary:

The volume of information on the Internet keeps growing, and retrieving the target information from this sea of data quickly and accurately is a major challenge for search engines. Addressing it calls for a large-scale search engine cluster that performs distributed, parallel search, and technologies such as Hadoop and Spark have emerged to support this. At the same time, accurate screening and filtering of information is essential, which requires a sound mechanism for evaluating information. Against this background, this thesis analyzes the mechanisms and implementation details of search engines in depth, introduces Hadoop and the open-source toolkit Lucene, and sets out to build a high-performance search engine. First, to filter and deduplicate URLs efficiently during information collection, it uses the embedded database Berkeley DB to record the queues of links that have already been processed. Because Berkeley DB runs in the same address space as the calling program, access to the database is fast. Second, it updates the index in append mode: newly added data does not trigger a rebuild of the entire index but is written to a separate index file, and each such file can be accessed independently. A merge is triggered when the number of additional index files reaches a threshold; the threshold value is chosen through experiments, which improves the efficiency of index construction. Third, it analyzes the advantages and disadvantages of several existing webpage scoring algorithms and, building on Lucene’s built-in scoring algorithm, proposes a new algorithm called the "term frequency position weighting and document freshness scoring algorithm".
The new algorithm considers not only the frequency of the query keywords in a web page but also the keywords’ positions in the page and the page’s freshness, which gives a more comprehensive evaluation of webpages. Finally, this thesis builds single-node search subsystems on the open-source Java toolkit Lucene. Each subsystem contains all the components of a complete search engine and can provide search service independently. On this basis, a small cluster of three single-node subsystems is built with Hadoop, providing distributed, redundant data storage and efficient parallel index construction. The improved algorithm is applied to the distributed system, and experiments show that it performs better than Lucene’s built-in algorithm.
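
The abstract does not give the scoring formula, so the following Python sketch only illustrates the general idea of combining position-weighted term frequency with page freshness. Everything specific in it is an assumption made for illustration: the field names and their weights (`POSITION_WEIGHTS`), the exponential freshness decay with `HALF_LIFE_DAYS`, the `freshness_weight` parameter, and the way the two parts are multiplied together. The square-root dampening of term frequency follows what Lucene's classic TF-IDF similarity does for raw term frequency; the rest is not taken from the thesis.

```python
import math
from datetime import datetime, timedelta

# Hypothetical position weights; the abstract does not state actual values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}
# Assumed half-life of the freshness signal, in days.
HALF_LIFE_DAYS = 180.0


def weighted_term_frequency(term, fields):
    """Count `term` in each field and scale each count by the field's position weight."""
    term = term.lower()
    total = 0.0
    for name, text in fields.items():
        count = text.lower().split().count(term)
        total += POSITION_WEIGHTS.get(name, 1.0) * count
    return total


def freshness(last_modified, now):
    """Exponential decay in (0, 1]: 1.0 for a page modified now, 0.5 after one half-life."""
    age_days = max((now - last_modified).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def score(query_terms, fields, last_modified, now, freshness_weight=0.3):
    """Dampened position-weighted term frequency, boosted by a freshness bonus."""
    relevance = sum(math.sqrt(weighted_term_frequency(t, fields)) for t in query_terms)
    if relevance == 0.0:
        return 0.0  # a page that matches no query term is never boosted
    return relevance * (1.0 + freshness_weight * freshness(last_modified, now))


# Usage: two pages with identical text, one modified a day ago, one two years ago.
now = datetime(2015, 12, 17)
page = {"title": "Lucene index optimization", "body": "lucene merges index segments"}
print(score(["lucene", "index"], page, now - timedelta(days=1), now))
print(score(["lucene", "index"], page, now - timedelta(days=730), now))
```

With these assumed weights, a keyword in the title contributes more than the same keyword in the body, and of two pages with the same text the more recently modified one scores higher. Making freshness a multiplicative bonus rather than an additive term is a design choice of this sketch: it keeps stale but highly relevant pages above fresh pages that barely match.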

Keywords/Search Tags:

Index Optimization, Lucene, Hadoop, Webpage Scoring

Related items

1	Design And Implementation Of University Digital Library System Based On Hadoop
2	Design And Implementation Of WEB Of Things Search Engine Based On Hadoop
3	Based On Research And Optimization Lucene Inverted Index Performance
4	Design And Implementation Of Distributed Index And Search System Based On Cloud Platform
5	The Design And Implementation Of A CBIR System Based On Hadoop And Lucene
6	Research And Application Of Sorting Algorithm Based On Lucene
7	Organization Entity Information Extractor From Webpage Based On CRF
8	The Research And Implementation Of The Searching System Based On Special Informations In The Internet Environment
9	Research On Information Extraction And Full Text Retrieval Of Crop Diseases Articles
10	A Research Of Image Retrieval Based On Lucene On The Cloud Computing Platform