
Improvement And Implementation Of Vertical Search Engine Based On Nutch

Posted on: 2017-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: X H Qin
Full Text: PDF
GTID: 2348330518995267
Subject: Computer technology
Abstract/Summary:
With the growth of network resources and the development of network technology, traditional search engines show several disadvantages, for example messy search results, long update cycles, and query ambiguity. More and more users hope to find what they need in a particular field quickly and efficiently, and general search engines have been unable to meet users' specialized search requirements. At the same time, vertical search engines, which are specialized and efficient, have become a focus of research. Nutch is an open source search engine system with the advantages of high transparency, fair results, flexible configuration, and stable operation. Therefore, this research chooses Nutch as the platform on which to build a vertical search engine. In this paper, we first introduce the research status of vertical search engines, then analyze the basic working principle of search engines and the workflow of Nutch. We analyze the advantages and disadvantages of classic ranking models such as PageRank and HITS to lay the foundation for improving the ranking strategy for search results. After studying the mechanism of Nutch, we put forward a way to improve the ranking algorithm. Because the ranking results do not take topic relevance into account, this paper uses the LDA topic model to extract the center words of a page and then calculates the relevance between the center words and the theme, so as to effectively measure the relevance of a document to the theme. In addition, the PageRank algorithm is improved by adding a topic relevance score to make it suitable for vertical search scenarios. Based on the above research, this paper designs a vertical search engine system for the tourism field, which is divided into three modules: data collection, indexing, and retrieval. In the data collection module, the Nutch crawler is used to crawl web pages, and an interface is implemented to parse pages of different formats. In the index module, a library of tourism themes is built first, then IKAnalyzer is used to segment Chinese sentences, after which the LDA model is used to extract the center words of a page and the relevance between the theme and the center words is calculated; finally, only topic-related pages are indexed. In the search module, the improved ranking algorithm is implemented. In the last chapter, measures such as Top-N precision and ranking effect are used to analyze the performance of the improved system.
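The abstract does not give the formula used to add the topic relevance score to PageRank, so the following is only a minimal sketch of one plausible approach, not the thesis's actual method. It assumes each page already has a relevance score in [0, 1] (for example from the LDA center-word step) and uses that score to bias both the random-jump distribution and the share of rank passed along each outgoing link. The function name `topic_pagerank`, the parameters, and the sample pages are all hypothetical.

```python
from typing import Dict, List


def topic_pagerank(
    links: Dict[str, List[str]],
    relevance: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Illustrative topic-weighted PageRank (an assumption, not the thesis formula).

    links maps each page to the pages it links to; every target must also
    appear as a key. relevance maps each page to a topic score in [0, 1].
    """
    pages = list(links)
    n = len(pages)
    total_rel = sum(relevance[p] for p in pages)
    # Random-jump distribution: biased toward topic-relevant pages,
    # falling back to uniform if every page has zero relevance.
    if total_rel > 0:
        teleport = {p: relevance[p] / total_rel for p in pages}
    else:
        teleport = {p: 1.0 / n for p in pages}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for src in pages:
            targets = links[src]
            if not targets:
                # Dangling page: spread its rank using the jump distribution.
                for p in pages:
                    new_rank[p] += damping * rank[src] * teleport[p]
                continue
            weight_sum = sum(relevance[t] for t in targets)
            for t in targets:
                # Each link receives a share proportional to the target's relevance.
                if weight_sum > 0:
                    share = relevance[t] / weight_sum
                else:
                    share = 1.0 / len(targets)
                new_rank[t] += damping * rank[src] * share
        rank = new_rank
    return rank


# Hypothetical three-page graph for a tourism-themed index.
sample_links = {
    "hotel": ["scenic", "news"],
    "scenic": ["hotel"],
    "news": ["hotel", "scenic"],
}
sample_relevance = {"hotel": 0.9, "scenic": 0.8, "news": 0.1}
ranks = topic_pagerank(sample_links, sample_relevance)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: {score:.4f}")
```

In this sketch the scores still sum to 1 after every iteration, and a page with low topic relevance (here `news`) receives less rank than it would under plain PageRank, which is the general effect the abstract describes for a vertical search scenario.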
Keywords/Search Tags:vertical search engine, Nutch, rank algorithm, PageRank, LDA