
Research on Heritrix and a Vertical Search Engine Based on Lucene

Posted on: 2014-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: C C Zhang
Full Text: PDF
GTID: 2268330401973348
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the number of web pages has grown exponentially and the information they contain has become ever broader, so finding the information one needs has become increasingly difficult. Search engines emerged in response. General-purpose search engines such as Baidu and Google help people find some information, but they cannot meet more specialized needs. As a result, vertical search engines have sprung up in large numbers. This thesis modifies the crawl logic of the web crawler Heritrix and improves and extends its capabilities so that it crawls the content of specific pages on designated websites. It then studies Lucene's basic ranking algorithm and improves it by drawing on the ideas of the PageRank algorithm. First, we introduce the technologies used in vertical search engines, including web crawling, structured information extraction, Chinese word segmentation, and indexing and search. Second, we describe in detail the crawl configuration steps for the improved Heritrix crawler and extend its functionality in four respects. Third, we introduce the ideas behind the PageRank algorithm, improve Lucene's basic ranking algorithm to make it suitable for ranking web pages, and give an implementation of the algorithm. Finally, starting from a real-life problem and taking digital cameras as an example, we use these techniques to design and implement a vertical search engine, and run comparison tests in the engine on some of the improvements described in the thesis.
Keywords/Search Tags:Heritrix, Lucene, PageRank algorithm, page ranking
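
The abstract above describes improving a text-relevance ranking by drawing on PageRank. The following is a minimal illustrative sketch of that general idea, not the thesis's actual algorithm: it computes PageRank by power iteration over a small link graph and blends it with a text score. The function names, the linear blend, the `weight` parameter, and the sample graph are all assumptions made for illustration, and no Lucene or Heritrix API is used.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


def rerank(text_scores, rank, weight=0.3):
    """Order documents by a linear mix of text score and normalized PageRank.

    The linear mix and the default weight are hypothetical choices.
    """
    top = max(rank.values())
    combined = {
        doc: (1.0 - weight) * score + weight * rank.get(doc, 0.0) / top
        for doc, score in text_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical link graph and text scores for demonstration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
ordered = rerank({"a": 0.5, "c": 0.5}, ranks)
```

In this sketch, two pages with equal text scores are separated by their link-based rank. In a real engine the text score would come from the index's own scoring, and the blend weight would need tuning against test queries.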