
The Study And Realization Of Vertical Search Engine Oriented On The Car Subject

Posted on: 2011-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: N Zhang
Full Text: PDF
GTID: 2178360305960900
Subject: Computer application technology
Abstract/Summary:
The Internet has become a vast information space. People mainly rely on general-purpose search engines such as Baidu to find information on it. Although such engines are powerful enough to meet basic needs, they are not good at returning information focused on a particular subject. Vertical search engines were proposed to solve this problem.
This thesis first introduces the features and working principles of vertical search engines, then analyzes in depth the system architecture of the open-source web crawler Heritrix. Based on this analysis, it proposes designing a dedicated extractor for specific websites and extending Heritrix's link processor to crawl specific links, thereby achieving customized crawling. It further proposes eliminating the effect that robots.txt has on some processors and adding a hash algorithm, thereby achieving efficient multi-threaded crawling.
The thesis uses Lucene as the full-text search engine. It first analyzes Lucene's system architecture and explains its inverted index and index structure in detail. It then shows that Lucene's original ranking algorithm considers only the content of web pages and therefore cannot reflect their importance. To address this, it adds the link-analysis-based PageRank algorithm to Lucene's original ranking algorithm so that the ranked results better match users' expectations.
Based on these results, and on the typical needs of car enthusiasts searching for car information, the thesis builds a vertical search engine for the car domain. It designs each subsystem and implements the improved crawler and ranking algorithm.
Finally, the thesis runs several tests on the resulting vertical search engine system.
First, a search test verifies the direct advantage of the vertical search engine over a general search engine. Next, a comparison of crawl speed between the original and improved crawlers, together with an analysis of the improved crawler's crawl speed under different thread counts and running times, verifies that the improved crawler is more efficient. Lastly, a comparison of the ranking results of the original and improved ranking algorithms verifies that the improved algorithm better satisfies users' expectations.
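The abstract does not say how PageRank is combined with Lucene's content-based score, so the following is only an illustrative sketch under stated assumptions, not the thesis's implementation. It computes PageRank by power iteration over a small link graph and then blends it with a content score using a weighted sum. The function names, the damping factor of 0.85, and the blending weight `alpha` are all hypothetical choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Each page passes an equal share of its rank to its targets.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def combined_score(content_score, page_rank, alpha=0.7):
    """Hypothetical blend (an assumption, not from the thesis):
    weighted sum of content relevance and link-based importance."""
    return alpha * content_score + (1.0 - alpha) * page_rank
```

With a graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, a page that many others link to ends up with a higher rank than a page with no incoming links, which is the kind of importance signal that a purely content-based score cannot capture.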
Keywords/Search Tags:vertical search engine, web crawler, Lucene, PageRank