Research On Crawling Techniques Of Focused Search Engine

Posted on: 2012-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: K Jiang
Full Text: PDF
GTID: 2218330362960125
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, network resources have grown quickly. General search engines are becoming less timely and less accurate at retrieving topic-specific information. A focused search engine collects only the information associated with a specific domain in order to build a focused Web information resource database, so it has high practical value and broad application prospects. This paper mainly discusses the techniques of the focused crawler, which plays an important role in a focused search engine.

This paper first describes the basic theory of general and focused crawlers, including their architecture and principles. We then analyze the distribution features of Web information. To improve the precision of the search engine, Web page noise must first be removed so that the body of the page can be extracted. After describing the text theory model and the advantages and disadvantages of various existing noise reduction algorithms, we propose a novel noise reduction algorithm based on visual rules to reduce the computational complexity of crawler systems and improve the precision of the search engine.

Because existing crawling strategies cannot resolve 'Topic Drift' and 'Topic Island', this paper presents a crawling strategy based on dynamic tunneling. By adding link analysis to content analysis, the new crawling strategy handles 'Topic Drift' better. By processing low-relevance links more deeply during link prediction and changing the crawling depth in the direction of low-relevance links, the new strategy can partly resolve 'Topic Island' and raise the recall rate of the focused search engine.

Finally, based on an analysis of the open source software Nutch, this paper gives the design and implementation of the proposed algorithms and tests various crawling strategies. The results show that the new noise reduction algorithm and crawling strategy can significantly improve the precision and recall rate of the focused search engine.
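The tunneling idea in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract does not give its scoring function or its rule for adjusting depth dynamically, so the sketch assumes a toy keyword-overlap score, uses the parent page's score as a crude link prediction, and substitutes a fixed bound `max_tunnel` for the dynamic depth. The names `relevance`, `focused_crawl`, `pages`, and `max_tunnel` are all hypothetical, and the in-memory `pages` dictionary stands in for real fetching.

```python
import heapq

def relevance(text, topic_terms):
    """Toy content score: fraction of topic terms that occur in the text."""
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)

def focused_crawl(pages, seeds, topic_terms, threshold=0.5, max_tunnel=1):
    """Best-first crawl over an in-memory graph with bounded tunneling.

    pages: dict mapping url -> (text, [outlinks]); a stand-in for fetching.
    Returns the relevant urls in the order they were visited.
    """
    # Heap entries are (-priority, url, tunnel_depth); heapq is a min-heap,
    # so the priority is negated to pop the most promising link first.
    frontier = [(-1.0, url, 0) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    while frontier:
        _, url, depth = heapq.heappop(frontier)
        text, links = pages[url]
        score = relevance(text, topic_terms)
        if score >= threshold:
            relevant.append(url)
            next_depth = 0  # back on topic: reset the tunnel counter
        else:
            next_depth = depth + 1
            if next_depth > max_tunnel:
                continue  # tunnel budget used up: do not expand this page
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                # Assumption: a link inherits its parent's score as priority.
                heapq.heappush(frontier, (-score, link, next_depth))
    return relevant

# Hypothetical graph: "island" is on topic but reachable only via an
# off-topic page ("hub"), which is the 'Topic Island' situation.
pages = {
    "seed": ("focused crawler search", ["a", "hub"]),
    "a": ("focused crawler tutorial", []),
    "hub": ("site navigation index", ["island", "junk"]),
    "island": ("focused crawler island", []),
    "junk": ("cooking recipes", []),
}
terms = ["focused", "crawler"]
no_tunnel = focused_crawl(pages, ["seed"], terms, max_tunnel=0)
with_tunnel = focused_crawl(pages, ["seed"], terms, max_tunnel=1)
```

With `max_tunnel=0` the crawl stops at the off-topic hub and never reaches the island page; allowing one off-topic hop lets it through, which is the recall gain the abstract attributes to tunneling. A larger bound also lets the crawl spend more fetches on off-topic pages, which is presumably why the thesis varies the depth rather than fixing it.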
Keywords/Search Tags: Focused crawler, Noise Reduction, Topic Drift, Topic Island, Tunneling