Research On Web Crawler Algorithm Based On Topic Strategy

Posted on: 2009-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Cai
GTID: 2178360272475126
Subject: Computer system architecture
Abstract/Summary:
As the Internet expands rapidly, more and more information retrieval is done through search engines, and retrieving information from massive volumes of data is becoming increasingly difficult. The core of a search engine is its web crawling strategy, which has been a focus of research and improvement. To address the problems search engines face, several research areas have emerged, such as directory search engines, general search engines, meta search engines, topic-specific search engines, and AI search engines.

This thesis first introduces the components of a search engine and the main principles of web crawlers. It analyzes topic-specific web crawling technology and the web page tunneling technique in relation to methods for assessing web pages, and presents several important crawler algorithms, including PageRank, HITS, fish search, shark search, best-first search, and A*. Building on these existing algorithms, a new method for assessing web page importance is developed: link analysis and text content relevance are merged to construct the web page core degree and the web page radiation space. The web page radiation space is then combined with the tunneling technique, and a topic-specific heuristic search leapfrog algorithm is derived mathematically. Finally, a general search strategy for a performance evaluation system for topic-specific crawlers is described, and the differences between existing algorithms and the Leapfrog Algorithm are tested and analyzed.

The first contribution is the new concept of web page radiation space. The traditional measures of web page importance, PageRank and HITS, are combined, and traditional text content similarity computation is retained as an important tool for analyzing and evaluating web page content. Compared with link analysis alone or similarity computation alone, the web page core degree carries a more comprehensive meaning.
Although computational complexity increases, the search range shrinks greatly and search precision improves correspondingly, so the web page core degree meets the demands of topic-specific search.

The second contribution is a set of web page tunneling algorithms. In traditional topic-specific crawling algorithms, local information is swamped by global information, and global connectivity is not distinguished from local connectivity. Web page tunnels are therefore divided into two types, connected tunneling and non-connected tunneling, and a corresponding algorithm is proposed for each.

The third contribution is the application of the A* algorithm to topic-specific crawling. By revising the heuristic function of A* with the web page radiation space and web page tunneling, a new heuristic search algorithm, named the Leapfrog Algorithm, is obtained. Mathematical analysis and experimental results indicate that this Heuristic Search Leapfrog Algorithm can reduce response time, improve harvest rate and target recall, and improve the performance of a topic-specific search engine.
Keywords/Search Tags: Topic-specific Strategy, Search Engine, Web Crawling Algorithm, Heuristic Search, Leapfrog Algorithm
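
The abstract describes merging link analysis with text similarity and letting the crawler tunnel through off-topic pages. The sketch below illustrates that general idea only; it is not the thesis's Leapfrog Algorithm, core degree, or radiation space, whose definitions the abstract does not give. It is a plain best-first traversal over a toy in-memory graph. The names `combined_score`, `crawl`, `alpha`, `threshold`, and `max_tunnel`, the linear blend of the two scores, and the sample data are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Bag-of-words cosine similarity between two strings."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_score(text, topic, link_score, alpha=0.5):
    """Assumed linear blend of a precomputed link score and text similarity."""
    return alpha * link_score + (1 - alpha) * cosine_similarity(text, topic)


def crawl(pages, seed, topic, threshold=0.3, max_tunnel=1, alpha=0.5):
    """Best-first traversal; returns relevant URLs in visit order.

    pages maps url -> (text, link_score, outlinks). A page scoring below
    `threshold` is still expanded while the run of consecutive off-topic
    pages on its path is at most `max_tunnel` (a simple tunneling budget).
    """
    # Heap entries: (negated priority, url, off-topic hops so far on the path).
    frontier = [(-1.0, seed, 0)]
    seen = {seed}
    relevant = []
    while frontier:
        _, url, hops = heapq.heappop(frontier)
        text, link_score, outlinks = pages[url]
        score = combined_score(text, topic, link_score, alpha)
        if score >= threshold:
            relevant.append(url)
            hops = 0  # back on topic: reset the tunnel counter
        else:
            hops += 1
            if hops > max_tunnel:
                continue  # tunnel budget exhausted: do not expand this page
        for nxt in outlinks:
            if nxt in pages and nxt not in seen:
                seen.add(nxt)
                # Children inherit the parent's score as their priority.
                heapq.heappush(frontier, (-score, nxt, hops))
    return relevant


# Toy data (hypothetical): url -> (text, link_score, outlinks)
PAGES = {
    "seed": ("web crawler search engine", 0.9, ["a", "b"]),
    "a": ("cooking recipes pasta", 0.1, ["c"]),
    "b": ("topic crawler algorithm", 0.5, []),
    "c": ("focused web crawler heuristic search", 0.6, ["d"]),
    "d": ("gardening tips", 0.0, ["e"]),
    "e": ("sports news", 0.0, ["f"]),
    "f": ("web crawler", 0.9, []),
}
TOPIC = "web crawler search"

if __name__ == "__main__":
    for budget in (0, 1, 2):
        print(budget, crawl(PAGES, "seed", TOPIC, max_tunnel=budget))
```

On this toy graph, a budget of 0 stops at the off-topic page `a`, a budget of 1 passes through `a` to reach `c`, and a budget of 2 also crosses `d` and `e` to reach `f`. A larger budget trades extra fetches of irrelevant pages for the chance to find relevant pages that sit behind them.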