The Research And Implementation Of Topical Web Crawler Based On Improved Shark-Search Algorithm

Posted on: 2016-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: W Y Zhang
Full Text: PDF
GTID: 2308330461983088
Subject: Software engineering
Abstract/Summary:
How to extract the data we need from the information on the Internet has been an important subject ever since the network was born. As a solution to this problem, topical web crawlers have mainly been implemented in the following ways: heuristic methods based on text content, methods based on evaluating the link structure of the web, text classifiers based on machine learning, etc. The Shark-Search algorithm, a typical heuristic method, has been widely used because it is simple, efficient and extensible. However, the algorithm suffers from the "myopia problem" and the "tunneling problem", which lead to low recall. To address these shortcomings, this paper implemented an improved Shark-Search crawler named the NSKD (New Shark-search with Keywords Diffusion) crawler, using topic keywords diffusion and a URL dispatch strategy.

The improvements were made in two aspects: (1) Topic keywords diffusion based on the "Tongyici Cilin extended edition" and an improved version of its word similarity algorithm. When computing topic similarity, the NSKD crawler uses the improved word similarity algorithm to obtain the distance between every keyword in the web page context and every keyword of the specified topic, gathers these distances into a matrix called the "topic distance matrix", and then projects the matrix into a vector for comparison. The cosine distance between this comparison vector and the topic feature vector is taken as the topic similarity. NSKD thus replaces the simple keyword matching that Shark-Search uses to compute topic similarity and broadens the scope of comparison, which makes it possible for context that is highly similar to the topic but contains few matching keywords to receive a more appropriate evaluation. (2) A URL dispatch strategy based on level statistics, which disperses an overly concentrated processing scope by comparing the level of the URL currently being handled with the average level of the URLs in the queue. This strategy was designed to mitigate the "tunneling problem".

Finally, the paper tested the NSKD crawler with two groups of experiments: (1) Using the "topic classified news reduced edition" (SogouC.Reduce.20061127) released by Sogou Lab (http://www.sogou.com/labs/dl/c.html) as data to test the effectiveness of the topic keywords diffusion algorithm; the results show that it can clearly distinguish topic text from non-topic text. (2) Crawling the well-known Longteng translation BBS (http://www.ltaaa.com/bbs) to test the recall and precision of the NSKD crawler. The results show that recall increased by 32% while precision remained stable.
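The topic similarity computation described in aspect (1) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not state how the topic distance matrix is projected into a vector, so the column-wise maximum used here is an assumption, and `TOY_SIMILARITY`, `word_similarity`, the English keywords and the topic weights are all hypothetical stand-ins for the improved Tongyici Cilin word similarity.

```python
import math

# Hypothetical stand-in for the improved Tongyici Cilin word similarity:
# a tiny hand-written symmetric table with made-up scores in [0, 1].
TOY_SIMILARITY = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["car", "engine"]): 0.6,
    frozenset(["automobile", "engine"]): 0.6,
}


def word_similarity(a, b):
    """Return a similarity score in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset([a, b]), 0.0)


def topic_distance_matrix(page_keywords, topic_keywords):
    """One row per page keyword, one column per topic keyword."""
    return [[word_similarity(p, t) for t in topic_keywords]
            for p in page_keywords]


def project(matrix, width):
    """Assumed projection: keep the best-matching page keyword per topic keyword."""
    if not matrix:
        return [0.0] * width
    return [max(row[j] for row in matrix) for j in range(width)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarity(page_keywords, topic_keywords, topic_weights):
    """Matrix -> comparison vector -> cosine against the topic feature vector."""
    matrix = topic_distance_matrix(page_keywords, topic_keywords)
    compare_vector = project(matrix, len(topic_keywords))
    return cosine(compare_vector, topic_weights)


if __name__ == "__main__":
    topic = ["car", "engine"]
    weights = [1.0, 1.0]  # hypothetical topic feature vector
    print(topic_similarity(["automobile", "road"], topic, weights))
    print(topic_similarity(["road", "tree"], topic, weights))
```

In this sketch a page that mentions only "automobile" still scores above zero against the topic keyword "car", whereas exact keyword matching would find no overlap; this is the kind of case the keywords diffusion is meant to handle.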
Keywords/Search Tags: Web Crawler, Topic Similarity, Text Mining, Search Engine, Tongyici Cilin