Research And Design Of Filtrating Web Crawler

Posted on: 2008-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
Full Text: PDF
GTID: 2178360242978863
Subject: Systems Engineering
Abstract/Summary:
A web crawler is a system that automatically retrieves web pages from the Internet. It downloads pages on behalf of a search engine, which makes it an essential search engine component. The crawler of a general-purpose search engine starts from a set of seed links. The crawler of a domain-specific search engine does the same, but it can also evaluate the links and the content of the pages it fetches, so it is called a focused web crawler. The main goals of a focused web crawler are to collect as many pages relevant to a given topic as possible and to prepare that data for user queries. Focused crawling has become an active research topic in search engine technology.

We study the focused web crawler from another angle, filtering technology, and therefore call the resulting system a filtering web crawler. First, we introduce the main functions of a web crawler and the current state of crawler technology. Second, we study filtering from two aspects. (1) Link filtering: we introduce the concept of a link colony, classify link colonies as single-pattern or multi-pattern, and, after analyzing the traditional algorithm, give a link-filtering algorithm. (2) Page-content filtering, in three parts: (a) we propose a method for distinguishing website styles based on the characteristics of their content; (b) we select the feature words of a web page by calculating tag weights, then construct the vector space model (VSM) of the page and compute its similarity to a topic VSM prepared in advance; (c) based on an analysis of how unstructured data is classified, we use a naive Bayes classifier to determine the topic type of a web page. Finally, we design and implement a filtering web crawler system and describe its main modules and techniques.
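The content-filtering step in (2)(b) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the tag weights, the `top_k` cutoff, the relevance threshold, and every function and class name below are assumptions chosen for illustration, since the abstract gives none of them. The sketch weights each term by the HTML tag that encloses it, keeps the heaviest terms as the page's feature words, and compares the resulting vector with a topic vector using cosine similarity.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}          # text here is never page content
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # have no end tag


class TagWeightedTerms(HTMLParser):
    """Accumulate a weight per term, scaled by the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, outermost first
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.stack):
            return
        innermost = self.stack[-1] if self.stack else None
        weight = TAG_WEIGHTS.get(innermost, DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def page_vector(html, top_k=20):
    """Return the top_k heaviest terms of a page as a term -> weight dict."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights.most_common(top_k))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_relevant(html, topic_vector, threshold=0.3):
    """Keep a page only if it is similar enough to the topic vector."""
    return cosine_similarity(page_vector(html), topic_vector) >= threshold
```

A crawler built along these lines would call `is_relevant` on each downloaded page and discard pages that fall below the threshold. The sketch tokenizes ASCII words only, so Chinese pages would need a word segmenter in place of the regular expression.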
Keywords/Search Tags: Web Crawler, Pattern Matching, Classification Methods