
Unit-Based Focused Crawling

Posted on: 2007-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: R M Xin
GTID: 2178360182996430
Subject: Computer software and theory
Abstract/Summary:
With the Web expanding drastically and information appearing in ever more formats, satisfying users' information needs has become difficult. A focused crawler is a program that serves surfers' need to gather collections of pages on topics of interest. As a crucial part of a search engine, a focused crawler is also scheduled to gather fresh pages on a given topic and to update background databases.

Consider a common scenario: when a surfer looks for interesting pages, he starts from one page, then locates and clicks on links that lead to further pages of interest. While deciding for or against clicking a specific link (u -> v), a human uses a variety of clues on the source page u to estimate the worth of the unseen target page v, including the anchor text of the link referring to v, the DOM tree structure of u, the content of the region that contains the link, and so on. Needless to say, humans are good at discriminating between links based on these clues. A focused crawler imitates this human behavior to differentiate the links found on the referring page u and guarantees that the most probably relevant page is visited first. Compared to a general-purpose web crawler, which traverses the web automatically, a focused crawler is steered by a well-trained classifier and moves from page to page with the goal of maximizing the harvest rate.

Note that a focused crawler is entirely directed by its classifier, so the classifier's accuracy heavily influences the crawler's harvest rate. In other words, the harvest rate mainly depends on how well the classifier was trained. Unlike a traditional plain-text classifier, which applies a classic algorithm (SVM, NB) to training instances without any preprocessing, an HTML page classifier must first parse the HTML page and extract it into plain text, and only then apply the classic algorithm. During the process of parsing and extracting the HTML page and eliminating noise, the cases below frequently arise in pages.
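The best-first traversal described above can be sketched as a priority queue over unvisited links, each scored by a classifier using clues such as anchor text and surrounding context. The sketch below is illustrative only: `score_link` is a hypothetical stand-in (simple topic-term overlap) for the trained classifier (e.g. SVM or NB) the thesis refers to.

```python
import heapq

def score_link(anchor_text, context_text, topic_terms):
    """Toy relevance score: fraction of topic terms found in the anchor
    text and its surrounding context. A real focused crawler would use a
    trained classifier here instead of term matching."""
    text = (anchor_text + " " + context_text).lower()
    hits = sum(1 for term in topic_terms if term in text)
    return hits / len(topic_terms)

def crawl_order(frontier, topic_terms, limit):
    """Best-first ordering of the crawl frontier: always expand the most
    promising link next, so the most probably relevant page is visited
    first. `frontier` holds (url, anchor_text, context_text) tuples."""
    heap = []
    for url, anchor, context in frontier:
        # heapq is a min-heap, so negate the score for best-first order
        heapq.heappush(heap, (-score_link(anchor, context, topic_terms), url))
    visited = []
    while heap and len(visited) < limit:
        _, url = heapq.heappop(heap)
        visited.append(url)
    return visited
```

In a full crawler the loop would fetch each popped URL, push its outlinks back onto the heap, and track the harvest rate (fraction of fetched pages that are on-topic).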
A web page, especially a commercial one, usually consists of many information blocks. Apart from the main content, it usually has some irrelevant or noise...
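The preprocessing step mentioned above, extracting plain text from an HTML page while dropping noise blocks, can be sketched with the standard-library parser. The set of tags treated as noise here (`script`, `style`, `nav`) is an illustrative assumption, not a rule from the thesis.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything nested inside tags that
    typically hold noise rather than main content (assumed set below)."""
    NOISE_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth_in_noise = 0  # how many noise tags we are nested in
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self.depth_in_noise += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self.depth_in_noise:
            self.depth_in_noise -= 1

    def handle_data(self, data):
        if not self.depth_in_noise and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Parse an HTML string and return its noise-free plain text,
    ready to feed to a text classifier."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Identifying noise by tag name alone is crude; block-level heuristics over the DOM tree (link density, block position) are the kind of refinement unit-based approaches pursue.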
Keywords/Search Tags:Unit-Based