
Unit-Based Focused Crawling

Posted on: 2007-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: R M Xin
GTID: 2178360182996430
Subject: Computer software and theory
Abstract/Summary:
With the Web expanding drastically and information appearing in ever more formats, satisfying users' information needs has become difficult. A focused crawler is a program that serves surfers' need to gather collections of pages on topics of interest. As a crucial part of a search engine, a focused crawler is also scheduled to gather fresh pages on a given topic and to update background databases.

Consider a common scenario: when a surfer looks for interesting pages, he starts from one page, then locates and clicks on links that lead to further pages of interest. While deciding for or against clicking a specific link (u -> v), a human uses a variety of clues on the source page u to estimate the worth of the unseen target page v, including the anchor text of the link referring to v, the DOM tree structure of u, the content of the region that contains the link, and so on. Needless to say, humans are good at discriminating between links based on these clues. A focused crawler imitates this human behavior to differentiate the links found on the referring page u and guarantees that the most probably relevant page is visited first. Compared to a general-purpose web crawler, which traverses the web automatically, a focused crawler is steered by a well-trained classifier and moves from page to page with the goal of maximizing the harvest rate.

Note that a focused crawler is entirely directed by its classifier, so the classifier's accuracy heavily influences the crawler's harvest rate. In other words, the harvest rate mainly depends on how well the classifier was trained. Unlike a traditional plain-text classifier, which applies a classic algorithm (SVM, NB) to training instances without any preprocessing, an HTML page classifier must first parse the HTML page and extract it into plain text, and only then apply the classic algorithm. During the process of parsing and extracting the HTML page and eliminating noise, the cases below frequently arise in pages.
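The best-first traversal described above can be sketched as a priority queue over unvisited links, each scored by a classifier using clues such as anchor text and surrounding context. The sketch below is illustrative only: `score_link` is a hypothetical stand-in (simple topic-term overlap) for the trained classifier (e.g. SVM or NB) the thesis refers to.

```python
import heapq

def score_link(anchor_text, context_text, topic_terms):
    """Toy relevance score: fraction of topic terms found in the anchor
    text and its surrounding context. A real focused crawler would use a
    trained classifier here instead of term matching."""
    text = (anchor_text + " " + context_text).lower()
    hits = sum(1 for term in topic_terms if term in text)
    return hits / len(topic_terms)

def crawl_order(frontier, topic_terms, limit):
    """Best-first ordering of the crawl frontier: always expand the most
    promising link next, so the most probably relevant page is visited
    first. `frontier` holds (url, anchor_text, context_text) tuples."""
    heap = []
    for url, anchor, context in frontier:
        # heapq is a min-heap, so negate the score for best-first order
        heapq.heappush(heap, (-score_link(anchor, context, topic_terms), url))
    visited = []
    while heap and len(visited) < limit:
        _, url = heapq.heappop(heap)
        visited.append(url)
    return visited
```

In a full crawler the loop would fetch each popped URL, push its outlinks back onto the heap, and track the harvest rate (fraction of fetched pages that are on-topic).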
A web page, especially a commercial one, usually consists of many information blocks. Apart from the main content, it usually has some irrelevant or noise...
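The preprocessing step mentioned above, extracting plain text from an HTML page while dropping noise blocks, can be sketched with the standard-library parser. The set of tags treated as noise here (`script`, `style`, `nav`) is an illustrative assumption, not a rule from the thesis.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything nested inside tags that
    typically hold noise rather than main content (assumed set below)."""
    NOISE_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth_in_noise = 0  # how many noise tags we are nested in
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self.depth_in_noise += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self.depth_in_noise:
            self.depth_in_noise -= 1

    def handle_data(self, data):
        if not self.depth_in_noise and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Parse an HTML string and return its noise-free plain text,
    ready to feed to a text classifier."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Identifying noise by tag name alone is crude; block-level heuristics over the DOM tree (link density, block position) are the kind of refinement unit-based approaches pursue.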
Keywords/Search Tags:Unit-Based