Focused Web Crawling Technology

Posted on:2003-04-04

Degree:Master

Type:Thesis

Country:China

Candidate:S T Li

Full Text:PDF

GTID:2178360185995506

Subject:Computer software and theory

Abstract/Summary:

With information on the Web expanding rapidly, many Web services have boomed accordingly. As a foundation and important component of these services, Web crawling is applied in search engines, site structure analysis, Web graph evolution, user interest mining, and personalized information retrieval. However, as people's requirements become more rigorous and more varied, traditional scalable Web crawling technology no longer satisfies them well: it cannot gather data adequately and in a timely manner, and it cannot meet personalization requirements accurately. We therefore study how to crawl information effectively within selected portions of the Web, which is known as focused Web crawling.
Building on long-term experience in the field of Web crawling, and drawing on current developments in focused Web crawling, this thesis proposes a structural design model for a focused Web crawler. The model mainly comprises topic selection, initial URL selection, spider crawling, page analysis, relevance judgment between URL and topic, and relevance judgment between page content and topic.
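The stages named above can be illustrated with a minimal sketch. It is not the thesis's design: the in-memory `TOY_WEB` graph, the word-overlap scorer `content_relevance`, the `threshold` value, and the choice to rank unvisited URLs by their parent page's score are all assumptions made for illustration, and no real fetching or HTML parsing is performed.

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("web crawler search engine", ["b", "c"]),
    "b": ("focused crawler topic relevance", ["d"]),
    "c": ("cooking recipes pasta", ["e"]),
    "d": ("topic crawler frontier", []),
    "e": ("pasta sauce", []),
}


def content_relevance(text, topic_terms):
    """Fraction of page words that are topic terms (placeholder scorer)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def focused_crawl(seeds, topic_terms, web, threshold=0.2, max_pages=10):
    """Best-first crawl that expands links only from pages judged on-topic."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # unknown URL: nothing to "fetch" in this toy setting
        text, links = web[url]  # stands in for spider crawling + page analysis
        score = content_relevance(text, topic_terms)
        if score < threshold:
            continue  # off-topic page: do not keep it or follow its links
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: URL relevance is approximated by the parent's score.
                heapq.heappush(frontier, (-score, link))
    return collected
```

Calling `focused_crawl(["a"], {"crawler", "topic", "web"}, TOY_WEB)` visits the on-topic pages and never reaches `"e"`, because its only parent `"c"` is rejected. A crawler that exploits the tunnel characteristic would instead allow a limited number of off-topic hops, which this sketch deliberately omits.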
To address the problems encountered in this research, we propose several new rules, algorithms, and principles: summarizing the rules of topic distribution on the Web in terms of the hub characteristic, the linkage/sibling locality characteristic, the topic-in-site characteristic, and the tunnel characteristic; presenting topic selection methods; adopting a client/server structure for the spider to achieve distributed, highly efficient information crawling; describing, based on analysis of HTML syntax, algorithms for extracting the title, hyperlinks, abstract, and content; developing the IPageRank algorithm for judging relevance between URL and topic, based on the extended metadata methods UH, AMH, RW, and RWB and the hyperlink analysis method PageRank; and applying the term-based vector space model for judging relevance between page content and topic.
The experimental results show that our work is effective and that our system has strong application value, especially the IPageRank algorithm for judging relevance between URL and topic, which represents a comparatively evident breakthrough.
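As a rough illustration of the two relevance signals mentioned in the abstract, the sketch below shows a term-based vector space comparison and a baseline PageRank computation. The abstract does not give the formulation of IPageRank, so the code implements only standard PageRank by power iteration; the function names, the raw term-frequency weighting, whitespace tokenization, and the damping value of 0.85 are assumptions, not details taken from the thesis.

```python
import math
from collections import Counter


def cosine_relevance(page_text, topic_text):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    page = Counter(page_text.lower().split())
    topic = Counter(topic_text.lower().split())
    dot = sum(page[t] * topic[t] for t in topic)
    norm = (math.sqrt(sum(v * v for v in page.values()))
            * math.sqrt(sum(v * v for v in topic.values())))
    return dot / norm if norm else 0.0


def pagerank(graph, damping=0.85, iterations=50):
    """Standard PageRank by power iteration; graph maps node -> out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # Ignore links that point outside the known graph.
            targets = [t for t in graph[node] if t in graph]
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank
```

In a focused crawler, a score such as `cosine_relevance(page_text, topic_description)` could decide whether a fetched page is kept, while link-based scores such as those from `pagerank` could help order the URLs still waiting to be crawled. How the thesis combines link analysis with the UH, AMH, RW, and RWB metadata methods is not stated in the abstract and is not modeled here.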

Keywords/Search Tags:

Web, Information Crawling, Information Gathering, Topic, Limited, Search Engine, PageRank, IPageRank

Related items

1	Design & Practice Of Topic-Specific Search Engine System
2	A Parallel System Of Incremental Web Information Retrieval
3	Topic Search Engine Key Technology Research
4	Search Engine Optimization Method Based On Pagerank
5	Research And Implementation Of The Strategy-Extensible Search Engine
6	Research On Topic Web Page Crawling Strategy For Vertical Search Engine
7	The Design And Implementation Of Vertical Search Engine Based On Lucene
8	Research Of Uighur Information Search Engine Based On Heritrix
9	Alert Based On Search Engine Technology Research And Implementation Of Information - Gathering System
10	Study On Topic-Specific Web Information Collection And Analysis Technology