Focused Web Crawling Technology

Posted on: 2003-04-04   Degree: Master   Type: Thesis
Country: China   Candidate: S T Li   Full Text: PDF
GTID: 2178360185995506   Subject: Computer software and theory
Abstract/Summary:
With information on the Web expanding rapidly, Web services have boomed accordingly. As a basic foundation and important component of these services, Web crawling is applied in search engines, site structure analysis, Web graph evolution, user interest mining, and personalized information retrieval. However, as users' requirements grow more demanding and more varied, traditional scalable Web crawling technology no longer satisfies them well: it cannot gather data adequately and in a timely manner, and it cannot meet personalization requirements accurately. We therefore study how to crawl information effectively within particular sections of the Web, which is known as focused Web crawling. Building on long-term experience in Web crawling and on current developments in focused crawling, this thesis proposes a structural design model for a focused Web crawler. The model comprises topic selection, initial URL selection, spider crawling, page analysis, relevance judgment between a URL and the topic, and relevance judgment between page content and the topic.
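The stages listed above can be sketched as a best-first crawl loop. This is only a minimal illustration under stated assumptions, not the thesis's actual design: the names `focused_crawl`, `fetch`, `extract_links`, `url_score`, and `page_score` are hypothetical, the scoring functions are placeholders, and the "Web" is a small in-memory dictionary so that the example runs without network access.

```python
import heapq

def focused_crawl(seed_urls, fetch, extract_links, url_score, page_score,
                  threshold=0.5, max_pages=100):
    """Best-first crawl: always expand the most promising known URL next.

    All callables are hypothetical stand-ins for the stages named in the
    abstract (spider crawling, page analysis, URL/topic relevance,
    page-content/topic relevance).
    """
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)                     # spider crawling
        fetched += 1
        if page is None:
            continue
        if page_score(page) >= threshold:     # page content vs. topic
            relevant.append(url)
        for link in extract_links(page):      # page analysis
            if link not in seen:
                seen.add(link)
                # URL vs. topic relevance decides the crawl order.
                heapq.heappush(frontier, (-url_score(link), link))
    return relevant

# Toy in-memory "Web" (an assumption for illustration): url -> (text, links).
TOY_WEB = {
    "a": ("web crawler topic", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("focused crawler design", []),
    "d": ("crawler notes", []),
}

def demo():
    return focused_crawl(
        seed_urls=["a"],
        fetch=TOY_WEB.get,
        extract_links=lambda page: page[1],
        url_score=lambda url: 0.5,  # placeholder: every link equally promising
        page_score=lambda page: 1.0 if "crawler" in page[0] else 0.0,
    )
```

Calling `demo()` crawls the toy graph from the seed `"a"` and returns the URLs whose text passes the placeholder relevance check. In a real system the two scoring functions would carry the actual topic model.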
To address the problems encountered during the research, we put forward several new rules, algorithms, and principles, as follows. We summarize the rules governing topic distribution on the Web in terms of the hub characteristic, the linkage/sibling locality characteristic, the topic-in-site characteristic, and the tunnel characteristic. We present methods for topic selection. We adopt a client/server structure for the Spider, achieving distributed, highly efficient information crawling. Based on an analysis of HTML syntax, we describe algorithms for extracting the title, hyperlinks, abstract, and content. For relevance judgment between a URL and the topic, we develop the IPageRank algorithm, based on the extended metadata methods UH, AMH, RW, and RWB and the hyperlink analysis method PageRank. For relevance judgment between page content and the topic, we apply the term-based vector space model. The experimental results show that our work is effective and that our system has strong application value, especially the IPageRank algorithm for relevance judgment between a URL and the topic, which represents a comparatively clear breakthrough.
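As an illustration of the term-based vector space model mentioned above, the following minimal sketch scores a page against a topic by the cosine similarity of their raw term-frequency vectors. The tokenizer, the absence of any term weighting such as TF-IDF, and the function names `term_vector` and `cosine_similarity` are assumptions made for this example. The abstract does not specify these details.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Map text to a term-frequency vector (lowercased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors, in [0, 1]."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)

def page_relevance(page_text, topic_text):
    """Hypothetical helper: relevance of a page to a topic description."""
    return cosine_similarity(term_vector(page_text), term_vector(topic_text))
```

A page sharing no terms with the topic description scores 0, and a page with the same term distribution scores 1. A crawler could compare this score with a threshold to decide whether the page is on topic.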
Keywords/Search Tags: Web, Information Crawling, Information Gathering, Topic, Limited, Search Engine, PageRank, IPageRank