Research On Topical Spider

Posted on: 2011-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Chen
GTID: 2178360305482207
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of the Internet, the volume of online resources keeps increasing, and the ability to find the information one needs has become very important. Because general-purpose search engines have shortcomings, subject-specific vertical search engines have become a research focus. As the data source of a vertical search engine, the topical spider plays a key role in the system. Based on an analysis of automotive web pages, a topical spider named CarSpider was implemented.

In CarSpider, a feature vector composed of keywords was used to describe the theme. The ODP directory and search engines were used to select vehicle-related, authoritative URLs as seed links. Because HTML code is often irregular, HTML Tidy was used to normalize page code, and the DOM model was used to preprocess the web pages. CarSpider distinguished between types of web pages and used a different method to extract content from each type. URL analysis, crawling history, and statistical methods were used to determine the type of a page. When extracting relevant information, the vector space model was used to calculate the relevance between content blocks and the theme.

To improve crawling efficiency, content-based and network-structure-based methods were used to predict the relevance between a URL and the theme, and a self-adaptive crawling algorithm was proposed. In the content-based method, thematic relevance was calculated at three levels (website level, block level, and link level). In the network-structure-based method, because the PageRank algorithm is theme-insensitive and therefore not directly usable in focused crawling, CarSpider combined topic relevance with PageRank so that theme-related pages receive higher PageRank values. For duplicate detection, a Bloom filter was used to eliminate duplicate URLs, and feature vectors were used to eliminate duplicate pages.

Finally, CarSpider was tested on two aspects: crawling speed and accuracy. Analysis of the experimental data showed good results.
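As a rough illustration of the vector space model step mentioned above, the following sketch scores a content block against a keyword-weighted theme vector using cosine similarity. It is not CarSpider's actual code: the function names, the simple alphanumeric tokenizer, the raw term-frequency weighting, and the example keyword weights are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_relevance(block_text, topic_weights):
    """Cosine similarity between a content block and a topic keyword vector.

    topic_weights maps each theme keyword to a weight (hypothetical values).
    Returns 0.0 when either vector is empty.
    """
    block = term_vector(block_text)
    # Only keywords present in the topic vector contribute to the dot product.
    dot = sum(block[term] * weight for term, weight in topic_weights.items())
    norm_block = math.sqrt(sum(count * count for count in block.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_block == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_block * norm_topic)


# Hypothetical automotive theme vector, for illustration only.
CAR_TOPIC = {"car": 1.0, "engine": 0.8, "sedan": 0.6}

if __name__ == "__main__":
    print(cosine_relevance("New sedan with a turbo engine", CAR_TOPIC))
    print(cosine_relevance("Weather forecast for the weekend", CAR_TOPIC))
```

A crawler built along these lines could keep blocks whose score exceeds a chosen threshold and discard the rest; the threshold itself would have to be tuned on sample pages.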
Keywords: Search Engine, Topical Spider, DOM Tree, Vector Space Model