
Research And Application Of Web Crawling Algorithm Based On Semantic Analysis

Posted on: 2007-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zhao
Full Text: PDF
GTID: 2178360212957132
Subject: Computer application technology
Abstract/Summary:
In recent years, as the volume of information on the web has grown explosively, traditional general-purpose web crawlers have been unable to keep up with the pace of updates. Because they crawl so broadly and pay little attention to whether the gathered pages are relevant to a topic, they cannot satisfy the increasingly demanding and varied search requirements of different users. A focused web crawler collects information in a specialized field and does not need to index the whole web. It visits only the pages that are relevant to the topic, which avoids the problems caused by information overload, and it has become a research hotspot in recent years.

This thesis takes the information management system of the Liaohe petroleum technique department as its research background. It categorizes web spider search strategies by the way they evaluate and predict the links obtained from the web, describes the principles and characteristics of each class of strategy, and discusses their advantages and disadvantages. It then presents a comprehensive-evaluation search strategy based on semantic analysis. Building on this strategy, it gives a structural design model of a topic-oriented web spider and analyzes it in detail.

Word sense disambiguation is the basis of topic semantic relevance calculation. Combining two HowNet-based word sense disambiguation strategies, one based on category disambiguation and one based on semantic analysis, the thesis presents a word sense disambiguation algorithm in which four relations among semdicts are used to calculate the relevance between words and the relevance between a word and its context, thereby achieving word sense disambiguation.

To judge the relevance between a URL and the topic, the thesis uses semantic computation based on HowNet.
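The idea of a focused crawler that scores URLs against a topic before visiting them can be illustrated with a minimal best-first sketch. This is not the thesis's algorithm: the `relevance` function below is a hypothetical token-overlap stand-in for the HowNet-based semantic computation, and `focused_crawl`, `fetch`, `threshold`, and `max_pages` are names assumed only for this illustration.

```python
import heapq
import re


def relevance(text, topic_terms):
    """Hypothetical stand-in for HowNet-based semantic relevance:
    the fraction of topic terms that occur in the text."""
    if not topic_terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & topic_terms) / len(topic_terms)


def focused_crawl(seed, fetch, topic_terms, threshold=0.3, max_pages=10):
    """Best-first crawl.

    fetch(url) is an assumed callback returning
    (page_text, [(link_url, anchor_text), ...]).
    Returns the visited URLs whose page text met the threshold.
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    relevant = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        visited += 1
        text, links = fetch(url)
        if relevance(text, topic_terms) < threshold:
            continue  # off-topic page: do not expand its links
        relevant.append(url)
        for link, anchor in links:
            if link not in seen:
                seen.add(link)
                # Predict the link's relevance from its URL and anchor text.
                score = relevance(link + " " + anchor, topic_terms)
                heapq.heappush(frontier, (-score, link))
    return relevant
```

In this sketch the frontier is ordered by predicted relevance, so links whose URL or anchor text looks on-topic are fetched first, and links found on off-topic pages are never queued.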
Combining a content-based crawling strategy with a link-structure crawling strategy, the thesis presents the SPageRank (Semantic PageRank) algorithm, which applies an extended-metadata semantic relevance algorithm to select and predict URLs that are relevant to the topic. The widely used vector space model is used to classify HTML pages by topic. Experimental results show that the web crawler based on SPageRank is more efficient and more accurate at gathering web pages relevant to a predefined set of topics.
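The abstract does not give the SPageRank formula, so the following is only a sketch of one common way to combine content relevance with link structure: a topic-biased PageRank whose teleport distribution is proportional to each page's vector space model cosine similarity to the topic. The function names, the whitespace tokenization, and the damping value are assumptions made for this illustration, not details taken from the thesis.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def topic_biased_pagerank(links, texts, topic, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; every target must be a key.
    texts: {page: page text}; topic: topic description string."""
    topic_vec = Counter(topic.lower().split())
    rel = {p: cosine(Counter(texts[p].lower().split()), topic_vec)
           for p in links}
    total = sum(rel.values())
    n = len(links)
    # Teleport distribution: proportional to relevance,
    # falling back to uniform when no page matches the topic.
    teleport = {p: (rel[p] / total if total else 1.0 / n) for p in links}
    rank = dict(teleport)
    for _ in range(iterations):
        new = {p: (1 - damping) * teleport[p] for p in links}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank along the teleport vector.
                for q in links:
                    new[q] += damping * rank[p] * teleport[q]
        rank = new
    return rank
```

With this choice, two pages that are linked identically still receive different scores when only one of them matches the topic, which is the kind of behavior a crawler would need in order to rank candidate URLs by both content and link structure.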
Keywords/Search Tags:Focused Web Crawler, HowNet, Crawling Strategy, Extended Metadata