Research And Application Of Web Crawling Algorithm Based On Semantic Analysis

Posted on:2007-11-14

Degree:Master

Type:Thesis

Country:China

Candidate:J H Zhao

Full Text:PDF

GTID:2178360212957132

Subject:Computer application technology

Abstract/Summary:

In recent years, as the volume of information on the web has exploded, traditional scalable web crawlers have been unable to keep up with the rate at which information is updated. Because they crawl so broadly and pay little regard to whether the gathered information is relevant to a topic, they also cannot satisfy the increasingly rigorous and varied search requirements of different users. A focused web crawler, which collects information in specialized fields, does not need to index the whole web. It visits only the pages relevant to its topic, which avoids the problems caused by information inflation, and it has become a research hotspot in recent years.

This thesis takes the information management system of the Liaohe petroleum technique department as its research background. It categorizes web spider search strategies by the way they evaluate and predict the links obtained from the web, describes the principle and characteristics of each class of strategy, and discusses their advantages and disadvantages. It then presents a comprehensive-evaluation search strategy based on semantic analysis. Based on this strategy, a structural design model of a topic-oriented web spider is given and analyzed in detail.

Word sense disambiguation is the basis of topic semantic relevance calculation. Combining two HowNet-based word sense disambiguation strategies, category disambiguation and disambiguation based on semantic analysis, the thesis presents a word sense disambiguation algorithm in which four relations among semdicts are used to calculate the relevance between words and between a word and its context, thereby achieving disambiguation.

To judge the relevance between a URL and the topic, semantic computation based on HowNet is used. Combining the content-based crawling strategy with the link-structure crawling strategy, the thesis presents the SPageRank (Semantic PageRank) algorithm, which applies an extended-metadata semantic relevance algorithm to select and predict URLs that are relevant to the topic. The widely used vector space model is used to classify HTML pages by topic. Experimental results show that the web crawler based on SPageRank is more efficient and more accurate at gathering web pages relevant to a predefined set of topics.
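
The idea of ranking candidate URLs by both content relevance and link structure can be illustrated with a minimal sketch. This is not the thesis's SPageRank algorithm, whose details the abstract does not give. The sketch assumes plain term-frequency cosine similarity in the vector space model in place of the HowNet-based semantic relevance, and a hypothetical linear weight `alpha` for mixing it with a precomputed link score. The names `term_vector`, `cosine`, `combined_score`, and `Frontier`, the example URLs, and the link scores are all illustrative assumptions.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_score(content_sim, link_score, alpha=0.7):
    """Hypothetical linear mix of content relevance and link-structure score."""
    return alpha * content_sim + (1 - alpha) * link_score


class Frontier:
    """Priority queue of unseen URLs, highest combined score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Assumed topic description and candidate links (URL, surrounding text, link score).
topic = term_vector("petroleum oil drilling reservoir exploration")
candidates = [
    ("http://example.com/drilling",
     "oil drilling techniques for reservoir exploration", 0.2),
    ("http://example.com/sports",
     "football results and league tables", 0.9),
]

frontier = Frontier()
for url, context, link_score in candidates:
    content_sim = cosine(term_vector(context), topic)
    frontier.push(url, combined_score(content_sim, link_score))

first_url, first_score = frontier.pop()
print(first_url, round(first_score, 3))
```

With these assumed inputs the drilling page is popped first even though its link score is lower, because its text overlaps the topic vector while the sports page shares no terms with it. A crawler that ranked by link structure alone would have visited the off-topic page first.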

Keywords/Search Tags:

Focused Web Crawler, HowNet, Crawling Strategy, Extended Metadata

Related items

1	Research And Implementation Of Focused Crawler
2	Research And Implementation Of Focused Crawler Based On Word2Vec
3	Research Of Focused Crawling Strategy
4	Design And Implementation Of Focused Crawler To Application Store
5	The Extension Language King Figure Focused Crawling Extractor Experimental Studies
6	The Focused Web Crawling Strategy Based On Incremental Learning
7	Research And Implementation Of A Combined Focused Crawler Based On Protocol-Driven And Event-Driven Crawling Techniques
8	The Design And Implementation Of The Topic-focused Web Crawler System
9	Study On Focused Crawling Technique For Vertical Search Engine
10	Research On Focused Hidden Web Crawler