Research On Topic-Oriented Web Crawling Technology Based On Semantic Analysis

Posted on:2010-06-13

Degree:Master

Type:Thesis

Country:China

Candidate:W Liu

Full Text:PDF

GTID:2178360275451481

Subject:Computer application technology

Abstract/Summary:

In recent years, as the volume of Web information has grown exponentially, traditional scalable Web crawlers have been unable to keep their collected information up to date. Because they crawl a very wide range of pages without regard to whether the gathered information is relevant to a topic, they cannot meet the increasingly demanding and varied search requirements of different users. A focused Web crawler, also called a topic-oriented Web crawler, collects information in specialized fields and does not need to index the entire Web. It accesses the Web pages that are relevant to the topic, which avoids the problems caused by the expansion of online information, and it has become a research hotspot in recent years.

This thesis surveys recent focused crawler technologies worldwide. To overcome the inherent limitations of traditional information gathering systems, it presents a new topic-oriented crawler model based on the distribution characteristics of topic pages on the Web and the working principles of information gathering technology. The model, which is based on semantic analysis and ontology theory, introduces several methods, including the use of an ontology to acquire domain knowledge. To crawl more efficiently and stay accurate to the topic, the model uses semantic computation to filter the URLs and pages obtained from the Web. The topic-oriented crawler is further discussed with the help of popular open source technologies such as Heritrix.

Semantic computation is the key to, and the basis of, predicting URLs and filtering pages according to the topics described in this thesis. The new model uses HowNet to compute the relevance between words, disambiguate polysemous words, and obtain the sets of meanings for the extended metadata of URL links, topics, and HTML pages. The thesis discusses this semantic technique in detail.

A new algorithm, KPageRank, is devised after careful analysis of traditional PageRank. It integrates a content ranking strategy with a link structure ranking strategy, based on semantic computation over URLs and page text, to select and predict more URLs on the Web that are relevant to the topics. The widely used vector space model is also adopted to analyze the relevance between pages and topics, so as to classify and extract relevant HTML pages from different fields. To obtain more relevant pages, a semantic computation of the relevance between pages and topics is applied after the first classification with the vector space model. The KPageRank algorithm and the semantic computation of page-topic relevance are the core parts of the thesis. Finally, experimental results show that the focused Web crawler based on KPageRank and semantic computation over page text is more efficient and accurate in fetching Web pages relevant to a predefined set of topics.
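
The abstract does not give the KPageRank formula, so the following is only a minimal sketch of the general idea it describes: scoring a page by combining a link-structure rank with a content relevance computed in the vector space model. It is not the author's algorithm. The function names, the `alpha` weight, the whitespace tokenizer, and the sample pages are all assumptions made for illustration, and plain term overlap stands in for the HowNet-based semantic computation.

```python
import math
from collections import Counter


def cosine_relevance(page_text, topic_text):
    """Vector space model: cosine similarity of term-frequency vectors."""
    page_vec = Counter(page_text.lower().split())
    topic_vec = Counter(topic_text.lower().split())
    shared = page_vec.keys() & topic_vec.keys()
    dot = sum(page_vec[t] * topic_vec[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in page_vec.values())) * math.sqrt(
        sum(v * v for v in topic_vec.values())
    )
    return dot / norm if norm else 0.0


def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration; `links` maps a page to its out-links."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


def blended_scores(links, texts, topic, alpha=0.5):
    """Hypothetical blend (not KPageRank itself):
    alpha * normalized link rank + (1 - alpha) * content relevance."""
    rank = pagerank(links)
    top = max(rank.values())
    return {
        p: alpha * rank[p] / top
        + (1.0 - alpha) * cosine_relevance(texts.get(p, ""), topic)
        for p in rank
    }


if __name__ == "__main__":
    # Made-up three-page graph and page texts, for illustration only.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    texts = {
        "a": "web crawler topic relevance",
        "b": "cooking recipes",
        "c": "focused web crawler",
    }
    scores = blended_scores(links, texts, "focused web crawler")
    for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

A crawler built along these lines would fetch the highest-scoring URLs first. Setting `alpha` closer to 1 favors link structure, while setting it closer to 0 favors page content.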

Keywords/Search Tags:

Topics, Web Crawler, Relevance, KPageRank, HowNet

Related items

1	Research Of Hownet Based Word Semantic Computation And Application
2	Design And Implementation Of Crawler Technology For Topics
3	Design And Implementation Of Multithreading Web Crawler Oriented Topic
4	Focused Crawler Based On Ant Colony Research And Implementation
5	The Design And Implementation Of The Topic-focused Web Crawler System
6	Research On The Topic Crawler Algorithm Based On Vector Space Model
7	Research On The Topical Crawler For The Cultural Fields
8	Optimization And Implement Of The Topic Web Crawler Correlation Algorithms
9	Design And Implementation Of The Theme Crawler For Procurement Clues In The Automotive Field
10	Investigation On Web Crawler Technology Based On Hadoop Platform