Research On Topic-Oriented Web Crawling Technology Based On Semantic Analysis

Posted on: 2010-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
Full Text: PDF
GTID: 2178360275451481
Subject: Computer application technology
Abstract/Summary:
In recent years, as the volume of Web information has continued to grow exponentially, traditional general-purpose Web crawlers have been unable to keep their collections up to date. Because they crawl very widely and gather pages regardless of whether those pages are relevant to a topic, they cannot meet users' increasingly demanding and varied search requirements. A focused Web crawler, also called a topic-oriented Web crawler, collects information in a specialized field and does not need to index the whole Web. It accesses only the pages relevant to the topic, which mitigates the problems caused by the growth of Web information, and it has become an active research area in recent years.
This thesis surveys recent focused crawler technologies. To overcome the inherent limitations of traditional information gathering systems, it presents a new topic-oriented crawler model based on the distribution characteristics of topic pages on the Web and the working principles of information gathering technology. The model builds on semantic analysis and ontology theory and introduces several methods, including the use of an ontology to acquire domain knowledge. To improve efficiency and topical accuracy, the model uses semantic computation to filter the URLs and pages obtained from the Web. The topic-oriented crawler is further discussed with reference to popular open source technologies such as Heritrix.
Semantic computation is the basis of URL prediction and page filtering for the topics described in this thesis. The new model uses HowNet to compute the relevance between words, disambiguate polysemous words, and obtain the sets of meanings for the extended metadata of URL links, topics, and HTML pages. This semantic technology is discussed in further detail.
A new algorithm, KPageRank, is devised from a careful analysis of traditional PageRank. It integrates a content ranking strategy with a link structure ranking strategy, using semantic computation on URLs and page text, to select and predict more URLs that are relevant to the topics. The widely used vector space model is also adopted to analyze the relevance between pages and topics, in order to classify and extract relevant HTML pages from different fields. To obtain more relevant pages, semantic computation of the relevance between pages and topics is applied after the initial classification with the vector space model. The KPageRank algorithm and the semantic computation of page-topic relevance are the core parts of the thesis. Finally, the experimental results show that a focused Web crawler based on KPageRank and semantic computation on page text is more efficient and accurate in fetching Web pages relevant to a predefined set of topics.
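The abstract does not give the KPageRank formula, so the following is only a minimal sketch of the general idea it describes: scoring a page by combining a content relevance measure with a link structure measure. It assumes a plain term-frequency vector space model for content relevance and standard PageRank for link structure; the function names, the linear blend, and the weight `alpha` are illustrative assumptions, not the thesis's method, and the HowNet-based semantic computation is not modelled here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_relevance(page_text, topic_text):
    """Vector space model relevance: cosine of the two term-frequency vectors."""
    a, b = term_vector(page_text), term_vector(topic_text)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration; links maps page -> list of outlinks."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


def blended_scores(links, texts, topic, alpha=0.5):
    """Hypothetical blend (an assumption, not KPageRank itself):
    alpha * content relevance + (1 - alpha) * link rank scaled to [0, 1]."""
    rank = pagerank(links)
    if not rank:
        return {}
    top = max(rank.values())
    return {
        p: alpha * cosine_relevance(texts.get(p, ""), topic)
        + (1.0 - alpha) * rank[p] / top
        for p in rank
    }


if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    texts = {"a": "web crawler topic", "b": "cooking recipes", "c": "focused web crawler"}
    scores = blended_scores(links, texts, "web crawler")
    # Pages with the highest blended score would be fetched first.
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```

In a focused crawler, a score of this kind could order the URL frontier so that pages likely to be on topic are fetched first. A linear blend is only one way to combine the two signals.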
Keywords/Search Tags: Topics, Web Crawler, Relevance, KPageRank, HowNet