
Research On Topic-oriented Web Crawling Algorithm

Posted on: 2019-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: H F Zhang
GTID: 2438330563957630
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of Internet technology, the number of web pages and the volume of information on the Internet have grown rapidly, making information retrieval an important research topic. Current applications of information retrieval include network public opinion monitoring systems (NPOMS), search engines, and information management systems. The key problem in information retrieval is how to obtain the required information efficiently and quickly from this mass of network information. This thesis focuses on how to select the important websites to monitor under a specific topic for NPOMS.

The web crawler is the core technical tool of information retrieval, and a topic (focused) crawler filters topic-related information for a given topic. As the volume of network information grows, traditional topic crawlers tend to perform worse: they suffer from topic drift and high time cost, and they cannot recover the topological associations between pages from link information. To improve recall and precision and to reduce time cost, this thesis adopts a crawling strategy for classified key websites based on local topology. The idea is to combine network topology information with the text content of web pages, replacing global analysis with local analysis and static analysis with dynamic analysis.

The topological structure of the websites is obtained by simulating the topology of Internet pages. A standard thesaurus is established for each classification topic. A crawler tool fetches the text content of web pages and persists it locally. A word segmentation tool segments the page text, stop words are filtered out, and keywords are extracted with the TF-IDF algorithm to build a keyword thesaurus for each page. The two thesauri are then used to calculate the relevance between a web page and the topic.

Experiments start from a given seed page and use two parameters, page link information and topic relevance, to compute a static evaluation value that represents the importance of each page. During crawling, the static value of the parent page is normalized and used to weight the static value of each child page, which yields the dynamic comprehensive evaluation value of the child page; the next generation of the crawl sequence is then chosen by comparing these evaluation values. Through this process the link information of each site is learned continuously and moves monotonically closer to the true global topology, and the crawl eventually converges to the global optimum: the important sites. Simulation experiments that vary the distribution frequency of target pages in the network show that the local topology algorithm clearly improves recall and operating efficiency, and that the higher the density of target websites within the topic website cluster, the better the algorithm performs.
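
The pipeline described above (TF-IDF keywords, topic relevance, a static score built from link information and relevance, and a parent-weighted dynamic score driving a best-first crawl) can be illustrated with a small sketch. The abstract does not give the thesis's formulas, so everything concrete here is an assumption made for illustration: the function names, the smoothed IDF, cosine similarity as the relevance measure, locally observed in-degree as the link score, the mixing weight `alpha`, and normalizing the parent score by the largest static score seen so far. The "web" is an in-memory link graph, in the spirit of the simulation experiments, so no network access is involved.

```python
import heapq
import math
from collections import Counter, defaultdict


def tfidf_keywords(docs, top_k=5):
    """Return {page: {term: weight}} holding the top_k TF-IDF terms per page.

    docs maps a page id to its token list (already segmented, stop words removed).
    The smoothed IDF used here is an assumption, not the thesis's formula.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    result = {}
    for page, tokens in docs.items():
        tf = Counter(tokens)
        total = len(tokens) or 1
        weights = {t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                   for t, c in tf.items()}
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        result[page] = dict(top)
    return result


def relevance(page_terms, topic_terms):
    """Cosine similarity between page keyword weights and the topic thesaurus."""
    dot = sum(w * topic_terms.get(t, 0.0) for t, w in page_terms.items())
    norm_p = math.sqrt(sum(w * w for w in page_terms.values()))
    norm_t = math.sqrt(sum(w * w for w in topic_terms.values()))
    return dot / (norm_p * norm_t) if norm_p and norm_t else 0.0


def crawl(seed, links, docs, topic_terms, alpha=0.5, budget=10):
    """Best-first crawl over an in-memory link graph; returns the visit order."""
    keywords = tfidf_keywords(docs)
    in_links = defaultdict(set)  # local topology, learned as pages are visited

    def static_score(page):
        # Link part: in-degree observed so far, scaled by the largest one seen.
        max_in = max((len(s) for s in in_links.values()), default=0)
        link_part = len(in_links[page]) / max_in if max_in else 0.0
        rel = relevance(keywords.get(page, {}), topic_terms)
        return alpha * link_part + (1 - alpha) * rel

    frontier = [(-1.0, seed)]  # max-heap emulated with negated scores
    visited, order = set(), []
    max_static = 0.0
    while frontier and len(order) < budget:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for child in links.get(page, []):
            in_links[child].add(page)
        parent_static = static_score(page)
        max_static = max(max_static, parent_static)
        parent_norm = parent_static / max_static if max_static else 0.0
        for child in links.get(page, []):
            if child not in visited:
                # Dynamic score: child's static score weighted by its parent.
                dynamic = parent_norm * static_score(child)
                heapq.heappush(frontier, (-dynamic, child))
    return order
```

Calling `crawl(seed, links, docs, topic_terms)` with `links` as an adjacency dict, `docs` as token lists per page, and `topic_terms` as a term-to-weight dict returns the order in which pages would be visited. Because `in_links` is filled in only as pages are visited, the link part of the score reflects the locally known topology at each step instead of a precomputed global graph.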
Keywords/Search Tags: Best-First Search, focused crawler, network topology