Research On Topic-oriented Web Crawling Algorithm

Posted on:2019-04-09

Degree:Master

Type:Thesis

Country:China

Candidate:H F Zhang

Full Text:PDF

GTID:2438330563957630

Subject:Electronics and Communications Engineering

Abstract/Summary:


With the rapid development of Internet technology, the number of web pages and the volume of information on the Internet have grown rapidly. This growth has made information retrieval an important research topic. Its main applications today include network public opinion monitoring systems (NPOMS), search engines, and information management systems. The key problem in information retrieval is how to obtain the required information efficiently and quickly from this mass of network information. This thesis focuses on how an NPOMS should choose which important websites to monitor for a specific topic.

The web crawler is the core technical tool of information retrieval, and a topic (focused) crawler filters topic-related information for a given topic. As network information grows, traditional topic crawlers tend to perform worse: they suffer from topic drift and time overhead, and they cannot derive the topological relationships between pages from link information. To improve recall and precision and to reduce time overhead, this thesis adopts a local-topology-based crawling strategy for identifying key websites by category. The idea is to combine network topology information with the text content of web pages, substituting local analysis for global analysis and dynamic analysis for static analysis.

The topological structure of the websites is obtained by simulating the Internet page topology. A standard thesaurus is built for each topic category. Crawler tools fetch the text content of web pages and persist it locally. A word segmentation tool segments the page text, stop words are filtered out, and keywords are extracted with the TF-IDF algorithm to form a keyword thesaurus for each page. The two thesauri are then used to calculate the relevance between a page and the topic. Experiments start from a given seed page and use two parameters, page link information and topic relevance, to compute a static evaluation value that expresses the importance of each page. During crawling, the static value of the parent page is normalized and used to weight the static value of each child page, giving the child page's dynamic comprehensive evaluation value; the next generation of pages to crawl is then selected by comparing these evaluation values. Through this process, each site's link information is learned continuously and monotonically approaches the true global topology, until the process converges to the globally optimal solution: the important sites. Simulation experiments that vary the broadcast frequency of target pages in the network show that the local topology algorithm clearly improves recall and operating efficiency, and that the higher the density of target websites within the topic website cluster, the better the algorithm performs.
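The abstract does not give the thesis's formulas, so the following Python sketch is only one possible reading of the pipeline it describes, not the author's implementation. It assumes pages are already segmented and stop-word filtered, measures topic relevance as the share of page keywords found in the topic thesaurus, combines link information and relevance linearly with a hypothetical weight `alpha`, and treats the dynamic value of a child page as its static value multiplied by the parent's normalized static value. All function names and parameters are illustrative.

```python
import heapq
import math
from collections import Counter


def tf_idf_keywords(docs, top_k=3):
    """Return the top_k TF-IDF keywords of each pre-tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def topic_relevance(keywords, thesaurus):
    """Assumed measure: fraction of page keywords present in the topic thesaurus."""
    if not keywords:
        return 0.0
    return sum(1 for k in keywords if k in thesaurus) / len(keywords)


def static_score(in_links, max_in_links, relevance, alpha=0.5):
    """Assumed linear mix of normalized in-link count and topic relevance."""
    link_part = in_links / max_in_links if max_in_links else 0.0
    return alpha * link_part + (1 - alpha) * relevance


def dynamic_score(parent_static, max_static, child_static):
    """Weight the child's static value by the parent's normalized static value."""
    weight = parent_static / max_static if max_static else 0.0
    return weight * child_static


def best_first_order(graph, static, seed):
    """Visit pages reachable from seed, highest dynamic score first."""
    max_static = max(static.values())
    frontier = [(-static[seed], seed)]  # heapq is a min-heap, so negate scores
    seen = {seed}
    order = []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for child in graph.get(page, []):
            if child not in seen:
                seen.add(child)
                score = dynamic_score(static[page], max_static, static[child])
                heapq.heappush(frontier, (-score, child))
    return order


if __name__ == "__main__":
    docs = [["crawler", "topic", "crawler"], ["weather", "topic"]]
    print(tf_idf_keywords(docs, top_k=1))
    graph = {"seed": ["a", "b"], "a": ["c"], "b": []}
    static = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.8}
    print(best_first_order(graph, static, "seed"))
```

The priority queue is what makes this a best-first search: a child reached through a weak parent is pushed with a lower dynamic value, so links from pages with low static values are explored later even if the child itself looks relevant.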

Keywords/Search Tags:

Best-First Search, focused crawler, Network topology


Related items

1	Research On Topic-oriented Web Crawling Algorithm
2	The Research On Focused Crawling Algorithm In Vertical Search Engine
3	Research On Topic Focused Web Crawler And Related Technologies
4	Research And Implementation On Focused Crawler With Search Strategy
5	Distributed Focused Crawler Based On Improved Tabu Search Strategy
6	Realization Of Focused Crawler And Research Of Its Key Technologies
7	Customizable Focused Crawler
8	Research On A Method Of Focused Crawler For Vertical Search System
9	Research And Implement Of Focused-crawler Relevance Algorithm In Search Engine
10	Research And Design On Focused Crawler Of Search Engine