
Research On Key Technology Of Subject Network Crawler

Posted on: 2019-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 2428330545960075
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the amount of information on the Web keeps increasing. People often use search engines such as Baidu, Google, and Sogou to find the information they want. Such engines are called general search engines: they serve all users and cover all kinds of information. As the amount of information on the Internet grows, the results returned to users may differ from what they actually want. Solving this problem requires a more specialized search engine for specific domains, that is, a vertical search engine. The topic web crawler is a key part of a vertical search engine, and this thesis studies its key technologies. The main research content is as follows:

(1) Extracting the topic content is an important step in recognizing the topic of a web page. Based on the distribution characteristics of web content and the related features of topic content, this thesis designs a method for extracting the topic content of a web page. The method first parses the web page into a DOM tree, then removes noise nodes through web page denoising, and finally extracts the topic content according to its distribution characteristics in the page.

(2) A topic recognition algorithm based on entity linking is proposed to identify the topic of a web page. A knowledge-base-based entity linking method is applied to feature extraction. First, the interface provided by the knowledge factory is used to segment the original corpus and identify the entities in it. Then entity linking is used to obtain entity-related information, and potential features are extracted from this information to form a candidate feature set. Next, the information gain method selects the final feature set from the candidate features. Finally, a naive Bayes classifier for web page topics is trained on the extracted feature set. Experiments show that this method improves the accuracy of topic page recognition.

(3) An improved topic search strategy based on the Best-First algorithm is proposed. The topic search strategy is what guides the topic web crawler in deciding which pages to crawl. This thesis adopts a topic search strategy based on the Best-First algorithm. Its main idea is to first select the most valuable link from the list of links to be crawled and crawl it, then extract the links from the crawled page and evaluate the value of each one. A link whose value is below the preset threshold is discarded; otherwise it is placed in the to-be-crawled queue, which is kept sorted by link value. This process is repeated until the crawl depth reaches the preset value or the crawl queue is empty.
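The Best-First strategy described in (3) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `best_first_crawl`, `fetch_links`, `score_link`, `threshold`, and `max_depth` are hypothetical, the "web" is a small in-memory dictionary rather than real pages, and the link values are made-up numbers standing in for whatever relevance measure a real crawler would compute. The sketch also assumes that reaching the preset depth means a page is still visited but its links are no longer expanded.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def best_first_crawl(
    seeds: List[str],
    fetch_links: Callable[[str], List[str]],  # hypothetical: returns out-links of a page
    score_link: Callable[[str], float],       # hypothetical: estimated link value
    threshold: float,
    max_depth: int,
) -> List[str]:
    """Visit pages in order of link value; return the visit order."""
    # heapq is a min-heap, so scores are negated to pop the best link first.
    # The counter breaks ties so that insertion order is preserved.
    frontier: List[Tuple[float, int, str, int]] = []
    counter = 0
    seen: Set[str] = set(seeds)
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url, 0))
        counter += 1

    visited: List[str] = []
    while frontier:  # stop when the crawl queue is empty
        _, _, url, depth = heapq.heappop(frontier)
        visited.append(url)
        if depth >= max_depth:  # assumed reading of the depth limit
            continue
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            value = score_link(link)
            if value < threshold:  # discard links below the preset threshold
                continue
            heapq.heappush(frontier, (-value, counter, link, depth + 1))
            counter += 1
    return visited


# Toy data (invented for illustration only).
graph: Dict[str, List[str]] = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["e"],
}
scores: Dict[str, float] = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.7, "e": 0.95}

order = best_first_crawl(
    seeds=["seed"],
    fetch_links=lambda url: graph.get(url, []),
    score_link=lambda link: scores.get(link, 0.0),
    threshold=0.3,
    max_depth=2,
)
print(order)
```

In this toy run, link `c` falls below the threshold and is never visited, and `d` is crawled before `b` because its value is higher even though it was discovered later. A priority queue is used here only as one convenient way to keep the queue sorted by link value.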
Keywords/Search Tags:Theme web crawler, entity link, Best-First algorithm, topic search strategy