Distributed Focused Crawler Based On Improved Tabu Search Strategy

Posted on: 2021-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Gu
Full Text: PDF
GTID: 2428330647952830
Subject: Software engineering
Abstract/Summary:
Information on the Internet is complex and diverse, and users rely on search engines to obtain information in a given field. A focused crawler (FC) provides key technical support for information retrieval by crawling as many topic-related pages as possible from the Web. It is therefore particularly important to improve the global search capability of FC technology and to design an efficient, stable, and accurate crawler system. Among meteorological disasters, frequent rainstorms and typhoons cause immeasurable damage, and the Web holds a large amount of textual information about them. To obtain information about rainstorm and typhoon disasters efficiently and accurately from this mass of web pages, this paper takes rainstorm and typhoon disasters as its topic and designs a distributed focused crawler (DFC) system on the Hadoop big data platform. The main research contents and methods are as follows:

1) To address the topic description problem in FC technology, a method that describes the topic with a domain ontology is proposed. First, the topic semantic vector is obtained by constructing the domain ontology and computing ontology semantic similarity, and the web page text feature vector is obtained by summing the products of the weights and the normalized term frequencies in different HTML positions. Then, the vector space model is used to compute the topic relevance of web pages. To determine the comprehensive priority of a link, the topic relevance of the link's anchor text and the PR value of the web page that the link points to are also calculated. Focused crawler experiments on the topic of rainstorm and typhoon disasters show that this method effectively prevents topic drift and improves the accuracy of the crawler system.

2) To address the crawling strategy problem in FC technology, this paper proposes a focused crawler method (On-ITS) that combines ontology with improved Tabu search. The global ontology and the local ontology are used to filter links multiple times, and a retrospective tunnel traversal method is added to widen the crawler's search path and improve the global search ability of the crawler system. The result is a focused crawler method (RO-ITS) that combines the On-ITS method with the retrospective tunnel traversal method. On the topic of rainstorm and typhoon disasters, a comparison of the experimental results of the proposed method with those of other algorithms in the literature shows that the RO-ITS strategy crawls more topic-related web pages.

3) To address the crawling efficiency problem in FC technology, a DFC system based on the Hadoop platform is built and implemented. By introducing the RO-ITS strategy into the MapReduce computing model, the page scraping module, page parsing module, and link processing module of the system are designed and implemented, and HDFS is used to store the data. Experimental tests show that the DFC system runs stably and has a high crawl rate, and that its web crawling efficiency is significantly higher than that of a stand-alone crawler system.
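
The relevance computation in the first contribution can be illustrated with a minimal Python sketch. It is not the thesis implementation. The position weights, the max-count normalization of term frequencies, the linear mix of anchor relevance and PR value, and all names (`POSITION_WEIGHTS`, `page_vector`, `cosine`, `link_priority`, `alpha`) are assumptions made for illustration only; the abstract gives none of these details. The topic vector is taken as given, since deriving it from the domain ontology is outside what the abstract specifies.

```python
import math
from collections import Counter

# Hypothetical weights for HTML positions; the abstract does not give values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def page_vector(terms_by_position, vocabulary):
    """Build a page feature vector over `vocabulary`.

    Each component sums, over HTML positions, the position weight times the
    term frequency normalized by the largest count in that position
    (the normalization scheme is an assumption).
    """
    vec = [0.0] * len(vocabulary)
    for position, terms in terms_by_position.items():
        if not terms:
            continue
        counts = Counter(terms)
        max_count = max(counts.values())
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for i, term in enumerate(vocabulary):
            vec[i] += weight * counts[term] / max_count
    return vec


def cosine(u, v):
    """Vector space model relevance: cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_priority(anchor_relevance, pr_value, alpha=0.7):
    """Combine anchor-text relevance with a PR value scaled to [0, 1].

    The linear combination and the default alpha are assumptions.
    """
    return alpha * anchor_relevance + (1.0 - alpha) * pr_value


if __name__ == "__main__":
    vocabulary = ["rainstorm", "typhoon", "flood"]
    topic = [1.0, 0.9, 0.5]  # placeholder topic semantic vector
    page = {
        "title": ["typhoon"],
        "body": ["typhoon", "rainstorm", "typhoon"],
    }
    relevance = cosine(page_vector(page, vocabulary), topic)
    print(f"page relevance: {relevance:.3f}")
    print(f"link priority: {link_priority(relevance, 0.2):.3f}")
```

In a crawler built along these lines, links would be queued by `link_priority`, and pages whose relevance falls below a chosen threshold would be discarded to limit topic drift.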
Keywords/Search Tags: Tabu search, Ontology, Focused crawler, Hadoop, Meteorological disaster