Research and Implementation of a Distributed Focused Crawler

Posted on: 2021-02-03, Degree: Master, Type: Thesis
Country: China, Candidate: W Y Shan, Full Text: PDF
GTID: 2428330620964173, Subject: Engineering
Abstract/Summary:
In the Internet era, collecting useful information from massive amounts of data is a key problem. The most widely used tools for retrieving and collecting information today are search engines built on general-purpose crawlers. However, the value density of the information a general crawler gathers is low. In response, researchers have proposed the focused crawler, which analyzes page content and chooses its crawling direction according to a crawling strategy. Compared with a general crawler, a focused crawler tries to avoid pages unrelated to the topic, stores fewer pages, and yields data with a higher value density, which makes it an effective information-collection tool.

Research on and application of focused crawlers began in the 1990s. Results so far fall mainly into content-based crawling strategies, represented by Fish Search and Shark Search, and link-structure-based crawling strategies, represented by PageRank and HITS. In addition, some researchers have proposed semantic crawlers based on a thesaurus or an ontology, which give focused crawlers the ability to perform semantic analysis in specific domains. In production use, representative crawlers include WebMagic, WebCollector and WebCollector-Hadoop. A semantic crawler has some ability to recognize synonyms and near-synonyms, which is an effective improvement over the vector space model. However, this ability is bounded by the coverage of the thesaurus or ontology. How to give a focused crawler the ability to recognize synonyms and near-synonyms in general, and to weigh them properly when evaluating page similarity, is an active research topic. How to reduce the running time of a crawling task through effective architecture design is another key issue.

To improve the focused crawler's ability to recognize synonyms and near-synonyms, this thesis proposes a similarity calculation method based on distributed word vectors. word2vec is used as the word vector generation model and is trained on a Wikipedia corpus. The core idea is to use the sets of word vectors of the topic and of the page, rather than a single document vector, as the basis for evaluating topic-page similarity. Because the similarity of each pair of words is taken into account, the focused crawler can recognize synonyms and near-synonyms more generally, the similarity between page and topic is evaluated more comprehensively, and the accuracy and recall of the crawler can be improved. To verify the method, multiple target websites and sets of topic words were selected, and the method was tested against the vector space model under the same conditions. In addition, to improve the operating efficiency of the focused crawler, this thesis designs a focused crawler architecture based on the microservice concept. To validate this design, a comparative experiment was performed in which WebMagic, WebCollector and WebCollector-Hadoop were compared with the proposed architecture under the same conditions.

The experiments show that, compared with the vector space model, the proposed page similarity calculation method improves the accuracy and recall of the focused crawler. Compared with WebMagic, WebCollector and WebCollector-Hadoop, the crawler designed in this thesis completes crawling tasks faster under the same conditions, that is, it works more efficiently. This thesis also develops a distributed focused crawler system that is easy to use. Overall, the design and improvements presented here are effective.

The focused crawler designed in this thesis still has room for improvement. First, the seed URLs used in the experiments were selected manually; second, non-text information on the page, such as pictures and videos, was not analyzed. How to select seed URLs automatically and intelligently, and how to analyze non-text page content effectively, are directions for future research on focused crawlers.
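As an illustration of the word-vector-set idea described in the abstract, the following Python sketch compares a topic and a page by pairwise word similarity instead of by a single document vector. The abstract does not give the exact aggregation formula, so the scoring rule here (for each topic word, take the best cosine match among the page words, then average) is an assumption, as are the function names and the tiny hand-made vectors, which merely stand in for a trained word2vec model.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# A real system would load vectors from a word2vec model trained on a corpus.
TOY_VECTORS = {
    "car":        (0.90, 0.10, 0.00),
    "automobile": (0.85, 0.15, 0.05),
    "engine":     (0.70, 0.30, 0.10),
    "banana":     (0.00, 0.20, 0.90),
    "fruit":      (0.05, 0.25, 0.85),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_set_similarity(topic_words, page_words, vectors):
    """Assumed scoring rule: average, over topic words, of the best
    pairwise cosine similarity with any page word."""
    topic_vecs = [vectors[w] for w in topic_words if w in vectors]
    page_vecs = [vectors[w] for w in page_words if w in vectors]
    if not topic_vecs or not page_vecs:
        return 0.0
    best = [max(cosine(t, p) for p in page_vecs) for t in topic_vecs]
    return sum(best) / len(best)


def exact_match_similarity(topic_words, page_words):
    """Baseline: cosine of binary bag-of-words vectors, which only
    rewards terms that appear verbatim in both topic and page."""
    t, p = set(topic_words), set(page_words)
    if not t or not p:
        return 0.0
    return len(t & p) / math.sqrt(len(t) * len(p))


if __name__ == "__main__":
    topic = ["car"]
    related_page = ["automobile", "engine"]
    unrelated_page = ["banana", "fruit"]
    print(exact_match_similarity(topic, related_page))
    print(word_set_similarity(topic, related_page, TOY_VECTORS))
    print(word_set_similarity(topic, unrelated_page, TOY_VECTORS))
```

With these toy vectors, the exact-match baseline gives the related page no credit because it never contains the literal word "car", while the pairwise score ranks it well above the unrelated page. This is the kind of near-synonym recognition the abstract attributes to the word-vector approach.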
Keywords/Search Tags: focused crawler, similarity, word vector