Research And Implementation Of Distributed Focused Crawler

Posted on: 2021-02-03

Degree: Master

Type: Thesis

Country: China

Candidate: W Y Shan

GTID: 2428330620964173

Subject: Engineering

Abstract/Summary:

In the Internet era, collecting useful information from massive amounts of data is a key problem. The most widely used tools for retrieving and collecting information today are search engines built on general-purpose crawlers. However, the value density of the information a general crawler gathers is low. In response, researchers have proposed the focused crawler, which analyzes page content and chooses its crawling direction according to a crawling strategy. Compared with a general crawler, a focused crawler tries to avoid pages that are unrelated to the topic, stores fewer pages, and yields data with a higher value density, which makes it an effective information-collection tool.

Research on and application of focused crawlers began in the 1990s. Results so far fall mainly into content-based crawling strategies and link-structure-based crawling strategies; the former is represented by Fish Search and Shark Search, and the latter by PageRank and HITS. In addition, some researchers have proposed semantic crawlers based on a thesaurus or an ontology, which give focused crawlers the ability to perform semantic analysis in specific domains. In production use, representative crawlers include WebMagic, WebCollector and WebCollector-Hadoop.

A semantic crawler has some ability to recognize synonyms and near-synonyms, which is an effective improvement on the vector space model. However, this ability is limited by the coverage of the thesaurus or ontology. How to give a focused crawler the ability to recognize synonyms and near-synonyms more generally, and to weigh them better when evaluating page similarity, is an active research topic. How to reduce the time a crawling task takes through effective architecture design is another key issue.

To improve the ability of the focused crawler to recognize synonyms and near-synonyms, this thesis proposes a similarity calculation method based on distributed word vectors. word2vec is used as the word vector generation model, and the model is trained on a Wikipedia corpus. The core idea is to use the sets of word vectors of the topic and of the page, rather than document vectors, as the basis for evaluating the similarity between topic and page. During the calculation, the similarity of each pair of words is taken into account, which lets the focused crawler recognize synonyms and near-synonyms more generally, so that the similarity between page and topic is evaluated more comprehensively and the accuracy and recall of the focused crawler are expected to improve. To verify the method, multiple target websites and sets of topic words were selected, and the method was tested against the vector space model under the same conditions.

In addition, to improve the operating efficiency of the focused crawler, this thesis designs a focused crawler architecture based on the microservice concept. To validate the architecture, a comparative experiment was performed in which WebMagic, WebCollector and WebCollector-Hadoop were compared with the proposed architecture under the same conditions.

The experiments show that, compared with the vector space model, the proposed page similarity calculation method improves the accuracy and recall of the focused crawler. Compared with WebMagic, WebCollector and WebCollector-Hadoop, the crawler designed in this thesis completes crawling tasks faster under the same conditions, that is, it works more efficiently. Moreover, this thesis develops a distributed focused crawler system that is easy to use. Overall, the design and improvements of the focused crawler presented here are effective.

The focused crawler designed in this thesis still has room for improvement. First, the seed URLs used in the experiments were selected manually; second, non-text information on the page, such as pictures and videos, was not analyzed. How to select seed URLs automatically and intelligently, and how to analyze non-text information on the page effectively, are directions for future research on focused crawlers.
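
The contrast between the vector space model and a word-vector-set similarity can be illustrated with a minimal sketch. The abstract does not give the exact aggregation formula, so the rule used here (for each topic word, take the best cosine match among the page words, then average) is only an assumption, and the function names are hypothetical. The three-dimensional vectors are invented toy values standing in for embeddings that would, in the thesis, come from a word2vec model trained on Wikipedia.

```python
import math

# Toy vectors standing in for trained word2vec embeddings.
# ASSUMPTION: these numbers are made up purely for illustration.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
    "fruit": [0.05, 0.25, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def vsm_similarity(topic_words, page_words):
    """Baseline: cosine between term-frequency vectors (vector space model)."""
    vocab = sorted(set(topic_words) | set(page_words))
    topic_tf = [topic_words.count(w) for w in vocab]
    page_tf = [page_words.count(w) for w in vocab]
    return cosine(topic_tf, page_tf)


def word_set_similarity(topic_words, page_words, vectors):
    """Compare the two sets of word vectors pair by pair.

    ASSUMPTION: the aggregation (best match per topic word, then mean)
    is one plausible choice, not necessarily the one used in the thesis.
    """
    topic_vecs = [vectors[w] for w in topic_words if w in vectors]
    page_vecs = [vectors[w] for w in page_words if w in vectors]
    if not topic_vecs or not page_vecs:
        return 0.0
    best_matches = [max(cosine(t, p) for p in page_vecs) for t in topic_vecs]
    return sum(best_matches) / len(best_matches)


topic = ["car"]
synonym_page = ["automobile"]        # shares no literal term with the topic
unrelated_page = ["banana", "fruit"]

print("VSM, synonym page:      ", vsm_similarity(topic, synonym_page))
print("Word set, synonym page: ", word_set_similarity(topic, synonym_page, TOY_VECTORS))
print("Word set, unrelated page:", word_set_similarity(topic, unrelated_page, TOY_VECTORS))
```

With these toy values the term-frequency baseline scores the synonym page as 0 because no term overlaps, while the pairwise word-vector comparison ranks it well above the unrelated page. This is the behavior the abstract attributes to the proposed method.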

Keywords/Search Tags:

focused crawler, similarity, word vector

Related items

1	Research On Search Strategy And Key Techniques Of Focused Crawler
2	Research On The Key Technology Of Focused Crawler System
3	Research On Topic Focused Web Crawler And Related Technologies
4	Research And Implementation Of Focused Crawler
5	Research On Focused Crawler Based On SVM Classification Algorithm
6	Research And Implement Of Focused-crawler Relevance Algorithm In Search Engine
7	Design And Implementation Of Distributed Focused Crawler System For Text Data
8	The Design And Implementation Of The Topic-focused Web Crawler System
9	Focused Crawler Based On Domain Ontology And Similarity Concept Context Graph
10	Research And Implementation Of On Semi-automatic Ontology Construction Base On WordNet And Focused Crawler