Research And Application Of Efficient Data Acquisition Methods For Domain Data

Posted on:2022-07-05

Degree:Master

Type:Thesis

Country:China

Candidate:X J Ren

Full Text:PDF

GTID:2518306488951069

Subject:Computer technology

Abstract/Summary:

In the era of big data, mining data and putting it to full use have become core elements of enterprise competitiveness. At present, web crawlers are the main method of web data acquisition. For large-scale web data acquisition, however, traditional web crawlers suffer from low crawling efficiency, difficulty of operation, and security problems. This thesis therefore constructs a distributed web crawler data acquisition method based on the Scrapy-redis framework. It proposes an optimized Bloom filter de-duplication algorithm, uses Docker containers for project management, and builds an experimental platform for efficient web data acquisition in the agricultural field. The specific research results are as follows:

(1) An efficient acquisition framework for web data is built. The Scrapy-redis distributed crawler framework is improved to support multi-node parallel crawling. The method uses YAML files to make the crawlers configurable, uses GNE to extract the main body text of web pages and improve data accuracy, and uses Docker containers and the Rancher platform to manage the distributed crawlers.

(2) A de-duplication method for massive URL sets is proposed. To address the high space consumption of traditional disk-based or memory-based URL de-duplication methods, a Bloom filter is used: multiple hash functions map each element of the URL set into a bit array, which reduces memory usage and thereby greatly improves the efficiency of duplicate checks.

(3) An experimental system for efficient web data acquisition is developed. Targeting the agricultural field, the system integrates Ajax, the Flask framework, HTML, CSS, JavaScript and other technologies into a verification system for efficient web data acquisition. Users enter search keywords to match the URLs of the corresponding resources, use the task management console to start or stop crawler processes, obtain the corresponding data, and display and save it.
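The abstract does not describe how its optimized Bloom filter works, so the following is only a minimal sketch of a plain, in-memory Bloom filter for URL de-duplication, not the thesis's algorithm. The class name `BloomFilter`, the helper `is_new_url`, the sizing parameters, and the use of SHA-256 with double hashing are all assumptions made for illustration. In a Scrapy-redis deployment the bit array would more plausibly be shared through Redis so that every crawler node sees the same state.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative in-memory Bloom filter (hypothetical, not the thesis's code)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.001):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        # Set every bit position that the URL hashes to.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


def is_new_url(seen: BloomFilter, url: str) -> bool:
    """Return True and record the URL if it was not seen before, else False."""
    if url in seen:
        return False
    seen.add(url)
    return True


if __name__ == "__main__":
    seen = BloomFilter(expected_items=1000, false_positive_rate=0.001)
    for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
        print(url, "new" if is_new_url(seen, url) else "duplicate")
```

With these sizing formulas, a 0.1% false-positive target costs roughly 14.4 bits per URL regardless of URL length, which is where the memory saving over storing full URLs or their fingerprints comes from. The trade-off is that a Bloom filter can report a small fraction of unseen URLs as duplicates, so a crawler using it may skip a few pages.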

Keywords/Search Tags:

Data collection, distributed web crawler, Scrapy-redis framework, Bloom filter

Related items

1	Design And Implementation Of Distributed Web Crawler System Based On Scrapy
2	Design And Development Of Distributed Crawler Based On Scrapy Framework
3	Design And Implementation Of Distributed Crawler Project Based On Biomedical Literature Data
4	Design And Implementation Of A Distributed Crawler System Based On Scrapy Framework
5	Analysis Of Dangdang Information Based On Scrapy Framework Crawler And Data Mining
6	Design And Implementation Of Web Crawler System Based On Scrapy Framework
7	Scrapy Framework-based Web Crawler Achieved Data Capture And Analysis
8	Research And Implementation Of Distributed Web Crawler System
9	Design And Implementation Of Distributed Books Web Crawler System
10	Design And Implementation Of Search System Based On Scrapy-redis And GMM