Design and Implementation of a Distributed Data Acquisition Platform for We-Media

Posted on: 2022-07-10  Degree: Master  Type: Thesis
Country: China  Candidate: M Z Ma  Full Text: PDF
GTID: 2518306353467934  Subject: Master of Engineering
Abstract/Summary:
The way people obtain information has changed with the rapid development of Internet technology, and major Internet content platforms have become an important channel for it. As a result, we-media users in China produce a large volume of high-quality content, articles, and videos. This thesis takes the articles and data published by we-media accounts on major Internet content platforms as its collection target and designs a distributed data collection system. The system calculates task delivery times from the publishing habits and patterns of we-media accounts, integrates risk-control configuration to cope with anti-crawler strategies, and adopts more systematic collection methods. The goal is a distributed data acquisition platform that is agile, efficient, and robust, improving both the efficiency and the content richness of data acquisition. The system adopts a microservice architecture. Based on distributed system theory, the platform is divided into four modules: data acquisition, task calculation, task scheduling, and data processing. In terms of technology, the Spring Boot framework is used to build the back end; the data acquisition module is based on WebMagic, with Redis as the task queue to implement the distributed crawler; and XXL-JOB is integrated to schedule and manage tasks. For data processing, MongoDB and Elasticsearch serve as the primary data store and search engine. Kafka is used for data distribution to achieve peak shaving and asynchronous decoupling. All services are registered and discovered through ZooKeeper, and they communicate over RPC by integrating Dubbo.
Keywords/Search Tags: data acquisition, we-media, web crawler, distributed system
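
The abstract states that task delivery time is calculated from the publishing habits of each account, but it does not give the rule. As a rough illustration only, the sketch below assumes one simple heuristic: find the hour of day in which an account has published most often, and schedule the next crawl for the end of the next occurrence of that hour. The class name `DeliveryTimeSketch`, both method names, and the six-hour fallback for accounts with no history are hypothetical and are not taken from the thesis.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical sketch: NOT the thesis's actual task calculation module.
final class DeliveryTimeSketch {

    // Assumed fallback interval for accounts with no publishing history.
    static final Duration FALLBACK = Duration.ofHours(6);

    /** Hour of day (0-23) with the most posts; ties go to the earliest hour. */
    public static int mostFrequentHour(List<LocalDateTime> publishTimes) {
        int[] counts = new int[24];
        for (LocalDateTime t : publishTimes) {
            counts[t.getHour()]++;
        }
        int best = 0;
        for (int h = 1; h < 24; h++) {
            if (counts[h] > counts[best]) {
                best = h;
            }
        }
        return best;
    }

    /** Next crawl time: the end of the next peak hour that lies strictly after {@code now}. */
    public static LocalDateTime nextDeliveryTime(List<LocalDateTime> publishTimes,
                                                 LocalDateTime now) {
        if (publishTimes.isEmpty()) {
            return now.plus(FALLBACK);
        }
        int hour = mostFrequentHour(publishTimes);
        // Crawl once the peak hour has ended, so posts from that hour are already live.
        LocalDateTime candidate = now.toLocalDate().atTime(hour, 0).plusHours(1);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        List<LocalDateTime> history = List.of(
                LocalDateTime.parse("2022-07-01T08:15:00"),
                LocalDateTime.parse("2022-07-02T08:40:00"),
                LocalDateTime.parse("2022-07-03T20:05:00"));
        LocalDateTime now = LocalDateTime.parse("2022-07-10T07:00:00");
        System.out.println(nextDeliveryTime(history, now));
    }
}
```

In a setup like the one described, a value of this kind could be handed to the scheduler as the trigger time for that account's crawl task. A production rule would likely also weigh the day of week, posting frequency, and risk-control limits.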