Design And Implementation Of Big Data Integrated Storage And Governance System For Multi Scenarios

Posted on: 2021-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2518306308969759
Subject: Computer technology
Abstract/Summary:
In the science and technology service industry, building a big data service platform for science and technology consultation requires integrating and managing multi-scenario data that spans applications (enterprise/industry, patent/literature, economy/information, etc.) and comes from different sources (web crawlers, databases, documents, etc.). Building such a platform raises the following problems: (1) The platform needs to crawl open web application data in multiple vertical fields, but current mainstream crawler frameworks require repeated coding and are inconvenient to manage. (2) For the integration of multi-scenario data, existing data integration tools are not general enough to unify the data integration process, and their real-time incremental synchronization and data integrity need to be improved. (3) Because data sources are diverse, uneven in quality, and heterogeneous in network, equipment, storage and other aspects, it is challenging to clarify the meaning of the data and improve its quality, which hinders the process of turning data into assets.

To solve these problems in the construction of a big data platform for science and technology consultation, this thesis addresses the integrated storage and governance of big data for multiple scenarios as follows: (1) To meet the need for customizable crawlers across multiple web applications, a customizable distributed web crawler subsystem is designed and implemented based on Kafka Connect and WebMagic. (2) To meet the requirement of integrating data from multiple scenarios (web crawler, database, file) under big data, a unified multi-scenario data integration subsystem is designed. (3) To meet the requirement of uniform governance for data sources that are heterogeneous in network, equipment and storage, a uniform data governance subsystem is designed. It provides uniform access to heterogeneous data sources and uniform metadata acquisition, synchronization and management, and it performs data cleaning and data fusion on a Hive-based batch processing system. In addition, the subsystem implements classification label management based on a graph database, which associates the cleaned data with labels.

Experimental verification shows that the crawler subsystem is customizable and easy to manage: it supports defining and managing crawler tasks for the data of different web applications without coding. The data integration subsystem optimizes the data integration process and offers good generality, support for incremental synchronization, and data integrity. The data governance subsystem helps clarify business meaning and improve data quality, and it promotes the conversion of data into assets. The system implemented in this thesis has good generality and extensibility, and can serve as a reference for building big data platforms in multi-data-source scenarios.
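The abstract names incremental synchronization and data integrity as goals of the data integration subsystem but does not say how they are achieved. As an illustration only, the following minimal Python sketch shows one common approach: a watermark that records the highest source version already copied, combined with idempotent upserts so that a replayed record does not corrupt the target. The names `Record`, `Target`, `sync_increment` and the integer `version` field are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    key: str      # primary key in the source (hypothetical schema)
    payload: str  # record contents
    version: int  # change marker, assumed unique and strictly increasing


@dataclass
class Target:
    rows: Dict[str, Record] = field(default_factory=dict)
    watermark: int = 0  # highest source version already applied

    def upsert(self, rec: Record) -> None:
        # Idempotent write: replaying a record never overwrites newer data.
        current = self.rows.get(rec.key)
        if current is None or rec.version >= current.version:
            self.rows[rec.key] = rec


def sync_increment(source: List[Record], target: Target) -> int:
    """Apply only records newer than the watermark; return how many were applied."""
    pending = sorted(
        (r for r in source if r.version > target.watermark),
        key=lambda r: r.version,
    )
    for rec in pending:
        target.upsert(rec)
        # Advance the watermark only after the write succeeds.
        target.watermark = rec.version
    return len(pending)


# Usage: the first run copies everything, the second only the changed record.
source = [Record("a", "v1", 1), Record("b", "v1", 2)]
target = Target()
first = sync_increment(source, target)
source.append(Record("a", "v2", 3))
second = sync_increment(source, target)
```

Advancing the watermark after each successful write means an interrupted run resumes from the last applied record, and the version check in `upsert` keeps a repeated run from changing the result.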
Keywords/Search Tags: data integration, distributed web crawler, Kafka, data governance, metadata management