Design And Implementation Of Federal Data Management System Based On Data Lake

Posted on: 2021-03-20  Degree: Master  Type: Thesis
Country: China  Candidate: S J Wang  Full Text: PDF
GTID: 2428330614971954  Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid development of big data, machine learning, 5G, and other technologies, data volumes have grown geometrically, and the sources and types of data have become more diverse. At the same time, enterprises must keep adding business lines as they grow, and the data warehouses built separately for each business line tend to become closed data centers. The resulting multi-source heterogeneous data is a major challenge for enterprises today and causes several problems. First, data is inconsistent and difficult to use: because data from multiple sources is heterogeneous, it is hard to identify valid data across sources, and data consistency cannot be guaranteed. Second, there is no sound system for measuring the value of data, which makes it difficult to assess the contribution and impact of data assets comprehensively. Third, because data warehouses are built per business line, they are difficult to interoperate, which creates data silos and makes it hard to uncover hidden data value.

This project aims to break down data silos and form a data federation through a data lake methodology, unifying the management of multi-source heterogeneous data in the enterprise. After investigating the current state of enterprise big data development in depth and gaining a full understanding of enterprise needs and business challenges, we designed and implemented a federal data management system. First, we analyze the data sources widely used within the enterprise, extract metadata from the multi-source heterogeneous data, and abstract a unified metadata model. We also verify the model in depth to ensure that it is compatible with semi-structured and unstructured data as well as other database sources, which lays the foundation for unified metadata management. We then design and implement three modules: data source management, metadata management, and a unified query system. Data source management provides access to data sources, collects metadata, and maps it to the unified metadata model; metadata management provides data usage rights management; and the unified query system implements data querying based on Spark.

The Federal Data Management System unifies the management of enterprise multi-source heterogeneous data, standardizes the process from metadata collection and metadata model mapping through metadata management to data usage, and provides a unified query view and usage mode for data, supporting joint queries across multi-source heterogeneous data. At present, the system is up and running, providing stable and reliable services for enterprise users and interfacing with several downstream business systems, such as the enterprise's internal big data computing platform and reporting platform. It meets enterprise data development needs and creates great value.
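
The mapping of source-specific metadata to a unified metadata model can be pictured with a minimal Python sketch. It is illustrative only and is not the thesis's implementation: the class names `UnifiedColumn` and `UnifiedTable`, the type table `SQL_TYPE_MAP`, the functions `from_relational` and `from_document`, and the normalized type names are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedColumn:
    """One column in the (hypothetical) unified metadata model."""
    name: str
    data_type: str          # normalized type: "string", "integer", "double", "boolean"
    nullable: bool = True


@dataclass
class UnifiedTable:
    """One table or collection, regardless of which source it came from."""
    source: str             # hypothetical data-source identifier
    name: str
    columns: list = field(default_factory=list)


# Assumed mapping from relational type names to normalized types.
SQL_TYPE_MAP = {
    "varchar": "string", "text": "string",
    "int": "integer", "bigint": "integer",
    "double": "double", "tinyint(1)": "boolean",
}


def from_relational(source, table, column_rows):
    """Map (name, sql_type, nullable) rows, as a catalog query might return them."""
    cols = [
        UnifiedColumn(name, SQL_TYPE_MAP.get(sql_type.lower(), "string"), nullable)
        for name, sql_type, nullable in column_rows
    ]
    return UnifiedTable(source, table, cols)


def from_document(source, collection, sample):
    """Infer columns from one sample document of a semi-structured source."""
    def infer(value):
        # bool is checked before int because bool is a subclass of int
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, int):
            return "integer"
        if isinstance(value, float):
            return "double"
        return "string"

    cols = [UnifiedColumn(key, infer(value)) for key, value in sample.items()]
    return UnifiedTable(source, collection, cols)
```

In this sketch, a relational table and a document collection both end up as `UnifiedTable` objects, so downstream components such as rights management or a query layer could treat them uniformly.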
Keywords/Search Tags: Data Lake, Metadata Management, Multi-source Heterogeneous Data, Spark