Study on the Data Lake Model of Science: From the Perspective of Data Governance

Posted on: 2022-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Z Ruan
Full Text: PDF
GTID: 2518306722471824
Subject: Master of Library and Information
Abstract/Summary:
Due to the explosive growth of big data in recent years, including scientific research data, large data organizations are seeking new data storage architectures and scalable storage platforms to respond effectively to new data management challenges. The main challenge is to ensure the availability of data that comes from various sources in multiple formats. Changes in data scale have led to the emergence of new data analysis and management systems. In particular, the data lake, as a general-purpose data storage environment, can store almost any type of data, and it allows analysts and scientists to apply the most appropriate analysis engine and tools to each original data set. This thesis explores how the data storage architecture known as the "data lake" can be combined with scientific data management to improve the service quality of scientific data repositories. First, it introduces the limitations of traditional data warehouses in dealing with the latest changes in the data paradigm. It then evaluates the capability boundaries of current mainstream scientific data storage platforms, together with the data life cycle defined by data storage organizations, to determine whether current scientific data storage services cover all requirements. It also discusses and compares the open source and commercial platforms that can be used to develop data lakes, as well as the latest research progress on the various functional layers of a data lake. Finally, with reference to real needs in scientific data services, and from the perspectives of the data life cycle, data discovery and acquisition, and data processing and analysis, it attempts to build a data lake prototype for scientific data storage. The prototype is based on the Hadoop data platform: the design and implementation of the scientific data lake are described on top of its distributed file system, the Elasticsearch retrieval tool, and the Spark data processing tool. A data lake development example containing sample data is then implemented with this software platform and these tools, covering data flow extraction, display, and multi-layer flow analysis. This research can serve as a reference for scientific data repositories that plan to implement data lake solutions for specific cases.
Keywords/Search Tags: Data Lake, Science Data, Data Governance