
Design And Implementation Of A Metadata Service Management Platform For Multi-source Heterogeneous Big Data

Posted on: 2022-02-19  Degree: Master  Type: Thesis
Country: China  Candidate: Z H Tan  Full Text: PDF
GTID: 2518306338470344  Subject: Computer Science and Technology
Abstract/Summary:
In multi-source heterogeneous big data, metadata is the data that describes all information involved in integrating data from source systems into the data warehouse. Metadata contains not only basic descriptive information about the data but also a record of where the data came from and how it was transformed. Data lineage, the historical record of the sources and transformations of data, is the core difficulty and challenge in metadata service management. Because existing big data components are heterogeneous and data sources are diverse, current metadata service management still has the following problems:

1) Existing implementations of Hive data lineage analysis are tightly coupled to native Hive components and produce results of poor accuracy.
2) The heterogeneity of big data processing components makes it difficult to unify data lineage analysis across different components.
3) In multi-source heterogeneous big data scenarios, there is no unified, metadata-based management of data content or analysis results.

This thesis therefore focuses on the key issues in metadata service management for multi-source heterogeneous big data, and on the difficulties of data lineage analysis under a distributed big data architecture. The work covers three aspects:

1. An optimization method for column-level data lineage processing based on Hive is designed and implemented. By restructuring and improving the original Hive data lineage processing flow, Hive SQL data lineage can be analyzed independently, which keeps the lineage function loosely coupled to the Hive data warehouse. In combination with the metadata service, metadata information in SQL is verified and replaced, which ensures the accuracy and correctness of the lineage analysis results.

2. A unified method for analyzing and constructing big data lineage is proposed and implemented. By defining and abstracting the data processing flows of heterogeneous big data processing components and their varied data transformation schemes, the multi-source heterogeneous processing flow is represented as a directed acyclic graph, and a corresponding data lineage tracing algorithm is proposed on that basis. The processing flows of different big data processing components (such as Hive and Spark) are then handled and built into data lineage in a unified way, which addresses the lineage processing challenges raised by complex and diverse big data components.

3. A metadata service management platform for multi-source heterogeneous big data is designed and implemented. The platform provides unified management of multi-source heterogeneous metadata. In addition to basic metadata information, it supports the collection and query of data lineage, which helps users understand and analyze where data comes from and where it goes. Building on this unified metadata management, the platform supports rapid construction of data models and binds labeled data models to establish the corresponding business scenarios, which achieves unified management of data label models and data services in big data scenarios.

Finally, the platform is applied to the national key R&D project "Big data credit reference and intelligent evaluation technology", which verifies the effectiveness and practicability of the platform and method.
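To illustrate the directed-acyclic-graph view of lineage described in the abstract, the following minimal Python sketch stores column-level lineage as a DAG and traces sources and destinations with a breadth-first search. It is an illustrative assumption, not the thesis's actual algorithm: the class name `LineageGraph`, its methods, and the table and column names are all hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Column-level lineage as a DAG (hypothetical sketch).

    An edge src -> dst means column dst is derived from column src.
    """

    def __init__(self):
        self.upstream = defaultdict(set)    # dst -> direct sources
        self.downstream = defaultdict(set)  # src -> direct destinations

    def add_edge(self, src, dst):
        self.upstream[dst].add(src)
        self.downstream[src].add(dst)

    @staticmethod
    def _trace(start, neighbors):
        # Breadth-first search that collects every node reachable from start.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def sources_of(self, column):
        """All direct and indirect upstream columns."""
        return self._trace(column, self.upstream)

    def destinations_of(self, column):
        """All direct and indirect downstream columns."""
        return self._trace(column, self.downstream)


# Hypothetical edges, as they might be extracted from a Hive job and a Spark job.
graph = LineageGraph()
graph.add_edge("ods.orders.amount", "dw.order_fact.amount")
graph.add_edge("ods.orders.user_id", "dw.order_fact.user_id")
graph.add_edge("dw.order_fact.amount", "ads.user_stats.total_amount")
graph.add_edge("dw.order_fact.user_id", "ads.user_stats.user_id")

print(sorted(graph.sources_of("ads.user_stats.total_amount")))
print(sorted(graph.destinations_of("ods.orders.user_id")))
```

Because edges from different engines go into the same graph, a single traversal answers both "where did this column come from" and "what depends on this column", regardless of which component produced each edge.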
Keywords/Search Tags: big data, data lineage, metadata management, data service