| As the digital transformation of universities deepens, universities across China are accelerating the construction of their own data centres, in which the one-card platform, the teaching system, the scientific research management system, and other systems generate large amounts of data every day. Because departments historically used different data storage systems and data definition standards, the unified management and use of this data face serious obstacles, so it is necessary to build a unified data centre and carry out data governance. Data governance is now emerging in universities across the country, and metadata management, as an important part of data governance, is involved in the entire lifecycle of data activities. Metadata management is the key to fully aggregating data assets and deeply sharing data in universities. Taking traditional metadata management systems as its background and starting from the metadata management requirements of universities, this thesis designs and implements a metadata management system based on big data technology, namely a Spark-based metadata management system. The thesis describes the theoretical basis, the system design and implementation, and the testing in detail. The main work of the thesis is as follows. 1. Metadata analysis: the system uses Spark SQL, a core component of Spark, to operate on the database, parses the Spark SQL logical plan, and decomposes it to obtain metadata lineage. By modifying the Spark SQL on Hive module, the thesis addresses the difficulty Spark SQL has in analyzing field-level metadata and improves Spark's ability to analyze it. 2. Metadata quality: the system checks metadata quality in terms of fill rate, consistency, uniqueness, validity, and completeness, and flags irregular data. The system runs these quality rules to generate the corresponding metadata quality reports and supports exporting them, so that data analysts have a clear overview of the metadata quality of the system. 3. Metadata management service: the thesis designs and implements a Spark-based metadata management service that ensures the consistency of stored data through the HDFS dual hot-standby mechanism, schedules tasks across cluster nodes through YARN, processes cluster computation requests through Spark, and manages the data warehouse through Hive. A web interface developed with HTML and Vue.js provides functional interaction, enabling data administrators and data analysts to manage metadata easily. Functional testing and analysis show that the system meets the basic functional requirements of metadata management and that metadata quality can be monitored throughout. The Spark-based metadata management system can support subsequent data analysis and big data governance activities, including data quality monitoring, master data management, and data asset management, and can thereby further accelerate big data governance in universities. |
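
The quality rules mentioned in the abstract can be illustrated with a minimal sketch in plain Python. It is not the thesis's implementation, which runs on Spark; the record layout, the field names (`table`, `column`, `type`, `owner`), the sample data, and the type-name pattern are all assumptions made for illustration. It covers three of the named dimensions: fill rate, uniqueness, and validity.

```python
import re
from collections import Counter

# Hypothetical metadata records describing table columns (not from the thesis).
RECORDS = [
    {"table": "student", "column": "id", "type": "bigint", "owner": "registry"},
    {"table": "student", "column": "name", "type": "string", "owner": ""},
    {"table": "student", "column": "id", "type": "bigint", "owner": "registry"},
    {"table": "course", "column": "credit", "type": "number?", "owner": "teaching"},
]

# Assumed validity rule: a type name consists of lowercase letters only.
VALID_TYPE = re.compile(r"^[a-z]+$")


def fill_rate(records, field):
    """Fraction of records whose field is present and non-empty."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_keys(records, key_fields):
    """Key tuples (built from key_fields) that occur more than once."""
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    return sorted(k for k, n in counts.items() if n > 1)


def invalid_values(records, field, pattern):
    """Values of field that do not match the given pattern."""
    return [r.get(field) for r in records
            if not pattern.match(str(r.get(field, "")))]


def quality_report(records):
    """Run the three rules and collect the results in one report dict."""
    return {
        "owner_fill_rate": fill_rate(records, "owner"),
        "duplicate_columns": duplicate_keys(records, ("table", "column")),
        "invalid_types": invalid_values(records, "type", VALID_TYPE),
    }


if __name__ == "__main__":
    for rule, result in quality_report(RECORDS).items():
        print(rule, result)
```

In a Spark deployment the same rules would more likely be expressed as aggregations over a DataFrame of metadata records; the plain-Python form is used here only to keep the sketch self-contained.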