
Research On Big Data Governance Platform Of Multi Data Sources Based On Spark

Posted on: 2021-04-18  Degree: Master  Type: Thesis
Country: China  Candidate: J N Fan  Full Text: PDF
GTID: 2428330602474329  Subject: Engineering
Abstract/Summary:
With the growing adoption of big data technologies and applications, enterprises are paying increasing attention to the value and quality of their data. Traditional data management, which relies on data managers and data analysts, can no longer meet the needs of data governance, so using computer information systems for data governance has received great attention from enterprises. Among the available data processing technologies, Hadoop and Spark are the most popular. Because Hadoop was designed to solve the problem of batch operations in big data scenarios, its high-latency data access makes it poorly suited to today's demands for diverse computing modes and real-time processing of big data. At the current stage of data governance, data volumes are relatively large and there are many types of structured and unstructured data, so usable and valuable data is mixed in with huge amounts of other data. Existing data governance platforms, however, often support only a single data source or only structured data.

To solve these problems, this paper builds a Spark-based big data governance platform for multiple data sources. The platform imports and manages heterogeneous data sources, strictly controls data quality through operations such as data cleaning, data aggregation, and quality audit, addresses the problem of real-time data processing, and provides high-quality data to follow-up data analysis or other platforms. The Web service is developed with HTML5 and the Vue framework, which makes it compatible with multiple browsers and gives the system strong universality.

The main content of this paper covers the following three aspects. First, the paper builds a big data governance platform based on Spark and formulates corresponding data governance rules to complete data governance tasks in a specific scenario. Data governance services built on Spark can process data streams efficiently in memory and support multiple advanced algorithms. They also require no packaging, uploading to the cluster, or verification steps, which greatly improves the efficiency and speed of data processing. Second, the platform integrates management services for heterogeneous data sources to provide unified management and maintenance of massive data sources, which makes it easy for users to establish and manage data sources quickly. By standardizing the data model (metadata), users can create standard models, add model fields, and configure attribute rules. When using the original data, users do not manipulate the original database directly, which protects data security and integrity and keeps the data source platform operating normally. Finally, the platform performs data cleaning, data aggregation, and quality audit services according to actual business needs. It cleans duplicate data, low-quality data, and similar records from different data sources, and it exchanges and aggregates data from distributed, heterogeneous data sources through a graphical configuration interface. According to the audit strategy, the integrity, consistency, accuracy, reasonableness, timeliness, uniqueness, and other standards of the data are verified to ensure data quality. The resulting high-quality data is then imported into the database for use by the data analysis platform.
Keywords/Search Tags: Spark, Data Governance, Real-time, Multi-source Data
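
The cleaning and quality-audit steps described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it uses plain Python rather than Spark so that it runs without a cluster, and the record fields (`id`, `name`, `age`), the function names `clean` and `audit`, and the age range are all hypothetical choices made for illustration. It shows duplicate removal followed by three of the audit standards named above (integrity, uniqueness, reasonableness).

```python
from collections import Counter

# Hypothetical records merged from several sources; field names are illustrative.
records = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 1, "name": "Alice", "age": 30},   # exact duplicate
    {"id": 2, "name": None, "age": 41},      # missing name (integrity)
    {"id": 3, "name": "Carol", "age": -5},   # out-of-range age (reasonableness)
]

def clean(rows):
    """Drop exact duplicate records, keeping first-seen order."""
    seen = set()
    out = []
    for row in rows:
        # Field names are unique per record, so sorting never compares values.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def audit(rows, required=("id", "name", "age"), key="id", age_range=(0, 150)):
    """Count the records that violate each (assumed) audit rule."""
    report = Counter()
    key_counts = Counter(row[key] for row in rows)
    for row in rows:
        # Integrity: every required field must be present and non-null.
        if any(row.get(field) is None for field in required):
            report["integrity"] += 1
        # Uniqueness: the key value must not be shared with another record.
        if key_counts[row[key]] > 1:
            report["uniqueness"] += 1
        # Reasonableness: numeric values must fall in a plausible range.
        age = row.get("age")
        if age is not None and not (age_range[0] <= age <= age_range[1]):
            report["reasonableness"] += 1
    return dict(report)

cleaned = clean(records)
report = audit(cleaned)
print(len(cleaned), report)
```

Running the audit after cleaning means the uniqueness check only flags records that share a key but differ in content. In a Spark setting the same rules would more likely be expressed as DataFrame operations, which is an assumption about how such a platform might be built rather than something the abstract specifies.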