
Design and Implementation of a Data Quality Supervision System Based on Spark

Posted on: 2021-09-20  Degree: Master  Type: Thesis
Country: China  Candidate: B Wang  Full Text: PDF
GTID: 2518306047986889  Subject: Master of Engineering
Abstract/Summary:
We live in an era of digital transformation. The rapid development of internet technology and the explosive growth of data have ushered in the era of big data, and big data now drives the digital industry. Organizations in all walks of life have therefore built their own data centers to perform data mining and analysis on top of data warehouses, but this has been accompanied by a series of data quality problems. Big data is large in volume and diverse in form, and “dirty data” is growing and diversifying with it; using such data directly for mining, analysis, or applications without processing will inevitably cause significant losses. In recent years data quality has attracted increasing attention and study, and techniques for data supervision and data cleaning have matured to some extent. However, business data volumes keep increasing, data types are diversifying, the storage of homogeneous and heterogeneous data sources is becoming more complex, and data changes as it flows. Data quality supervision that relies on manual database checks or on a single computing node can no longer meet the demand for high-quality data in the big data context. To address these problems and requirements, this thesis designs and implements a big data quality supervision system based on the Spark computing framework. The system is flexible and easy to use, offers a variety of monitoring and analysis rules, and supports visual operation and a variety of data cleaning algorithms. In terms of function, the main contents of this thesis are as follows.

(1) To expose data quality information, the system designs and implements data quality monitoring, using data exploration to monitor data from an overall overview down to the details. At the overall level, the system analyzes data fluctuation and monitors it periodically. At the detail level, it performs quality monitoring and multi-dimensional detail analysis: data sources are accessed through the query interface provided by Apache MetaModel, and the multi-dimensional statistical analysis rules are implemented with the statistical methods provided by Apache Commons Math. Finally, outliers are diagnosed comprehensively and detected automatically with a parallel outlier detection method based on the Isolation Forest algorithm.

(2) To improve data quality, the system designs and implements a general data cleaning component that covers data deduplication, null value filling, data desensitization, and standardized cleaning. Data cleaning follows the component-based development model of Data Cleaner. Each cleaning algorithm is packaged as a component that contains one or more cleaning conversion rules, and users can extend the cleaning methods according to their needs.

(3) To manage system information, the system designs and implements system management, job management, quality operation and maintenance management, and data source management. Job management provides adding, deleting, and modifying jobs, together with alarm and scheduling settings. Jobs are scheduled on a timer by the Quartz scheduling framework; when a job runs, it is submitted to Spark, which parses the job's monitoring analysis rules or cleaning conversion rules and executes them. Quality operation and maintenance management mainly provides visualization of the results of rule instances and of alarm prompts. Data source management mainly maintains data source information. System management mainly handles user information and permission settings.

The Spark-based data quality supervision platform is an application system whose core functions are quality analysis, quality monitoring, and data quality improvement. It provides a simple way to design monitoring and cleaning jobs and supports both timed and periodic job scheduling. Users can use the system to locate and analyze data quality problems, monitor data quality, and improve it by applying data cleaning to the problems found. As data flows, its quality can thus be supervised in a closed loop of detection, analysis, improvement, and monitoring, which is of great significance to data quality management.
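The component-based cleaning design described in (2) can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use the Data Cleaner or Spark APIs: plain Python lists of dictionaries stand in for a distributed dataset, and every name here (`CleaningComponent`, `deduplicate`, `fill_nulls`, `mask`, `standardize`, and the sample columns) is hypothetical, chosen only to show how one component can hold one or more conversion rules that users extend.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical types: a row is a column-name -> value mapping,
# and a rule transforms a whole list of rows.
Row = Dict[str, Optional[str]]
Rule = Callable[[List[Row]], List[Row]]


def standardize(column: str) -> Rule:
    """Trim whitespace and lower-case one column."""
    def rule(rows: List[Row]) -> List[Row]:
        out = []
        for r in rows:
            value = r.get(column)
            out.append({**r, column: value.strip().lower() if value else value})
        return out
    return rule


def deduplicate(keys: List[str]) -> Rule:
    """Keep the first row seen for each combination of key columns."""
    def rule(rows: List[Row]) -> List[Row]:
        seen = set()
        kept = []
        for r in rows:
            key = tuple(r.get(k) for k in keys)
            if key not in seen:
                seen.add(key)
                kept.append(r)
        return kept
    return rule


def fill_nulls(column: str, default: str) -> Rule:
    """Replace missing or empty values in one column with a default."""
    def rule(rows: List[Row]) -> List[Row]:
        return [{**r, column: r.get(column) or default} for r in rows]
    return rule


def mask(column: str, keep_last: int = 4) -> Rule:
    """Desensitize a column by masking all but the last few characters."""
    def rule(rows: List[Row]) -> List[Row]:
        out = []
        for r in rows:
            value = r.get(column)
            if value:
                hidden = max(len(value) - keep_last, 0)
                value = "*" * hidden + value[hidden:]
            out.append({**r, column: value})
        return out
    return rule


@dataclass
class CleaningComponent:
    """A named component holding one or more conversion rules, applied in order."""
    name: str
    rules: List[Rule] = field(default_factory=list)

    def apply(self, rows: List[Row]) -> List[Row]:
        for rule in self.rules:
            rows = rule(rows)
        return rows


# Made-up sample data, for illustration only.
rows: List[Row] = [
    {"id": "1", "email": " Alice@Example.com ", "phone": "13800001234", "city": None},
    {"id": "1", "email": "alice@example.com", "phone": "13800001234", "city": "Xi'an"},
    {"id": "2", "email": "BOB@example.com", "phone": "13900005678", "city": ""},
]

component = CleaningComponent(
    name="customer_cleaning",
    rules=[
        standardize("email"),
        deduplicate(["id"]),
        fill_nulls("city", "unknown"),
        mask("phone"),
    ],
)
cleaned = component.apply(rows)
```

Because each rule is just a function from rows to rows, adding a new cleaning method means writing one more rule and appending it to a component's `rules` list; in a Spark setting the same shape would presumably map onto dataset transformations, though that mapping is an assumption here.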
Keywords/Search Tags:quality supervision, data cleaning, quality monitoring, Data Cleaner