
Research On Methods And System Platform For Big Data Quality Detection And Repair

Posted on: 2021-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Qi
Full Text: PDF
GTID: 2428330647951055
Subject: Computer Science and Technology
Abstract/Summary:
In the current big data era, various industries have generated and accumulated massive amounts of data. According to statistics, global data reserves have exceeded 10 ZB in the past five years and have maintained annual growth of about 40%. With the help of data analysis, artificial intelligence and data mining, a great deal of latent value can be discovered from these data. However, their collection, transfer and analysis may introduce information leakage and semantic changes, producing "dirty" data that degrades data quality and seriously restricts the development of big data applications. Data quality issues are therefore gaining more and more attention, and related technologies and systems such as ETL (Extract-Transform-Load), data cleaning and data quality supervision are constantly emerging. Nevertheless, these systems still have many deficiencies in data cleaning capability and computing performance, making it difficult to cope with the complex situations of practical big data applications. First, big data is huge, and it is hard to process large-scale data with stand-alone resources alone. Second, big data comes from scattered sources in different formats, making it difficult to represent and store uniformly. Furthermore, big data is highly heterogeneous, with prominent data quality problems that call for different solutions, so defining and handling all of these issues in a unified manner is quite hard.

To address these problems, this thesis studies data cleaning and repair techniques and platforms for big data, and proposes a generalized data quality management model and framework. On this basis, it implements a distributed data quality detection and repair system, Spark DQ, which provides effective methods for handling data quality problems and allows users to perform efficient detection and repair on various "dirty" data in heterogeneous large-scale underlying data sources. The main work and contributions of this thesis are as follows:

(1) This thesis proposes a generalized big data quality management model and coding framework, and provides a series of detection and repair interfaces. With this model and these interfaces, users can quickly build custom data quality detection and repair tasks for different data quality needs.

(2) Based on the above model and coding framework, this thesis implements a complete set of parallel data quality detection and repair algorithms, including detection algorithms for the integrity, uniqueness, consistency and validity dimensions, and repair algorithms based on filling, deleting, filtering and replacing. With these algorithms, various practical data quality problems, such as missing data, rule mismatches and constraint conflicts, can be solved efficiently.

(3) To improve the efficiency of complex data quality management algorithms that take a long time to complete in big data scenarios, this thesis designs and implements corresponding parallel algorithms, including a priority-based multi-CFD (Conditional Functional Dependency) detection and repair algorithm, an entity detection and extraction algorithm based on semantic information and blocking, and a missing-value filling algorithm based on the Naïve Bayes model.

(4) To further improve the performance of detection and repair tasks, this thesis proposes multi-task execution scheduling optimization and data state cache optimization tailored to the underlying mechanisms of the different algorithms. By comprehensively considering the computational characteristics of, and relationships between, tasks, the overall running efficiency of multiple detection and repair tasks can be optimized.

(5) On the basis of the above key technologies, this thesis designs and implements the unified big data quality detection and repair system Spark DQ, together with auxiliary data profiling and constraint suggestion functions.

The experimental results show that the parallel algorithms achieve a 4x to 12x speedup over their stand-alone versions, and the automatic scheduling optimization improves task performance by 9% to 56%. In addition, the system exhibits nearly linear scalability.
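To make the multi-CFD detection idea in contribution (3) concrete, the following is a minimal single-machine Python sketch, not the thesis's parallel, priority-based Spark implementation. A CFD pattern fixes some attributes to constants and requires that, among the matching rows, the left-hand-side attribute functionally determines the right-hand side; all record and attribute names below are illustrative.

```python
def find_cfd_violations(rows, pattern, lhs, rhs):
    """Return LHS values whose matching rows map to more than one RHS value.

    rows:    list of dicts (one dict per record)
    pattern: dict of attribute -> required constant (the CFD's condition part)
    lhs/rhs: attribute names of the embedded functional dependency
    """
    groups = {}
    for r in rows:
        # Only rows matching the constant pattern are constrained by this CFD.
        if all(r.get(k) == v for k, v in pattern.items()):
            groups.setdefault(r[lhs], set()).add(r[rhs])
    # A violation is an LHS value associated with two or more RHS values.
    return {k: v for k, v in groups.items() if len(v) > 1}


rows = [
    {"country": "US", "zip": "10001", "state": "NY"},
    {"country": "US", "zip": "10001", "state": "NJ"},   # conflicts with NY
    {"country": "US", "zip": "90210", "state": "CA"},
    {"country": "UK", "zip": "10001", "state": "LDN"},  # pattern does not match
]
# CFD: for country = 'US', zip -> state
violations = find_cfd_violations(rows, {"country": "US"}, "zip", "state")
```

Here `violations` maps the conflicting zip code to its contradictory states; a repair step would then pick one value (e.g. by priority or frequency) and rewrite the others.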
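The Naïve Bayes missing-value filling mentioned in contribution (3) can be sketched in a few lines of plain Python: train a categorical Naïve Bayes classifier on the rows where the target attribute is present, then predict it for the rows where it is missing. This is a single-machine illustration with invented column names, not the thesis's distributed algorithm.

```python
import math
from collections import Counter, defaultdict

def fill_missing_nb(rows, target, features):
    """Fill None values of `target` using Naive Bayes over `features`."""
    complete = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in complete)
    cond = defaultdict(Counter)  # (feature, class) -> observed value counts
    for r in complete:
        for f in features:
            cond[(f, r[target])][r[f]] += 1

    def log_score(r, c):
        # Log-prior plus log-likelihoods with add-one (Laplace) smoothing.
        s = math.log(prior[c] / len(complete))
        for f in features:
            cnt = cond[(f, c)]
            s += math.log((cnt[r[f]] + 1) / (sum(cnt.values()) + len(cnt) + 1))
        return s

    for r in rows:
        if r[target] is None:
            r[target] = max(prior, key=lambda c: log_score(r, c))
    return rows


rows = [
    {"dept": "eng",   "office": "A", "city": "BJ"},
    {"dept": "eng",   "office": "A", "city": "BJ"},
    {"dept": "sales", "office": "B", "city": "SH"},
    {"dept": "sales", "office": "B", "city": "SH"},
    {"dept": "eng",   "office": "A", "city": None},  # to be filled
]
filled = fill_missing_nb(rows, "city", ["dept", "office"])
```

The last row's `dept` and `office` match the "BJ" rows exactly, so the classifier fills in "BJ".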
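The blocking technique used by the entity detection algorithm in contribution (3) can likewise be illustrated with a toy sketch: instead of comparing every pair of records (O(n^2)), records are first grouped by a cheap blocking key, and only records sharing a block become candidate duplicate pairs. The key function below (first three alphanumeric characters, lowercased) is an illustrative choice, not the thesis's semantic blocking scheme.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, block_key):
    """Group records by a blocking key and emit candidate pairs per block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    pairs = []
    for members in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(members, 2))
    return pairs


records = ["Apple Inc.", "apple inc", "Banana Co", "APPLE INCORPORATED"]
key = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())[:3]
pairs = candidate_pairs(records, key)
```

The three "Apple" variants fall into one block, yielding 3 candidate pairs instead of the 6 exhaustive ones; a similarity function would then decide which candidates are true duplicates.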
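As a rough intuition for the cache-aware multi-task scheduling in contribution (4), consider ordering tasks so that those reading the same dataset run back-to-back, allowing a cached copy of that dataset to be reused instead of reloaded. This toy heuristic only hints at the idea; the thesis's scheduler additionally weighs computation characteristics and inter-task relationships, and all task and dataset names here are invented.

```python
from collections import defaultdict

def group_for_cache_reuse(tasks):
    """Order (task, input_dataset) pairs so tasks sharing an input are adjacent."""
    by_input = defaultdict(list)
    for task, dataset in tasks:
        by_input[dataset].append(task)
    schedule = []
    for dataset, ts in by_input.items():
        schedule.extend(ts)  # consecutive tasks can reuse the cached dataset
    return schedule


tasks = [
    ("null_check",   "orders"),
    ("dedup",        "users"),
    ("cfd_check",    "orders"),
    ("format_check", "users"),
]
schedule = group_for_cache_reuse(tasks)
```

Both "orders" tasks now run consecutively, so the dataset is loaded (and cached) once rather than twice; the same holds for "users".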
Keywords/Search Tags: Data quality management, Data governance, Distributed system, Performance optimization, Automatic scheduling