The rapid development of Internet technology, cloud computing, and the Internet of Things has brought human society into the era of big data. In this context, big data credit reporting applies big data technology to the credit investigation industry, changing how data is collected, processed, and analyzed. Higher-dimensional data from different levels is also used for credit score modeling, so that the potential value of the data is continually mined. However, the use of massive data also poses challenges for big data credit reporting: (1) Difficult data aggregation: data usually comes from different institutions in different formats and is therefore multi-source and heterogeneous, while existing data synchronization tools lack generality and need improvement in real-time incremental synchronization. (2) Difficult data lineage tracing: the introduction of big data components such as Spark and Flink ties data processing closely to the computing engine, and conventional methods are not accurate enough, which makes data lineage harder to extract. (3) Poor data quality: data is recorded arbitrarily and arrives as logs, text, and other formats, so its integrity and standardization are not guaranteed.

To address these problems of difficult data aggregation, poor data quality, and difficult data lineage tracing in credit reporting, and in order to broaden the scope of credit investigation data integration, assess the quality of credit investigation data, and make better use of its value, this paper studies key data governance technologies and designs a data governance system for big data credit reporting. The main research contents are as follows:

(1) A method and tool for constructing data synchronization jobs that support both offline and real-time data are proposed and implemented. Building on a study of aggregation methods and technologies for multi-source data, this paper designs and implements a
construction method and system that supports both offline and real-time data synchronization jobs, streamlines the job configuration process, and unifies the configuration of multiple synchronization methods.

(2) A lineage analysis method for Flink SQL is proposed and implemented. To address the high coupling, high invasiveness, and poor accuracy of existing lineage analysis methods, this paper parses Flink SQL locally and then validates and rewrites the parse tree using metadata, which makes the lineage analysis low-invasive and its results accurate.

(3) A data governance system for big data credit reporting is designed and implemented. Based on a study of the concepts and technical approaches of data governance, the system integrates and synchronizes multi-source data, improves the quality of credit investigation data through data governance, and provides sound data support for analysis and research in individual or enterprise credit investigation.

The resulting data governance system provides metadata management, data synchronization, and data quality management. Verification and testing show that the system meets expectations and has good generality and scalability. It has been applied and validated in "Intelligent Evaluation and Open Platform of Big data credit reporting", part of the National Key R&D Program of China "Big Data Credit Investigation and Intelligent Evaluation Technology", and offers a useful reference for data governance in the big data credit industry.