
Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform

Posted on: 2019-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Zhao
Full Text: PDF
GTID: 2428330563990348
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of science and information technology, data warehouses built on top of databases have emerged to meet the decision-analysis needs of organizational managers. During their construction, however, data quality problems arise that ultimately affect analysis results. To ensure data quality, data cleaning must therefore be carried out before data is loaded into the data warehouse.

This work is based on the Hebei Province science and technology research project "Standardization of Big Data Technology and Application System Development" (172110113D) and the Hebei Province "Science and Technology Innovation Big Data Public Platform" project. In view of the characteristics of science and technology innovation big data, a data cleaning framework is constructed, cleaning rules are formulated, a cleaning algorithm is researched and implemented, and experiments are performed on real data sets to ensure robust data quality. The main research work of this paper is as follows.

(1) Construct a data cleaning framework. The business requirements, development environment, and framework design principles are analyzed, and the business-process relationship between the cleaning rules and the cleaning algorithm is clarified. Given the varied forms of "dirty data" and their differing repairability, the cleaning rules required by the business are configured dynamically. The proposed cleaning algorithm then cleans the remaining "dirty data" to guarantee the quality of the data flowing into the data warehouse.

(2) Formulate data cleaning rules. To formulate cleaning rules suited to science and technology innovation big data, the basic concept of "dirty data" is redefined. Based on normal-form theory and set relations, multidimensional data tables are reduced in dimension according to the relations between tables in different data sources. Targeting the distinguishing characteristics of "dirty data" and the data business specification, cleaning rules for null values, anomalous values, and redundant data are defined, and a preliminary "dirty data" repair rule is formulated (see the first sketch following this abstract).

(3) Research and implement the data cleaning algorithm. The attribute properties of the fields in the data tables are analyzed, and similar duplicate records are defined in combination with the data dictionary. Building on the Sorted Neighborhood Method and the Multi-Pass Sorted Neighborhood method, a weight is assigned to each field as the entry point for measuring the similarity between records. The business characteristics of science and technology innovation big data are analyzed, and a judgment threshold for duplicate records is set. Records whose weighted field similarity exceeds the threshold are identified as duplicates, data quality is analyzed, and the cleaning algorithm is implemented in the system (see the second sketch following this abstract).

Finally, experiments are conducted on real data sets. The results show that the proposed cleaning rules and algorithm can effectively complete the cleaning task for science and technology innovation big data and improve data quality.
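To make the dynamically configured rules of (2) concrete, here is a minimal Python sketch of rule checks for null and anomalous values (redundant records are handled by the duplicate-detection sketch that follows). The field names, bounds, and sample record are hypothetical illustrations, not values from the thesis.

```python
# Minimal sketch of dynamically configurable cleaning rules for null
# and anomalous values. All field names and bounds are hypothetical.

from typing import Callable

# A rule maps a record (dict) to a list of problem descriptions.
Rule = Callable[[dict], list]

def null_rule(fields: list) -> Rule:
    """Flag required fields that are missing or empty."""
    def check(record: dict) -> list:
        return [f"null value in '{f}'"
                for f in fields
                if record.get(f) in (None, "", "NULL")]
    return check

def range_rule(field: str, low: float, high: float) -> Rule:
    """Flag numeric values outside a plausible business range."""
    def check(record: dict) -> list:
        v = record.get(field)
        if v is not None and not (low <= float(v) <= high):
            return [f"anomalous value {v} in '{field}'"]
        return []
    return check

# Rules live in a plain list so the business can reconfigure them
# at run time without touching the cleaning engine itself.
rules = [
    null_rule(["project_id", "title"]),
    range_rule("funding_amount", 0, 1e9),  # hypothetical bound
]

def find_dirty(records: list) -> list:
    """Return each record together with the rule violations it triggers."""
    report = []
    for rec in records:
        problems = [p for rule in rules for p in rule(rec)]
        if problems:
            report.append((rec, problems))
    return report

if __name__ == "__main__":
    sample = [{"project_id": "P001", "title": "", "funding_amount": -5}]
    for rec, problems in find_dirty(sample):
        print(rec, "->", problems)
```

Holding the rules in a plain list mirrors the dynamic configuration described in (1): rules can be added or removed per business need without modifying the engine that applies them.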
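The weighted-similarity duplicate detection of (3) can be sketched as follows, assuming illustrative field weights, window size, and judgment threshold; these parameters stand in for the values determined experimentally in the thesis.

```python
# Minimal sketch of Sorted Neighborhood duplicate detection with
# weighted field similarity. Keys, weights, WINDOW, and THRESHOLD
# are illustrative assumptions, not the thesis's tuned values.

from difflib import SequenceMatcher

FIELD_WEIGHTS = {"title": 0.5, "org": 0.3, "year": 0.2}  # sums to 1
THRESHOLD = 0.8   # pairs scoring above this are judged duplicates
WINDOW = 5        # size of the sliding comparison window

def field_sim(a, b) -> float:
    """Similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, str(a), str(b)).ratio()

def record_sim(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * field_sim(r1.get(f, ""), r2.get(f, ""))
               for f, w in FIELD_WEIGHTS.items())

def snm_pass(records: list, key) -> set:
    """One sorted-neighborhood pass: sort by a key, then compare each
    record only with its neighbors inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + WINDOW]:
            if record_sim(records[i], records[j]) >= THRESHOLD:
                pairs.add((min(i, j), max(i, j)))
    return pairs

def multi_pass_snm(records: list) -> set:
    """Union the pairs found under several sort keys, so duplicates
    missed by one ordering can still be caught by another."""
    keys = [lambda r: str(r.get("title", "")),
            lambda r: str(r.get("org", ""))]
    found = set()
    for key in keys:
        found |= snm_pass(records, key)
    return found

if __name__ == "__main__":
    data = [
        {"title": "Big data platform", "org": "Hebei Univ", "year": 2018},
        {"title": "Big data platform.", "org": "Hebei Univ.", "year": 2018},
        {"title": "Data cleaning study", "org": "Other Inst", "year": 2017},
    ]
    print(multi_pass_snm(data))  # expect {(0, 1)}
```

Each pass sorts on a different key, so near-duplicates that one ordering would place far apart can still fall inside the comparison window of another pass, which is the point of the multi-pass variant.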
Keywords/Search Tags:data cleaning, data warehouse, cleaning rules, cleaning algorithm, data quality