
Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform

Posted on: 2019-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Zhao
Full Text: PDF
GTID: 2428330563990348
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of science and information technology, data warehouses built on top of databases have emerged to meet the decision-analysis needs of organizational managers. During their construction, however, data quality problems arise that ultimately affect analysis results. To ensure data quality, data cleaning must therefore be carried out before data is loaded into the data warehouse.

This work is based on the Hebei Province science and technology research project "Standardization of Big Data Technology and Application System Development" (172110113D) and the Hebei Province "Science and Technology Innovation Big Data Public Platform" project. In view of the characteristics of science and technology innovation big data, a data cleaning framework is constructed, cleaning rules are formulated, a cleaning algorithm is researched and implemented, and experiments are performed on real data sets to ensure robust data quality. The main research work of this paper is as follows.

(1) Construct a data cleaning framework. The business requirements, development environment, and framework design principles are analyzed, and the business-process relationship between the cleaning rules and the cleaning algorithm is clarified. Given the varied forms of "dirty data" and their differing repairability, the cleaning rules required by the business are configured dynamically. The proposed cleaning algorithm then cleans the remaining "dirty data" to guarantee the quality of the data flowing into the data warehouse.

(2) Formulate data cleaning rules. To formulate cleaning rules suited to science and technology innovation big data, the basic concept of "dirty data" is redefined. Based on normal-form theory and set relations, multidimensional data tables are reduced in dimension according to the relations between tables in different data sources. Targeting the distinguishing characteristics of "dirty data" and the data business specification, cleaning rules for null values, anomalous values, and redundant data are defined, and a preliminary "dirty data" repair rule is formulated (see the first sketch following this abstract).

(3) Research and implement the data cleaning algorithm. The attribute properties of the fields in the data tables are analyzed, and similar duplicate records are defined in combination with the data dictionary. Building on the Sorted Neighborhood Method and the Multi-Pass Sorted Neighborhood method, a weight is assigned to each field as the entry point for measuring the similarity between records. The business characteristics of science and technology innovation big data are analyzed, and a judgment threshold for duplicate records is set. Records whose weighted field similarity exceeds the threshold are identified as duplicates, data quality is analyzed, and the cleaning algorithm is implemented in the system (see the second sketch following this abstract).

Finally, experiments are conducted on real data sets. The results show that the proposed cleaning rules and algorithm can effectively complete the cleaning task for science and technology innovation big data and improve data quality.
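To make the dynamically configured rules of (2) concrete, here is a minimal Python sketch of rule checks for null and anomalous values (redundant records are handled by the duplicate-detection sketch that follows). The field names, bounds, and sample record are hypothetical illustrations, not values from the thesis.

```python
# Minimal sketch of dynamically configurable cleaning rules for null
# and anomalous values. All field names and bounds are hypothetical.

from typing import Callable

# A rule maps a record (dict) to a list of problem descriptions.
Rule = Callable[[dict], list]

def null_rule(fields: list) -> Rule:
    """Flag required fields that are missing or empty."""
    def check(record: dict) -> list:
        return [f"null value in '{f}'"
                for f in fields
                if record.get(f) in (None, "", "NULL")]
    return check

def range_rule(field: str, low: float, high: float) -> Rule:
    """Flag numeric values outside a plausible business range."""
    def check(record: dict) -> list:
        v = record.get(field)
        if v is not None and not (low <= float(v) <= high):
            return [f"anomalous value {v} in '{field}'"]
        return []
    return check

# Rules live in a plain list so the business can reconfigure them
# at run time without touching the cleaning engine itself.
rules = [
    null_rule(["project_id", "title"]),
    range_rule("funding_amount", 0, 1e9),  # hypothetical bound
]

def find_dirty(records: list) -> list:
    """Return each record together with the rule violations it triggers."""
    report = []
    for rec in records:
        problems = [p for rule in rules for p in rule(rec)]
        if problems:
            report.append((rec, problems))
    return report

if __name__ == "__main__":
    sample = [{"project_id": "P001", "title": "", "funding_amount": -5}]
    for rec, problems in find_dirty(sample):
        print(rec, "->", problems)
```

Holding the rules in a plain list mirrors the dynamic configuration described in (1): rules can be added or removed per business need without modifying the engine that applies them.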
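The weighted-similarity duplicate detection of (3) can be sketched as follows, assuming illustrative field weights, window size, and judgment threshold; these parameters stand in for the values determined experimentally in the thesis.

```python
# Minimal sketch of Sorted Neighborhood duplicate detection with
# weighted field similarity. Keys, weights, WINDOW, and THRESHOLD
# are illustrative assumptions, not the thesis's tuned values.

from difflib import SequenceMatcher

FIELD_WEIGHTS = {"title": 0.5, "org": 0.3, "year": 0.2}  # sums to 1
THRESHOLD = 0.8   # pairs scoring above this are judged duplicates
WINDOW = 5        # size of the sliding comparison window

def field_sim(a, b) -> float:
    """Similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, str(a), str(b)).ratio()

def record_sim(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * field_sim(r1.get(f, ""), r2.get(f, ""))
               for f, w in FIELD_WEIGHTS.items())

def snm_pass(records: list, key) -> set:
    """One sorted-neighborhood pass: sort by a key, then compare each
    record only with its neighbors inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + WINDOW]:
            if record_sim(records[i], records[j]) >= THRESHOLD:
                pairs.add((min(i, j), max(i, j)))
    return pairs

def multi_pass_snm(records: list) -> set:
    """Union the pairs found under several sort keys, so duplicates
    missed by one ordering can still be caught by another."""
    keys = [lambda r: str(r.get("title", "")),
            lambda r: str(r.get("org", ""))]
    found = set()
    for key in keys:
        found |= snm_pass(records, key)
    return found

if __name__ == "__main__":
    data = [
        {"title": "Big data platform", "org": "Hebei Univ", "year": 2018},
        {"title": "Big data platform.", "org": "Hebei Univ.", "year": 2018},
        {"title": "Data cleaning study", "org": "Other Inst", "year": 2017},
    ]
    print(multi_pass_snm(data))  # expect {(0, 1)}
```

Each pass sorts on a different key, so near-duplicates that one ordering would place far apart can still fall inside the comparison window of another pass, which is the point of the multi-pass variant.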
Keywords/Search Tags:data cleaning, data warehouse, cleaning rules, cleaning algorithm, data quality