
Research on Data Cleaning Technology and the Design and Implementation of a Data Cleaning Framework

Posted on: 2017-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2308330485461313
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In the information era, the importance of data to every industry is self-evident. Data quality problems become serious when data is integrated, even from a single data source and especially from multiple data sources, because the sources are so varied and each defines its data model independently. These problems often lead to incorrect decisions and can negate the potential benefits of information-driven approaches, so the quality of real-world data has become a major concern. As an effective tool for improving data quality, data cleaning technology has been extensively researched and developed.

The main work of this thesis consists of three parts.

The first part reviews the current state of data cleaning research in China and abroad and discusses the concepts of data quality. It analyzes the classification of data quality problems and the factors that degrade data quality, and it describes the concepts, principles, and general process of data cleaning.

The second part focuses on several algorithms for cleaning duplicate data and proposes a data cleaning method suited to this design. It then develops the cleaning processes according to the requirements of the project and the characteristics of the existing data.

In the third part, a data cleaning framework is designed and implemented, and its performance is evaluated on test data.

The design and implementation of the data cleaning framework is the focus of this work. The framework contains multiple cleaning processes. It provides preliminary control of data quality by checking basic formats and check values before data is loaded into the resource pool. It uses a multilevel database access mode to separate "dirty data" from the decision-making pool, which ensures that all data in the resource pool has already passed through the cleaning process. The framework is built with the object-oriented Java language and an Oracle database, which improves the cross-platform capability of the system. To make later development and maintenance more convenient, the framework adopts a modular design. The thesis closes with a summary of the research work and an outlook on future research directions.
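As an illustration only, the idea of checking basic formats before loading and keeping "dirty data" apart from the resource pool might be sketched in Java as follows. The abstract does not give the framework's classes, rules, or database schema, so every name here (CleaningGateSketch, Row, Result, the two format patterns) is a hypothetical stand-in, and in-memory lists take the place of the Oracle staging and pool tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: not the thesis's actual framework code.
final class CleaningGateSketch {

    /** Assumed record shape: an identifier and a date string. */
    static final class Row {
        final String id;
        final String date;
        Row(String id, String date) { this.id = id; this.date = date; }
    }

    /** Outcome of one cleaning pass. */
    static final class Result {
        final List<Row> pool = new ArrayList<>();      // rows admitted to the resource pool
        final List<Row> rejected = new ArrayList<>();  // dirty rows held back from the pool
    }

    // Assumed format rules, chosen only for this example.
    private static final Pattern ID_FORMAT = Pattern.compile("[A-Z]{2}\\d{4}");
    private static final Pattern DATE_FORMAT = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    /** Basic format check: both fields must be present and match their pattern. */
    static boolean passesFormatCheck(Row row) {
        return row.id != null && ID_FORMAT.matcher(row.id).matches()
            && row.date != null && DATE_FORMAT.matcher(row.date).matches();
    }

    /** Routes each staged row either to the pool or to the rejected list. */
    static Result clean(List<Row> staging) {
        Result result = new Result();
        for (Row row : staging) {
            if (passesFormatCheck(row)) {
                result.pool.add(row);
            } else {
                result.rejected.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> staging = new ArrayList<>();
        staging.add(new Row("AB1234", "2016-03-01")); // well-formed
        staging.add(new Row("ab12", "2016/03/01"));   // malformed id and date
        Result r = clean(staging);
        System.out.println("pool=" + r.pool.size() + " rejected=" + r.rejected.size());
        // prints "pool=1 rejected=1"
    }
}
```

Because rows reach the pool only through clean, nothing enters it without passing the check, which mirrors the separation the abstract describes. A real implementation would presumably read from and write to separate database levels instead of lists.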
Keywords/Search Tags: Data Cleaning, Data Quality, Data Cleaning Framework, Data Cleaning Algorithm