
Research On Some Key Technologies And Software Platform Of Data Cleaning

Posted on: 2006-05-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Chen
GTID: 1118360152989395
Subject: Aviation Aerospace Manufacturing Engineering

Abstract/Summary:
Information technologies are gaining more and more popularity in China, and information projects such as ERP (Enterprise Resource Planning) systems, e-government systems, medical insurance information systems, and others have been implemented in many areas. Enterprises have accumulated a large amount of data in this process, and these data are of great value to them. However, data quality is degraded by many factors, such as user input errors, enterprise mergers, and changes in the business environment. To make information systems more useful, their data quality must be improved, so the study of data cleaning has both theoretical and practical value. This dissertation studies data cleaning; its main contributions are as follows.

The importance and necessity of the present study are presented, the state of the art in data cleaning research is analyzed, and the open problems in that research are identified.

Guided by three key factors of data quality, the key technologies for cleaning a single data source are studied: cleaning approximately duplicated records, cleaning incomplete data, and cleaning incorrect data. To clean approximately duplicated records, a comprehensive cleaning method is given, with two improvements that raise the precision and efficiency of duplicate detection. To improve precision, each field of a record is assigned a proper weight using a rank-based weighting method during duplicate detection. To improve efficiency, a length-filtering method is applied during detection, so that edit-distance computations that cannot yield a match are avoided.
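The dissertation does not reproduce its implementation here, but the length-filtering idea rests on a simple bound: the edit distance between two strings is at least the difference of their lengths. A minimal Python sketch (function names are illustrative, not the author's) might look like this:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein distance, O(m*n)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def similar(a: str, b: str, threshold: int = 2) -> bool:
    """Length filtering: if the length difference already exceeds the
    threshold, the edit distance must too, so the expensive O(m*n)
    computation can be skipped entirely."""
    if abs(len(a) - len(b)) > threshold:
        return False
    return edit_distance(a, b) <= threshold
```

In a duplicate-detection pass over sorted neighborhoods of records, this filter rejects most non-matching field pairs in O(1), which is the efficiency gain the abstract describes.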
Furthermore, an experimental environment is set up, and extensive experiments validate the efficiency and rationality of the length-filtering method.

To clean incomplete data, an efficient method is proposed: it first detects whether each record in the data set is useful, deletes useless records, and then handles the missing values in the useful records with an appropriate strategy. To clean incorrect data, the use of outlier detection and business rules to detect incorrect data is studied; combining several detection methods improves the overall cleaning effect.

Building on the single-source results, data cleaning across multiple data sources is then studied. Methods of data standardization are examined, and a method for cleaning approximately duplicated entities, based on the earlier work on duplicated records, is given to resolve entity-level duplication across multiple sources. An interactive data migration method is also proposed; it couples data migration tightly with data cleaning, so that data are migrated from the original system to the new system flexibly and correctly while the data quality of the new system after migration is ensured.

Finally, given the importance of the semi-structured XML format, an efficient method for cleaning approximately duplicated XML data is proposed. A duplicate-detection algorithm based on tree edit distance is given and is optimized using lower and upper bounds on the tree edit distance, so that approximately duplicated data can be detected efficiently.
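The abstract does not spell out which bounds the XML algorithm uses, but one well-known cheap lower bound on tree edit distance is the difference in node counts: matching a pair whose sizes differ by more than the threshold is impossible, so the full algorithm need not run. A hedged sketch of that filtering step (names and structure are assumptions, not the dissertation's code):

```python
import xml.etree.ElementTree as ET

def node_count(elem) -> int:
    """Number of element nodes in the subtree rooted at elem."""
    return 1 + sum(node_count(child) for child in elem)

def maybe_duplicates(xml_a: str, xml_b: str, threshold: int) -> bool:
    """Lower-bound filter: |size(T1) - size(T2)| <= tree edit distance,
    so a node-count difference above the threshold rules the pair out
    without computing the (expensive) exact tree edit distance."""
    tree_a = ET.fromstring(xml_a)
    tree_b = ET.fromstring(xml_b)
    return abs(node_count(tree_a) - node_count(tree_b)) <= threshold
```

Pairs that survive this filter would then be passed to the exact tree-edit-distance computation; pairs it rejects are guaranteed non-matches.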
These results lay a foundation for further research on cleaning approximately duplicated XML data. Finally, an extensible data cleaning software platform, with an open rules library and an open algorithms library, is proposed as a basis for developing data cleaning tools. R...
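The abstract gives no design detail for the platform, but an "open rules library" is commonly realized as a registry that new cleaning rules can join without modifying the core engine. A minimal, purely illustrative sketch of that pattern (all names are hypothetical):

```python
# Open rules library: a registry mapping rule names to cleaning functions.
# New rules can be added by third parties without touching the core code.
RULES = {}

def rule(name):
    """Decorator that registers a cleaning rule under the given name."""
    def register(fn):
        RULES[name] = fn
        return fn
    return register

@rule("strip_whitespace")
def strip_whitespace(value):
    return value.strip() if isinstance(value, str) else value

@rule("empty_to_none")
def empty_to_none(value):
    return None if value == "" else value

def clean_record(record: dict, rule_names: list) -> dict:
    """Apply the named rules, in order, to every field of a record."""
    for name in rule_names:
        record = {key: RULES[name](val) for key, val in record.items()}
    return record
```

A usage example: `clean_record({"name": "  Wang "}, ["strip_whitespace", "empty_to_none"])` normalizes each field through the selected rules, and a deployment could load additional rules from a plug-in directory, which is what makes the platform extensible.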
Keywords/Search Tags: information system, data quality, data cleaning, rules library, algorithms library, software platform