Rule-Based Interactive Data Cleaning Technique

Posted on: 2006-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Meng
Full Text: PDF
GTID: 2178360212982159
Subject: Computer application technology
Abstract/Summary:
A new rule-based interactive data cleaning architecture is proposed to address shortcomings of existing data cleaning tools: limited interaction, poor extensibility, and a lack of metadata management. The architecture makes data cleaning more efficient and helps guarantee data quality. Solutions to the key techniques in the architecture are also provided. Data cleaning rules and domain knowledge are expressed in a formal language, which makes automatic data cleaning possible. A new interactive technique for rule definition is proposed, in which rules are defined iteratively on sample data. This ensures that enough rules are defined and also speeds up rule definition. The implementation of the rules is described as well. Data cleaning is combined with the existing transformation tool SEUETL so that each strengthens the other. Previously, data analysis and data cleaning were two independent steps supported by different tools, so users could not work with them conveniently. In this paper, both the data analysis function and the data cleaning function are integrated into the data cleaning architecture. The data analysis module and the duplicate elimination module are both built on an expert system, which is well suited to expressing business rules, and the reasoning algorithms are provided. The SNM algorithm is used as the merge/purge method to resolve duplicates in large datasets. The paper also discusses metadata management, which is essential in a data cleaning tool.
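To illustrate the merge/purge step, the following is a minimal sketch of a sorted-neighborhood pass, which is what SNM usually stands for: build a sorting key for each record, sort on it, and compare only records that fall within a small sliding window. It is not the thesis's implementation. The record fields (`name`, `city`), the key construction, the similarity threshold, and the window size are all assumptions chosen for the example, and the hand-written `is_duplicate` function stands in for the expert-system rules the abstract mentions.

```python
from difflib import SequenceMatcher


def make_key(record):
    # Hypothetical sorting key: first 3 letters of the name + first 3 of the city.
    name = record["name"].lower().replace(" ", "")
    city = record["city"].lower().replace(" ", "")
    return name[:3] + city[:3]


def is_duplicate(a, b, threshold=0.85):
    # Hypothetical matching rule: similar names and the same city.
    sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return sim >= threshold and a["city"].lower() == b["city"].lower()


def sorted_neighborhood(records, window=3):
    """Return clusters of record indices judged to be duplicates."""
    # Step 1: sort record indices by their key.
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))

    # Union-find structure so that pairwise matches merge transitively.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 2: compare each record only with the previous (window - 1) records.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_duplicate(records[i], records[j]):
                parent[find(i)] = find(j)

    # Step 3: collect the clusters.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    {"name": "John Smith", "city": "Nanjing"},
    {"name": "Jon Smith", "city": "Nanjing"},
    {"name": "Mary Jones", "city": "Beijing"},
    {"name": "John Smith", "city": "Shanghai"},
]
print(sorted_neighborhood(records, window=3))
```

Because only records inside the window are compared, the cost after sorting grows with the number of records times the window size rather than with the square of the number of records, which is why the method suits large datasets. The trade-off is that duplicates whose keys sort far apart are missed, so the choice of key matters.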
Keywords/Search Tags: data warehouse, data cleaning, ETL (Extraction/Transformation/Loading), cleaning rule, interactive, domain rule, duplicate, data analysis