The Mechanism of Using Data Lineage to Improve Data Cleaning Quality

Posted on: 2005-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Zhong
GTID: 2168360122992735
Subject: Computer application technology
Abstract/Summary:
With the development of communication and network technology, the Internet has brought together a large number of data sources, forming a huge distributed, heterogeneous database environment. More and more applications need to integrate structured heterogeneous data sources, while many existing data sources are still semi-structured or unstructured. Obtaining consistent, high-quality data is therefore a central problem in improving the efficiency of these applications.

Data cleaning is an effective technology for improving data quality and is well established in decision support systems and data warehouses. In practice, however, it still has several inadequacies. First, the process used by data cleaning tools, ETL (Extraction, Transformation, Loading), is unidirectional and cannot trace intermediate cleaning results. Second, the process lacks a suitable mechanism for explaining the cleaning results. Third, the process has no interactive mechanism for tuning the data cleaning program.

To address these shortcomings, this paper introduces a data lineage mechanism into the data cleaning process. The mechanism can trace each intermediate cleaning result through all operators and provides an interactive interface for correcting the cleaned data, so that the data cleaning algorithm and the parameters of external functions can be adjusted in time to reach the best quality standard.

We first construct five traceable operators and define their detailed syntax, so that the result of each operator can be explained and multi-operator data cleaning programs can be built. By propagating the key information of each operator, the data lineage mechanism can trace a data cleaning program built from the traceable operators, so that the cleaning results can be analyzed and explained.

Based on an analysis of the incremental mode and of conflicts in data modification, exceptions that arise during data modification can be corrected and improved. Because a data cleaning program must process a large amount of data, future work will apply machine learning to exception correction and further study the relation between clustering and the incremental mode.
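
The idea of tracing a cleaning result back through operators by propagating key information can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Record` and `LineageStore` types, and the two operators `map_op` and `merge_op`, are hypothetical and are not the five traceable operators of the thesis, whose syntax is not given in this abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str             # unique key of this record
    value: str           # the data being cleaned
    parents: tuple = ()  # keys of the records this one was derived from


class LineageStore:
    """Keeps every intermediate record so results can be traced back."""

    def __init__(self):
        self.records = {}

    def add(self, record):
        self.records[record.key] = record
        return record

    def trace(self, key):
        """Return the keys of the source records that contributed to `key`."""
        record = self.records[key]
        if not record.parents:
            return {key}  # a source record is its own lineage
        sources = set()
        for parent in record.parents:
            sources |= self.trace(parent)
        return sources


def map_op(store, name, records, fn):
    """Hypothetical operator: apply fn per record; output keeps its input key."""
    return [
        store.add(Record(f"{name}:{r.key}", fn(r.value), (r.key,)))
        for r in records
    ]


def merge_op(store, name, records, group_fn):
    """Hypothetical operator: merge records that share a group key."""
    groups = {}
    for r in records:
        groups.setdefault(group_fn(r.value), []).append(r)
    merged = []
    for group_key, members in groups.items():
        parents = tuple(m.key for m in members)
        merged.append(store.add(Record(f"{name}:{group_key}", group_key, parents)))
    return merged


store = LineageStore()
raw_values = [" Alice ", "alice", "Bob"]
sources = [store.add(Record(f"src:{i}", v)) for i, v in enumerate(raw_values)]
trimmed = map_op(store, "trim", sources, str.strip)
merged = merge_op(store, "dedup", trimmed, str.lower)
```

Here `store.trace("dedup:alice")` returns `{"src:0", "src:1"}`, the two source records that were merged into one cleaned record. Because every intermediate record is kept, a user who disagrees with a merge could inspect the contributing inputs and adjust the grouping function, which is the kind of interactive tuning the abstract describes.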
Keywords/Search Tags: Data lineage, Data cleaning, Data quality