
Research and Application of the Cleaning Strategy in Heterogeneous Data Source Integration

Posted on: 2005-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: H G Zhou
Full Text: PDF
GTID: 2208360125957140
Subject: Computer application technology
Abstract/Summary:
Demand for data integration is long-standing, and data integration technology has remained an active research topic in data management and related fields. This thesis studies the problems that arise when integrating heterogeneous data sources, focusing on the data cleaning strategy and the algorithms involved. It provides a general-purpose solution for cleaning dirty data so as to guarantee the quality of the integrated data.

The thesis proposes a basic strategy that combines data cleaning with schema transformation through integration tools. It also designs a general framework for heterogeneous data source integration, offering a new way to improve how integration tools process data. Dirty data is divided into single-record and multi-record dirty data according to the cleaning measures each requires. Single-record dirty data is cleaned by applying cleaning rules. For multi-record dirty data, the thesis studies cleaning strategies for two common kinds: incomplete data and duplicate records. For incomplete data, it offers a processing approach based on the strategy pattern, which implements three kinds of algorithm: simple processing, KNN and DTB. For duplicate records, it proposes an object identification framework and designs algorithms for each stage within it: data pre-processing, morphological analysis, record character tagging, similarity analysis and object clustering.

Finally, the data cleaning strategy is applied to the Universal Customer Information System (UCIS). Integration and cleaning experiments in UCIS verify the feasibility and effectiveness of the cleaning strategy and the related algorithms.
Keywords/Search Tags: data cleaning, data integration, data quality, dirty data
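
The abstract describes handling incomplete data through the strategy pattern with interchangeable algorithms. The thesis's own implementation is not shown here, so the following is only a minimal sketch of that idea under stated assumptions: the class names (`ImputationStrategy`, `MeanFill`, `KNNFill`, `Cleaner`) are hypothetical, "simple processing" is assumed to mean column-mean filling, and records are assumed to be lists of numbers with `None` marking a missing value. DTB is left out because the abstract does not define it.

```python
from abc import ABC, abstractmethod
from math import sqrt
from typing import List, Optional

Record = List[Optional[float]]  # None marks a missing value (assumption)


class ImputationStrategy(ABC):
    """Common interface so that fill algorithms can be swapped freely."""

    @abstractmethod
    def fill(self, records: List[Record]) -> List[Record]:
        ...


class MeanFill(ImputationStrategy):
    """Assumed 'simple processing': use the mean of the observed column values."""

    def fill(self, records: List[Record]) -> List[Record]:
        n_cols = len(records[0])
        means = []
        for c in range(n_cols):
            observed = [r[c] for r in records if r[c] is not None]
            means.append(sum(observed) / len(observed) if observed else None)
        return [[means[c] if v is None else v for c, v in enumerate(r)]
                for r in records]


class KNNFill(ImputationStrategy):
    """Fill a missing value with the mean over the k nearest records."""

    def __init__(self, k: int = 2) -> None:
        self.k = k

    @staticmethod
    def _distance(a: Record, b: Record) -> float:
        # Compare only the columns that both records actually have.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    def fill(self, records: List[Record]) -> List[Record]:
        filled = []
        for i, rec in enumerate(records):
            new_rec = list(rec)
            for c, value in enumerate(rec):
                if value is not None:
                    continue
                # Candidate donors: other records with this column present.
                donors = [(self._distance(rec, other), other[c])
                          for j, other in enumerate(records)
                          if j != i and other[c] is not None]
                donors.sort(key=lambda d: d[0])
                nearest = [v for _, v in donors[:self.k]]
                if nearest:
                    new_rec[c] = sum(nearest) / len(nearest)
            filled.append(new_rec)
        return filled


class Cleaner:
    """Context object: delegates to whichever strategy it was given."""

    def __init__(self, strategy: ImputationStrategy) -> None:
        self.strategy = strategy

    def clean(self, records: List[Record]) -> List[Record]:
        return self.strategy.fill(records)


records = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [10.0, 100.0]]
mean_result = Cleaner(MeanFill()).clean(records)
knn_result = Cleaner(KNNFill(k=2)).clean(records)
```

Because `Cleaner` depends only on the `ImputationStrategy` interface, a further algorithm could be added as one more subclass without changing the calling code, which is the property the strategy pattern is generally chosen for.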