Research Of Data Cleaning Method Based On Data Warehouse

Posted on: 2005-05-04, Degree: Master, Type: Thesis
Country: China, Candidate: Z F Zhou, Full Text: PDF
GTID: 2168360122471117, Subject: Computer applications
Abstract/Summary:
At present, the need for enterprise informatization is increasingly urgent, and the management of enterprise data is an important part of it. By the principle of "garbage in, garbage out", managed data must be reliable, free of errors, and a true reflection of the enterprise's actual situation if it is to support correct decisions. The management of data quality is therefore receiving increasing attention, and data cleaning is an important method for improving data quality. The use of data warehouses largely reflects the degree of enterprise informatization. A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data. Because the data warehouse is the basis of decision-making, the validity of its data is vital for avoiding wrong decisions. In many cases, the data in a data warehouse derives from multiple operational data sources, which may be stored on different hardware platforms and run under different operating systems. As a result, the data from these sources inevitably contains inconsistencies. The objective of data cleaning is to solve the data quality problems that arise for these reasons, so data cleaning is regarded as one of the most important problems in building a data warehouse. One kind of data quality problem arises when a single real-world entity is represented by several records that are not completely identical, called approximately duplicated records. Detecting and eliminating approximately duplicated records is one of the main problems to be solved in data cleaning and in improving data quality.
The process of detecting approximately duplicated records can be called the record matching process. On the basis of an analysis of the current problems in data cleaning, and in particular after extensive study of the detection and elimination of approximately duplicated records, this thesis proposes a record matching method and a method for eliminating approximately duplicated records based on an RDBMS, with the aim of eliminating approximately duplicated records in a data warehouse. Experiments on a large database show that the proposed methods are efficient in eliminating approximately duplicated records.
Keywords/Search Tags: data warehouse, data quality, data cleaning, approximately duplicated records, record matching
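
The abstract does not describe the thesis's matching algorithm, so the following is only an illustrative sketch of what record matching inside an RDBMS might look like, not the author's method. It assumes a hypothetical `customer` table, uses a SQL self-join on an assumed blocking key (`city`) to generate candidate pairs, and scores name similarity with Python's `difflib`. The threshold of 0.85 and the "keep the lowest id" elimination rule are likewise assumptions chosen for the example.

```python
import sqlite3
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, drop simple punctuation, and collapse whitespace."""
    cleaned = value.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a similarity score in [0, 1] for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_approximate_duplicates(conn, threshold=0.85):
    """Return (id_a, id_b) pairs whose names look like the same entity.

    The self-join runs in the database and only pairs records that share
    the assumed blocking key (same city, case-insensitive), which avoids
    comparing every record with every other record.
    """
    rows = conn.execute(
        """
        SELECT a.id, b.id, a.name, b.name
        FROM customer AS a
        JOIN customer AS b
          ON a.id < b.id AND lower(a.city) = lower(b.city)
        """
    ).fetchall()
    return [
        (id_a, id_b)
        for id_a, id_b, name_a, name_b in rows
        if similarity(name_a, name_b) >= threshold
    ]


# Hypothetical sample data: records 1 and 2 describe the same company.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Acme Trading Co., Ltd.", "Shanghai"),
        (2, "ACME Trading Co Ltd", "shanghai"),
        (3, "Northwind Supplies", "Shanghai"),
        (4, "Acme Trading Co., Ltd.", "Beijing"),
    ],
)

pairs = find_approximate_duplicates(conn)

# Naive elimination: keep the lower id of each matched pair, delete the other.
conn.executemany("DELETE FROM customer WHERE id = ?", [(id_b,) for _, id_b in pairs])
remaining = [row[0] for row in conn.execute("SELECT id FROM customer ORDER BY id")]
```

In this sketch record 4 is never compared with record 1 because the two fall into different blocks. That is the usual trade-off of blocking: fewer comparisons, at the risk of missing duplicates whose blocking key itself is inconsistent.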