
Study and Application of Data Cleansing Technology in ETL

Posted on: 2008-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Liu
Full Text: PDF
GTID: 2178360212483661
Subject: Computer application technology
Abstract/Summary:
Extraction, transformation, and loading (ETL) is a key step in constructing a data warehouse system: it loads dispersed data from an organization's multiple sources into the data warehouse, organized by subject, thereby resolving data inconsistency and supporting organization-wide information integration. However, as ETL programs run repeatedly, large amounts of dirty data may be produced, and poor data quality can prevent correct analysis results from being obtained from the data warehouse. A data cleansing step is therefore needed before data is loaded. Data cleansing, whose main function is to eliminate inconsistent and erroneous data from the initial data sets, is a hot topic in the data warehouse domain.

After introducing the basic concepts, evaluation criteria, and categorization of data quality, this thesis divides dirty data into two categories, independent and dependent, according to the cleansing algorithm each requires, and proposes corresponding cleansing methods. The basic concepts and steps of data cleansing are described, a data cleansing model for the ETL process is defined, and the cleansing rules stored in the metadata repository are discussed; a combined data cleansing strategy using both automatic and manual methods is then proposed.

For the problem of cleansing Chinese address information, a segmentation method and algorithm based on feature words are proposed, in which a Chinese address is segmented into five fields: province, city, district, street, and number. Segmentation accuracy is ensured by matching against the standard Chinese address information in the metadata repository.

To eliminate approximately duplicated records of Chinese address information, a metadata repository of segmentation rules is established, and a detection model for approximate duplicates, together with a similarity computation algorithm based on a variable-weight strategy, is proposed.
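The feature-word segmentation described above can be sketched as follows. This is a minimal illustration, not the thesis's algorithm: the feature-word suffixes (省, 市, 区/县, 路/街, 号, etc.) are an assumed example set, whereas the thesis draws its feature words and standard address data from a metadata repository.

```python
import re

# Illustrative feature-word suffixes terminating each address field
# (the thesis's actual feature-word tables live in its metadata repository).
FIELD_PATTERNS = [
    ("province", r"(.+?(?:省|自治区|北京市|上海市|天津市|重庆市))"),
    ("city",     r"(.+?市)"),
    ("district", r"(.+?(?:区|县))"),
    ("street",   r"(.+?(?:路|街|道|巷))"),
    ("number",   r"(.+?号)"),
]

def segment_address(address: str) -> dict:
    """Greedily split a Chinese address into five fields, each field
    ending at its feature word; unmatched fields are left empty."""
    fields = {name: "" for name, _ in FIELD_PATTERNS}
    rest = address
    for name, pattern in FIELD_PATTERNS:
        m = re.match(pattern, rest)
        if m:
            fields[name] = m.group(1)
            rest = rest[m.end():]
    return fields

print(segment_address("浙江省杭州市西湖区文三路90号"))
# each field ends at its feature word: 省 / 市 / 区 / 路 / 号
```

A production version would, as the thesis suggests, validate each extracted field against standard province/city/district tables rather than trusting the suffix match alone.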
Experimental results indicate that this strategy can effectively detect approximately duplicated records in Chinese address information, improving both the running efficiency and the detection precision of the algorithm.
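One plausible reading of the variable-weight similarity computation is sketched below: each segmented field contributes a weighted string similarity, and the weights are redistributed over whichever fields are actually present in both records. The field weights, similarity measure, and threshold are illustrative assumptions, not values or formulas from the thesis.

```python
from difflib import SequenceMatcher

# Assumed base importance of each address field (not the thesis's tuned weights).
BASE_WEIGHTS = {"province": 0.10, "city": 0.15, "district": 0.20,
                "street": 0.30, "number": 0.25}

def field_similarity(a: str, b: str) -> float:
    """Edit-distance-based similarity between two field values, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted record similarity; weights are renormalized over the fields
    present in both records (one reading of a 'variable weight' strategy)."""
    present = [f for f in BASE_WEIGHTS if r1.get(f) and r2.get(f)]
    if not present:
        return 0.0
    total = sum(BASE_WEIGHTS[f] for f in present)
    return sum(BASE_WEIGHTS[f] / total * field_similarity(r1[f], r2[f])
               for f in present)

def is_duplicate(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Flag two records as approximate duplicates above a similarity threshold."""
    return record_similarity(r1, r2) >= threshold

a = {"province": "浙江省", "city": "杭州市", "district": "西湖区",
     "street": "文三路", "number": "90号"}
b = dict(a, number="92号")
print(is_duplicate(a, a))  # identical records score ~1.0 -> True
print(record_similarity(a, b))
```

Renormalizing weights over the populated fields keeps a missing house number, say, from dragging down the score of an otherwise identical pair, which matches the abstract's concern with incomplete address data.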
Keywords/Search Tags: ETL, Data cleansing, Approximately duplicated records, Feature word, Segmentation, Variable weight