
Study and Application of Data Cleansing Technology in ETL

Posted on: 2008-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Liu
Full Text: PDF
GTID: 2178360212483661
Subject: Computer application technology
Abstract/Summary:
Extraction, transformation, and loading (ETL) is a key step in constructing a data warehouse system: it loads dispersed data from an organization's multiple sources into the data warehouse, organized by subject, thereby resolving data inconsistency and supporting organization-wide information integration. However, as ETL programs run repeatedly, large amounts of dirty data may be produced, and poor data quality can prevent correct analysis results from being obtained from the data warehouse. A data cleansing step is therefore needed before data is loaded. Data cleansing, whose main function is to eliminate inconsistent and erroneous data from the initial data sets, is a hot topic in the data warehouse domain.

After introducing the basic concepts, evaluation criteria, and categorization of data quality, this thesis divides dirty data into two categories, independent and dependent, according to the cleansing algorithm each requires, and proposes corresponding cleansing methods. The basic concepts and steps of data cleansing are described, a data cleansing model for the ETL process is defined, and the cleansing rules stored in the metadata repository are discussed; a combined data cleansing strategy using both automatic and manual methods is then proposed.

For the problem of cleansing Chinese address information, a segmentation method and algorithm based on feature words are proposed, in which a Chinese address is segmented into five fields: province, city, district, street, and number. Segmentation accuracy is ensured by matching against the standard Chinese address information in the metadata repository.

To eliminate approximately duplicated records of Chinese address information, a metadata repository of segmentation rules is established, and a detection model for approximate duplicates, together with a similarity computation algorithm based on a variable-weight strategy, is proposed.
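The feature-word segmentation described above can be sketched as follows. This is a minimal illustration, not the thesis's algorithm: the feature-word suffixes (省, 市, 区/县, 路/街, 号, etc.) are an assumed example set, whereas the thesis draws its feature words and standard address data from a metadata repository.

```python
import re

# Illustrative feature-word suffixes terminating each address field
# (the thesis's actual feature-word tables live in its metadata repository).
FIELD_PATTERNS = [
    ("province", r"(.+?(?:省|自治区|北京市|上海市|天津市|重庆市))"),
    ("city",     r"(.+?市)"),
    ("district", r"(.+?(?:区|县))"),
    ("street",   r"(.+?(?:路|街|道|巷))"),
    ("number",   r"(.+?号)"),
]

def segment_address(address: str) -> dict:
    """Greedily split a Chinese address into five fields, each field
    ending at its feature word; unmatched fields are left empty."""
    fields = {name: "" for name, _ in FIELD_PATTERNS}
    rest = address
    for name, pattern in FIELD_PATTERNS:
        m = re.match(pattern, rest)
        if m:
            fields[name] = m.group(1)
            rest = rest[m.end():]
    return fields

print(segment_address("浙江省杭州市西湖区文三路90号"))
# each field ends at its feature word: 省 / 市 / 区 / 路 / 号
```

A production version would, as the thesis suggests, validate each extracted field against standard province/city/district tables rather than trusting the suffix match alone.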
Experimental results indicate that this strategy can effectively detect approximately duplicated records in Chinese address information, improving both the running efficiency and the detection precision of the algorithm.
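One plausible reading of the variable-weight similarity computation is sketched below: each segmented field contributes a weighted string similarity, and the weights are redistributed over whichever fields are actually present in both records. The field weights, similarity measure, and threshold are illustrative assumptions, not values or formulas from the thesis.

```python
from difflib import SequenceMatcher

# Assumed base importance of each address field (not the thesis's tuned weights).
BASE_WEIGHTS = {"province": 0.10, "city": 0.15, "district": 0.20,
                "street": 0.30, "number": 0.25}

def field_similarity(a: str, b: str) -> float:
    """Edit-distance-based similarity between two field values, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted record similarity; weights are renormalized over the fields
    present in both records (one reading of a 'variable weight' strategy)."""
    present = [f for f in BASE_WEIGHTS if r1.get(f) and r2.get(f)]
    if not present:
        return 0.0
    total = sum(BASE_WEIGHTS[f] for f in present)
    return sum(BASE_WEIGHTS[f] / total * field_similarity(r1[f], r2[f])
               for f in present)

def is_duplicate(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Flag two records as approximate duplicates above a similarity threshold."""
    return record_similarity(r1, r2) >= threshold

a = {"province": "浙江省", "city": "杭州市", "district": "西湖区",
     "street": "文三路", "number": "90号"}
b = dict(a, number="92号")
print(is_duplicate(a, a))  # identical records score ~1.0 -> True
print(record_similarity(a, b))
```

Renormalizing weights over the populated fields keeps a missing house number, say, from dragging down the score of an otherwise identical pair, which matches the abstract's concern with incomplete address data.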
Keywords/Search Tags: ETL, Data cleansing, Approximately duplicated records, Feature word, Segmentation, Variable weight