Research On Key Technologies For Data Extracting In Data Warehousing

Posted on:2004-05-23

Degree:Master

Type:Thesis

Country:China

Candidate:J J Miao

Full Text:PDF

GTID:2168360152457104

Subject:Management Science and Engineering

Abstract/Summary:

Data extraction is at the core of building a data warehouse. It converts data from the data sources into the target data warehouse, integrating the data according to uniform rules and enhancing its value, and it is an important means of implementing a data warehouse. This thesis describes the design of a data extraction system, focusing on two key technologies: incremental data extraction and duplicate record detection.

For incremental data extraction from data sources, the system design draws on the WHIPS project of the Stanford University database group. Several snapshot difference algorithms were applied in order to understand their scope of application, processing speed, and accuracy. For data sources that provide log support, the content of the log records in Oracle and SQL Server was analyzed and a procedure for extracting incremental data was proposed. We improve the implementation of the data source monitor in two ways: (1) the monitored objects are changed from base tables to source views, which avoids propagating unnecessary source data changes; (2) we provide monitor rules according to which the monitor detects, analyzes, and propagates changes in the source data, and these rules can be predefined to meet complex requirements at the integration end, such as the monitoring period.

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. We investigate the problem of detecting duplicates based on their structural features, and then present an efficient and effective algorithm for recognizing clusters of approximately duplicate records. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence records and to support the distance measure. A variation of the suffix tree, the probabilistic suffix tree, is employed to organize the CPD concisely. Based on near-neighbor rules, we select a rule function to evaluate the clustering results. Finally, a dynamic clustering algorithm is employed to cluster the dataset. Comprehensive experiments on synthetic database records confirm the effectiveness of the new algorithm.
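
The abstract does not name the snapshot difference algorithms that were compared, so the following is only a minimal sketch of the general idea, assuming each snapshot fits in memory and every row has a primary key. The names `Snapshot` and `snapshot_diff` are hypothetical and are not taken from the thesis or from WHIPS.

```python
from typing import Dict, Hashable, List, Tuple

Row = Tuple                      # a row is a tuple of column values
Snapshot = Dict[Hashable, Row]   # assumed layout: primary key -> row


def snapshot_diff(old: Snapshot, new: Snapshot) -> List[Tuple[str, Hashable, Row]]:
    """Compare two keyed snapshots and return the incremental changes.

    This is a naive in-memory comparison, not one of the algorithms
    evaluated in the thesis.
    """
    changes: List[Tuple[str, Hashable, Row]] = []
    # Rows present in the new snapshot: either inserted or updated.
    for key, row in new.items():
        if key not in old:
            changes.append(("INSERT", key, row))
        elif old[key] != row:
            changes.append(("UPDATE", key, row))
    # Rows that exist only in the old snapshot were deleted at the source.
    for key, row in old.items():
        if key not in new:
            changes.append(("DELETE", key, row))
    return changes


if __name__ == "__main__":
    old = {1: ("alice",), 2: ("bob",), 3: ("carol",)}
    new = {1: ("alice",), 2: ("robert",), 4: ("dave",)}
    for change in snapshot_diff(old, new):
        print(change)
```

Only the resulting change list would need to be propagated to the warehouse, which is the point of incremental extraction; snapshots that are too large for memory need a different strategy, such as comparing sorted files.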

Keywords/Search Tags:

Data Extracting, Snapshot Difference, Distance Between Two Strings, Approximately Duplicated Records, Dynamic Clustering, Probabilistic Suffix Tree

Related items

1	Research On Data Cleaning Of Approximately Duplicated Records
2	An Improved Method For Detecting Incremental Approximately Duplicate Records Based On Clustering Tree
3	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
4	Research On Detection Of Approximate Duplicate Records For Massive Data
5	Some Main Technology's Research Of Data Cleaning
6	Research On The Method Of Approximately Duplicated Records Detection For Text Data In Big Data Environment
7	Study And Application Of The Data Cleansing Technology In ETL
8	Web Information Extracting Based On Tree Edit Distance
9	Study On ETL Technology Based On XML Data Resources
10	Research Of Data Cleaning Method Based On Data Warehouse