
Research On Key Technologies For Data Extracting In Data Warehousing

Posted on: 2004-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: J J Miao
Full Text: PDF
GTID: 2168360152457104
Subject: Management Science and Engineering
Abstract/Summary:
Data extraction is the core of building a data warehouse. It integrates data according to uniform rules, enhances the value of the data, and is responsible for converting data from the data sources into the target data warehouse, which makes it an important means of implementing a data warehouse. This thesis presents the design of a data extraction system, with a focus on two key technologies: incremental data extraction and duplicate record detection.

For incremental extraction from data sources, the system design draws on the WHIPS project of the Stanford University database group. Several snapshot difference algorithms were applied in order to assess their scope of application, processing speed, and accuracy. For data sources with log system support, the content of the log records in Oracle and SQL Server was analyzed, and a procedure for extracting incremental data is proposed. We improve the implementation of the data source monitor as follows:
1. The monitored objects are changed from base tables to source views, so as to avoid propagating unnecessary source data changes.
2. We provide monitor rules according to which the monitor detects, analyzes, and propagates changes in the source data. These rules can be predefined to meet complex demands on the integration side, such as the monitoring period.

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. In this thesis, we investigate the problem of detecting duplicates based on their structural features, and we present an efficient and effective algorithm for recognizing clusters of approximately duplicate records. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize each sequence record and to support the distance measure. A variation of the suffix tree, namely the probabilistic suffix tree, is employed to organize the CPD in a concise way. Based on near-neighbor rules, we select a criterion function to evaluate the clustering results. Finally, a dynamic clustering algorithm is employed to cluster the dataset. Comprehensive experiments on synthetic database records confirm the effectiveness of the new algorithm.
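As an illustration of the snapshot difference idea mentioned in the abstract, the following is a minimal sketch in Python. It is not one of the algorithms evaluated in the thesis or in WHIPS. It assumes that each snapshot fits in memory as a list of rows (dictionaries) and that every row carries a unique key column. The function name `snapshot_diff` and the column names are hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable, List, Tuple

Row = Dict[str, Any]


def snapshot_diff(
    old_rows: Iterable[Row], new_rows: Iterable[Row], key: str
) -> Tuple[List[Row], List[Row], List[Tuple[Row, Row]]]:
    """Compare two snapshots of a source table and return the changes.

    Assumption: `key` names a column that uniquely identifies a row
    in both snapshots.
    """
    # Index both snapshots by key so each row is looked up in O(1).
    old_by_key: Dict[Hashable, Row] = {row[key]: row for row in old_rows}
    new_by_key: Dict[Hashable, Row] = {row[key]: row for row in new_rows}

    # Rows whose key appears only in the new snapshot were inserted.
    inserts = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Rows whose key appears only in the old snapshot were deleted.
    deletes = [row for k, row in old_by_key.items() if k not in new_by_key]
    # Rows present in both snapshots but with different content were updated.
    updates = [
        (old_by_key[k], row)
        for k, row in new_by_key.items()
        if k in old_by_key and old_by_key[k] != row
    ]
    return inserts, deletes, updates


if __name__ == "__main__":
    # Hypothetical example data: two snapshots of the same source table.
    old_snapshot = [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 3, "name": "c"},
    ]
    new_snapshot = [
        {"id": 2, "name": "b"},
        {"id": 3, "name": "C"},
        {"id": 4, "name": "d"},
    ]
    inserted, deleted, updated = snapshot_diff(old_snapshot, new_snapshot, "id")
    print("inserted:", inserted)
    print("deleted:", deleted)
    print("updated:", updated)
```

Only the three change lists would need to be propagated to the warehouse, rather than the full new snapshot. For snapshots too large for memory, sort-merge or windowed comparisons are common alternatives, with different trade-offs in speed and accuracy.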
Keywords/Search Tags: Data Extracting, Snapshot Difference, Distance Between Two Strings, Approximately Duplicated Records, Dynamic Clustering, Probabilistic Suffix Tree