
MMDB Data Matching Techniques in the Data Cleaning Process

Posted on: 2008-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Shi
Full Text: PDF
GTID: 2208360212999961
Subject: Computer software and theory
Abstract/Summary:
Data cleaning covers information matching and the detection and correction of erroneous and inconsistent information. Its purpose is to improve data quality. Data quality problems usually arise across several data sets, where "dirty data" can appear because of input errors, differences between databases, and differing data formats. Such dirty data undermines the correctness, validity, and efficiency of any later processing. Data cleaning transforms dirty data into qualified "clean data" (that is, correct data) by means of statistics, data mining, and predefined expert databases. It therefore plays an increasingly important role in the management and maintenance of mass data in fields such as telecommunications and banking.

The data cleaning process consists of five main stages: data decomposition, standardization of data formats, data matching, data correction, and evaluation of the outcome. This paper focuses on the problem of data matching over mass data. To a certain degree, data matching amounts to exact-match queries between database records. Traditional database query optimization has concentrated on reducing the number of I/O operations between disk and main memory. For the mass data queries that data cleaning requires, however, the conventional disk-resident database (DRDB) is no longer suitable.

Advances in computer hardware have made it possible to hold an entire database in memory, which has driven the rapid development of main memory database (MMDB) systems in recent years. An MMDB stores its data in main memory and so avoids a great deal of I/O when a query is executed, which can shorten query execution time. Because query processing in an MMDB involves no disk I/O, the key to improving matching efficiency lies in CPU execution time and cache behavior. There are several ways to address this problem.
One method is to build an appropriate index structure that reduces match misses during queries and shortens CPU execution time. This paper first reviews several index structures in common use. It then presents a new index structure, the MDB-tree, together with its exact-match query and insertion algorithms, designed around the particular characteristics of data matching. Using a model of cache and TLB misses and a model of execution time, the paper compares the MDB-tree with common index structures. The analysis leads to the conclusion that the MDB-tree overcomes the shortcomings of traditional query structures, namely low fanout, poor cache behavior, and excessive use of pointers, and thereby improves the efficiency of data matching.
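
The abstract does not describe the MDB-tree's layout, so the following is not the MDB-tree. It is only a minimal Python sketch of the general idea behind such structures: an in-memory index for exact-match lookups that keeps many keys per node and locates a node by arithmetic rather than by following pointers. The names `BlockedIndex`, `standardize`, and `block_size`, as well as the sample records, are hypothetical and chosen purely for illustration.

```python
from bisect import bisect_left, bisect_right


def standardize(value):
    """Hypothetical format standardization: collapse whitespace, lowercase."""
    return " ".join(value.split()).lower()


class BlockedIndex:
    """Static, pointer-free index: sorted keys in one flat list, split into blocks."""

    def __init__(self, records, block_size=64):
        # records: iterable of (key, row_id) pairs. block_size is an assumed
        # stand-in for "keys per node"; a larger value means a higher fanout.
        pairs = sorted(records, key=lambda pair: pair[0])
        self.keys = [key for key, _ in pairs]
        self.rows = [row for _, row in pairs]
        self.block_size = block_size
        # One separator per block: the first key stored in that block.
        self.separators = self.keys[::block_size]

    def find(self, key):
        """Return the row id of an exact match, or None if there is none."""
        if not self.keys:
            return None
        # Step 1: pick the last block whose first key is <= key.
        block = bisect_right(self.separators, key) - 1
        if block < 0:
            return None
        # Step 2: the block's offset is computed, not reached through a pointer.
        lo = block * self.block_size
        hi = min(lo + self.block_size, len(self.keys))
        i = bisect_left(self.keys, key, lo, hi)
        if i < hi and self.keys[i] == key:
            return self.rows[i]
        return None


# Usage: match an incoming record against standardized reference data.
reference = [("alice smith", 1), ("bob jones", 2), ("carol wu", 3)]
index = BlockedIndex(reference, block_size=2)
print(index.find(standardize("  Bob   JONES ")))  # 2
print(index.find(standardize("Dave Lee")))        # None
```

Python itself will not show the cache effects the thesis measures. The sketch only illustrates the layout choice: contiguous keys, wide nodes, and computed offsets in place of per-node child pointers.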
Keywords/Search Tags:data cleaning, data matching, MDB-tree