MMDB Data Matching Techniques In The Data Cleaning Process

Posted on:2008-03-17

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Shi

Full Text:PDF

GTID:2208360212999961

Subject:Computer software and theory

Abstract/Summary:

Data cleaning is a five-part process that includes information matching and the detection and correction of erroneous and inconsistent information. Its purpose is to improve data quality. Data quality problems usually appear across several data sets. "Dirty data" can arise in those data sets from input errors, differences between databases, and differing data formats. Such dirty data undermines the correctness, validity, and efficiency of any processing that uses it. Data cleaning transforms dirty data into qualified "clean data" (correct data) by means of statistics, data mining, and predefined expert databases. It therefore plays an increasingly important role in managing and maintaining mass data in telecommunications and banking.

The data cleaning process consists of five main stages: data decomposition, standardization of data formats, data matching, data correction, and outcome evaluation.

This thesis focuses on the problem of data matching over mass data. To a certain degree, data matching amounts to exact-match queries between database records. Traditional database query optimization focused on reducing the number of I/O operations between disk and main memory. For the mass-data queries that arise in data cleaning, however, the traditional DRDB (Disk-Resident Database) is no longer suitable.

Advances in computer hardware have made it possible to hold an entire database in memory, which has driven the rapid development of the Main Memory Database System (MMDB) in recent years. An MMDB stores data in main memory and thus avoids a great deal of I/O when a query is executed, which can shorten query execution time. Because query processing in an MMDB involves no disk I/O, improving matching efficiency comes down to CPU execution time and cache effectiveness. There are many ways to address this problem.

One method is to build an appropriate index structure that reduces match misses during queries and shortens CPU execution time. This thesis first reviews several index structures in common use. It then presents a new index structure, the MDB-tree, together with algorithms for exact-match query and insertion designed around the particular requirements of data matching. Using a model of cache and TLB misses and a model of execution time, the MDB-tree is compared with common index structures. The analysis concludes that the MDB-tree overcomes the shortcomings of traditional query structures (low fanout, poor cache behavior, and excessive use of pointers) and improves the efficiency of data matching.
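
The abstract does not describe the layout of the MDB-tree, so the following is not that structure. It is only a minimal Python sketch of the general idea the abstract relies on: treating data matching as exact-match lookup against an in-memory index that stores keys in one sorted array and uses no child pointers. The names `SortedArrayIndex`, `normalize`, and `match_records`, and the record fields, are hypothetical and chosen purely for illustration.

```python
from bisect import bisect_left


def normalize(text):
    """Standardize a key: lowercase and collapse runs of whitespace (assumed rule)."""
    return " ".join(text.lower().split())


class SortedArrayIndex:
    """Illustrative pointer-free index: all keys live in one sorted list.

    This is NOT the MDB-tree from the thesis; it only shows exact-match
    lookup over an in-memory structure without per-node pointers.
    """

    def __init__(self, records, key):
        self._records = list(records)
        # Sort (key, row number) pairs once, then split into parallel lists.
        pairs = sorted((key(rec), row) for row, rec in enumerate(self._records))
        self._keys = [k for k, _ in pairs]
        self._rows = [row for _, row in pairs]

    def find(self, wanted):
        """Return every record whose key equals `wanted` exactly."""
        pos = bisect_left(self._keys, wanted)  # binary search, O(log n)
        hits = []
        while pos < len(self._keys) and self._keys[pos] == wanted:
            hits.append(self._records[self._rows[pos]])
            pos += 1
        return hits


def match_records(reference, incoming, key):
    """Pair each incoming record with its exact matches in the reference set."""
    index = SortedArrayIndex(reference, key)
    return [(rec, index.find(key(rec))) for rec in incoming]
```

Under these assumptions, `match_records(reference, incoming, key=lambda r: normalize(r["name"]))` would pair each incoming record with the reference records that share its normalized name. Building the index once and probing it per record is the pattern that a higher-fanout, cache-aware structure would aim to speed up.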

Keywords/Search Tags:

data cleaning, data matching, MDB-tree

Related items

1	Research Of Key Technology In Massive Data Cleaning
2	Study Of Data Cleaning Algorithms Based On Data Warehouse
3	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
4	Research Of Data Cleaning Method Based On Data Warehouse
5	Research On Data Cleaning Technology With The Design And Implementation Of Data Cleaning Framework
6	Design And Implementation Of Customer Information Cleaning In CRM System
7	Based On Spatial-temporal Correlation Sensory Data Cleaning Research
8	Research On Generating Matching Rules In Entity Matching
9	Research On Key Technologies Of On-demand Cleaning For Dirty Data
10	Key Techniques Of Structured Data Cleaning