
Data Cleaning Algorithm And Applications

Posted on: 2006-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Zhou
Full Text: PDF
GTID: 2208360152498760
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, organizational managers depend more and more on data when making decisions. On the foundation of databases, data warehouses have emerged to support decision analysis. During the construction of a data warehouse, however, data from many different sources are loaded into the warehouse, and numerous data quality problems may arise, leading to false analytical conclusions and degrading the quality of information services. There is therefore a strong need for a data cleansing process to improve data quality. Data cleansing is becoming an important topic in data warehousing and data mining, as well as in web data processing.

This paper describes data cleansing in detail. We introduce the concept and significance of data cleansing and survey the current state of research and application both in China and abroad. We summarize the theories, methods, evaluation standards, and basic workflow of data cleansing. Our research emphasis is on the techniques and algorithms of field cleansing and duplicate-record cleansing, for which we propose improved algorithms.

For field cleansing, we briefly introduce the basic knowledge and methods, and focus on applying statistical analysis and artificial-intelligence techniques to automatically detect erroneous field values. We give experimental results and conclusions on a real-world dataset.

For duplicate-record cleansing, we introduce the basic knowledge and workflow, and describe the main techniques and algorithms of each step in detail, giving improved algorithms that address the limitations of the original ones at each step. These mainly include an improved method of sorting the dataset by a sort key and, for duplicate-record detection, a field-match algorithm and an abbreviation-discovery algorithm based on edit distance.
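The edit-distance-based field matching mentioned above can be illustrated with a minimal sketch. The function names and the 0.3 threshold below are illustrative assumptions, not the thesis's actual algorithm; normalized Levenshtein distance is a common basis for deciding whether two field values match.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            # insertion, deletion, or substitution
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def fields_match(a: str, b: str, threshold: float = 0.3) -> bool:
    """Treat two field values as a match when their edit distance,
    normalized by the longer string's length, is below a threshold.
    The threshold value here is an illustrative assumption."""
    if not a and not b:
        return True
    return edit_distance(a, b) / max(len(a), len(b)) <= threshold
```

For example, `fields_match("Jon Smith", "John Smith")` holds because one insertion suffices, while two unrelated names fall above the threshold.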
For record matching, we propose an optimized method that uses valid weight values and length filtering to reduce the runtime of the original algorithm and improve its efficiency. For clustering duplicate records at the database level, we correct two limitations of the traditional sorted-neighborhood method (SNM) and present an improved SNM. Finally, we compare the improved and original algorithms in terms of runtime and efficiency. To resolve the data cleansing problems arising during the construction of a data warehouse for the Qing Dao harbor bureau, we designed an experimental data...
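As a rough illustration of the sorted-neighborhood method referred to above: records are sorted by a key, and a fixed-size window then slides over the sorted list so that only records inside the window are compared, avoiding a full pairwise comparison. The helper names and default window size below are hypothetical, and this sketch shows only the classic SNM, not the thesis's improved variant.

```python
from typing import Callable, List, Tuple

def snm_duplicates(records: List[dict],
                   sort_key: Callable[[dict], str],
                   is_match: Callable[[dict, dict], bool],
                   window: int = 5) -> List[Tuple[int, int]]:
    """Classic sorted-neighborhood method: sort record indices by a key,
    then compare each record only with the next (window - 1) records."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs: List[Tuple[int, int]] = []
    for w, i in enumerate(order):
        for j in order[w + 1 : w + window]:
            if is_match(records[i], records[j]):
                pairs.append(tuple(sorted((i, j))))
    return pairs
```

Sorting brings likely duplicates near one another, so a small window recovers most duplicate pairs at a fraction of the O(n^2) cost of comparing every pair.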
Keywords/Search Tags:data cleansing, field cleansing, duplicate records cleansing, field match, edit distance, abbreviation discovery