
Data Cleaning Algorithm And Applications

Posted on: 2006-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Zhou
Full Text: PDF
GTID: 2208360152498760
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, organizational managers depend more and more on data when making decisions. On the foundation of databases, data warehouses have emerged to support decision analysis. During the construction of a data warehouse, however, data from many different sources are loaded into the warehouse, and numerous data quality problems may arise, leading to false analytical conclusions and degrading the quality of information services. There is therefore a strong need for a data cleansing process to improve data quality. Data cleansing is becoming an important topic in data warehousing and data mining, as well as in web data processing.

This paper describes data cleansing in detail. We introduce the concept and significance of data cleansing and survey the current state of research and application both in China and abroad. We summarize the theories, methods, evaluation standards, and basic workflow of data cleansing. Our research emphasis is on the techniques and algorithms of field cleansing and duplicate-record cleansing, for which we propose improved algorithms.

For field cleansing, we briefly introduce the basic knowledge and methods, and focus on applying statistical analysis and artificial-intelligence techniques to automatically detect erroneous field values. We give experimental results and conclusions on a real-world dataset.

For duplicate-record cleansing, we introduce the basic knowledge and workflow, and describe the main techniques and algorithms of each step in detail, giving improved algorithms that address the limitations of the original ones at each step. These mainly include an improved method of sorting the dataset by a sort key and, for duplicate-record detection, a field-match algorithm and an abbreviation-discovery algorithm based on edit distance.
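The edit-distance-based field matching mentioned above can be illustrated with a minimal sketch. The function names and the 0.3 threshold below are illustrative assumptions, not the thesis's actual algorithm; normalized Levenshtein distance is a common basis for deciding whether two field values match.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            # insertion, deletion, or substitution
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def fields_match(a: str, b: str, threshold: float = 0.3) -> bool:
    """Treat two field values as a match when their edit distance,
    normalized by the longer string's length, is below a threshold.
    The threshold value here is an illustrative assumption."""
    if not a and not b:
        return True
    return edit_distance(a, b) / max(len(a), len(b)) <= threshold
```

For example, `fields_match("Jon Smith", "John Smith")` holds because one insertion suffices, while two unrelated names fall above the threshold.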
For record matching, we propose an optimized method that uses valid weight values and length filtering to reduce the runtime of the original algorithm and improve its efficiency. For clustering duplicate records at the database level, we correct two limitations of the traditional sorted-neighborhood method (SNM) and present an improved SNM. Finally, we compare the improved and original algorithms in terms of runtime and efficiency. To resolve the data cleansing problems arising during the construction of a data warehouse for the Qing Dao harbor bureau, we designed an experimental data...
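As a rough illustration of the sorted-neighborhood method referred to above: records are sorted by a key, and a fixed-size window then slides over the sorted list so that only records inside the window are compared, avoiding a full pairwise comparison. The helper names and default window size below are hypothetical, and this sketch shows only the classic SNM, not the thesis's improved variant.

```python
from typing import Callable, List, Tuple

def snm_duplicates(records: List[dict],
                   sort_key: Callable[[dict], str],
                   is_match: Callable[[dict, dict], bool],
                   window: int = 5) -> List[Tuple[int, int]]:
    """Classic sorted-neighborhood method: sort record indices by a key,
    then compare each record only with the next (window - 1) records."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs: List[Tuple[int, int]] = []
    for w, i in enumerate(order):
        for j in order[w + 1 : w + window]:
            if is_match(records[i], records[j]):
                pairs.append(tuple(sorted((i, j))))
    return pairs
```

Sorting brings likely duplicates near one another, so a small window recovers most duplicate pairs at a fraction of the O(n^2) cost of comparing every pair.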
Keywords/Search Tags:data cleansing, field cleansing, duplicate records cleansing, field match, edit distance, abbreviation discovery