Research Of Data Cleansing Algorithms For Duplicate Records Detection Problem

Posted on: 2019-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhang
GTID: 2428330572955932
Subject: Engineering
Abstract/Summary:
With the development of information technology and informatization, datasets keep growing, and dirty data such as erroneous, duplicate, and incomplete records are inevitable. Effective data cleaning algorithms are therefore necessary. Duplicate record detection, the task of finding the duplicate records in a given dataset, is one of the most important problems in data cleaning. In this thesis, we study and improve algorithms for this problem.

In the real world, it is difficult to design effective algorithms for duplicate record detection because the data are large and come from different sources. Existing algorithms, such as the Sorted-Neighborhood Method (SNM) and the Multi-Pass Sorted-Neighborhood Method (MPN), have shortcomings when applied to real-world problems: their effectiveness relies on expert knowledge of the dataset, so datasets without prior knowledge are hard to handle. To overcome these shortcomings, we propose the Optimized Multi-Pass Sorted-Neighborhood Method (OMPN). In addition, we combine the OMPN with a genetic-based artificial neural network and propose the Advanced and Optimized Multi-Pass Sorted-Neighborhood Method (A-OMPN) and the BP-network-based Multi-Pass Sorted-Neighborhood Method (BP-OMPN). In our experiments, the A-OMPN and the BP-OMPN are superior to the other algorithms. Finally, we apply the proposed algorithm to a spacecraft information management system to perform data cleaning on a real-world problem.

The main contributions of this thesis are as follows:

1. The Optimized Multi-Pass Sorted-Neighborhood Method (OMPN) is proposed. The MPN first sorts all the records and then uses a fixed-size sliding window to check for duplicate records. However, it needs expert knowledge to select the sorting key and to detect the duplicate records within a sliding window. In the OMPN, a method based on field distinction degree selects the key without expert knowledge. The OMPN also uses a scalable sliding window to make the detection process more precise, and it takes incomplete data into account through a pre-labeling scheme. Compared with other algorithms, the OMPN performs well and is suitable for real-world duplicate record detection.

2. The Advanced and Optimized Multi-Pass Sorted-Neighborhood Method (A-OMPN) is proposed. When a genetic-based artificial neural network is used on its own, it must select pairs of records from the whole dataset and check whether each pair is a duplicate. This is very time-consuming, and the detection stage can be simplified. The A-OMPN combines the genetic-based artificial neural network with the OMPN so that record pairs are selected only within a sliding window. Compared with the genetic-based artificial neural network alone, it improves both precision and recall and also reduces the runtime. However, training an appropriate genetic-based artificial neural network is still time-consuming. We also experiment with a single BP network, which yields the BP-network-based Multi-Pass Sorted-Neighborhood Method (BP-OMPN). Experimental results show that the A-OMPN and the BP-OMPN both perform well.

3. We apply the proposed algorithm to the spacecraft information management system, in which the data cleaning module is one of the most important modules. We analyze the OMPN, the A-OMPN, and the BP-OMPN on the given spacecraft dataset and finally choose the OMPN to implement this module.
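The baseline that the abstract describes (sort the records by a key, then compare only the records that fall inside a fixed-size sliding window, and repeat with several keys) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names `similarity`, `snm_pass`, and `multi_pass_snm`, the averaged string-similarity measure, and the `window` and `threshold` defaults are all assumptions made for this example. The OMPN's key selection by field distinction degree, its scalable window, and its pre-labeling of incomplete data are not reproduced here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Average per-field string similarity of two records (an assumed measure)."""
    scores = [SequenceMatcher(None, str(x), str(y)).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def snm_pass(records, key, window=3, threshold=0.85):
    """One sorted-neighborhood pass with a fixed-size window (hypothetical helper)."""
    # Sort record indices by the chosen key so likely duplicates end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare record i only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_snm(records, keys, window=3, threshold=0.85):
    """Run one pass per sorting key and return the union of detected pairs."""
    pairs = set()
    for key in keys:
        pairs |= snm_pass(records, key, window, threshold)
    return pairs


if __name__ == "__main__":
    # Made-up records: (name, address). Records 0 and 2 differ by one typo.
    records = [
        ("John Smith", "12 Main St"),
        ("Alice Wong", "9 Lake Rd"),
        ("Jon Smith", "12 Main St"),
        ("Bob Stone", "77 Hill Ave"),
    ]
    keys = [lambda r: r[0], lambda r: r[1]]
    print(multi_pass_snm(records, keys))
```

With a window of size w over n records, each pass makes on the order of n·w comparisons instead of the n² needed to compare every pair. The cost is that a duplicate pair is missed whenever the chosen key sorts its two records far apart, which is why the choice of key and window size matters so much in this family of methods.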
Keywords/Search Tags: Duplicate Records Detection, Data Cleaning, Sorted-Neighborhood, Neural Network, Genetic Algorithm