Research On Data Cleaning Of Approximately Duplicated Records

Posted on: 2017-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: K X Wang
GTID: 2348330488966015
Subject: Computer application technology
Abstract/Summary:
With the continuous development of network technology and the widespread use of data storage technology, large volumes of data are generated every day. Much of this data is erroneous. In particular, merging databases produces large numbers of approximately duplicated records, which seriously degrade data quality and usability. Cleaning approximately duplicated records has therefore become an urgent problem, and solving it has practical significance for improving data quality. This thesis studies the cleaning of approximately duplicated records. It first analyzes the attribute structure and the causes of approximately duplicated records. It then applies the N-Gram algorithm to each record to compute a key value that represents the record's attributes. The records in the database are sorted by this key to obtain an ordered database, and the similarity between records is then calculated. To improve the accuracy and efficiency of similarity matching, a sliding-window algorithm applies a fixed-size window to the data to be cleaned, and a recursive field matching algorithm computes the similarity of the records inside the window. This approach not only identifies approximately duplicated records with missing fields, reversed order, and abbreviated terms, which improves the accuracy of data cleaning, but also reduces the number of comparisons between records, which improves its efficiency. Finally, using a priority scheme, records whose similarity exceeds a given threshold are cleaned according to the priority of each record, which makes data cleaning considerably more automated and reduces manual effort. By combining the N-Gram algorithm with the recursive field matching algorithm, this thesis implements the recognition of approximately duplicated records and then cleans those records to improve data quality.
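
The pipeline described in the abstract (N-Gram key, sorting, sliding window, recursive field matching, priority-based cleaning) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the key formula, the matching rules, or the priority definition, so everything below is an assumption made for illustration. The names ngram_key, rfm, similarity, priority, find_duplicates and clean are hypothetical, the key is taken to be the few smallest character bigrams of a record, abbreviations are treated as prefixes, and priority is taken to be record completeness.

```python
import re

def ngram_key(record, n=2, k=4):
    # Assumed key: the k lexicographically smallest character n-grams
    # found inside the words of the record (not the thesis's formula).
    words = re.findall(r"[a-z0-9]+", " ".join(record).lower())
    grams = {w[i:i + n] for w in words for i in range(len(w) - n + 1)}
    return tuple(sorted(grams)[:k])

def parse(record):
    # Record -> list of non-empty fields, each field a list of lowercase words.
    fields = [re.findall(r"[a-z0-9]+", f.lower()) for f in record]
    return [f for f in fields if f]

def rfm(a, b):
    # Recursive field matching: atomic strings score 1.0 when equal or when
    # one is a prefix (assumed abbreviation) of the other, else 0.0.
    # Composite values score the mean, over the parts of a, of the best
    # match each part finds among the parts of b.
    if isinstance(a, str) and isinstance(b, str):
        return 1.0 if a.startswith(b) or b.startswith(a) else 0.0
    if not a or not b:
        return 0.0
    return sum(max(rfm(x, y) for y in b) for x in a) / len(a)

def similarity(a, b):
    # rfm is asymmetric; taking the larger direction (an assumption) keeps a
    # record with missing fields from being penalized.
    return max(rfm(a, b), rfm(b, a))

def priority(record):
    # Assumed priority: more non-empty fields first, then more characters.
    return (sum(1 for f in record if f.strip()), sum(len(f) for f in record))

def find_duplicates(records, window=3, threshold=0.8):
    # Sort record indices by key, then compare each record only with the
    # window - 1 records that precede it in sorted order.
    order = sorted(range(len(records)), key=lambda i: ngram_key(records[i]))
    parsed = [parse(r) for r in records]
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(parsed[i], parsed[j]) > threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)

def clean(records, window=3, threshold=0.8):
    # For each duplicate pair, drop the record with the lower priority.
    removed = set()
    for i, j in find_duplicates(records, window, threshold):
        if i in removed or j in removed:
            continue
        removed.add(j if priority(records[i]) >= priority(records[j]) else i)
    return [r for idx, r in enumerate(records) if idx not in removed]

records = [
    ("John Smith", "12 Oak Street", "Boston"),
    ("Smith John", "12 Oak St", ""),        # reversed order, abbreviation, missing field
    ("Mary Jones", "7 Elm Road", "Denver"),
    ("J Smith", "12 Oak Street", "Boston"),  # abbreviated first name
]
print(clean(records))
```

With these sample records, the first, second and fourth rows are flagged as approximate duplicates of each other, and only the most complete one is kept alongside the unrelated third row. Because each record is compared only with its neighbors inside the window, the number of comparisons grows with the window size times the number of records rather than with the square of the number of records.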
Keywords/Search Tags:data cleaning, approximately duplicated records, N-Gram, sliding window, priority