Research On Data Cleaning Of Approximately Duplicated Records

Posted on:2017-12-12

Degree:Master

Type:Thesis

Country:China

Candidate:K X Wang

GTID:2348330488966015

Subject:Computer application technology

Abstract/Summary:

With the continuous development of network technology and the widespread use of data storage technology, large volumes of data are generated every day. Much of this data is erroneous; in particular, merging databases produces many approximately duplicated records, which seriously degrade data quality and usability. Cleaning approximately duplicated records has therefore become an urgent problem, and solving it has practical significance for improving data quality. This thesis studies the cleaning of approximately duplicated records. It first analyzes the attribute structure and the causes of approximately duplicated records, then applies the N-Gram algorithm to each record to obtain a key value that represents the record's attributes. Following the sort-and-merge idea of data cleaning, the records in the database are sorted by this key to obtain an ordered database, and the similarity between records is then calculated. To improve the accuracy and efficiency of similarity matching, a sliding window algorithm places a fixed-size window over the data to be cleaned, and a recursive field matching algorithm calculates the similarity of the records inside the window. This approach identifies approximately duplicated records that contain missing fields, reordered terms, or abbreviations, which improves cleaning accuracy, and it reduces the number of comparisons between records, which improves cleaning efficiency. Finally, each record is assigned a priority, and records whose similarity exceeds a given threshold are cleaned according to that priority, which makes the cleaning more automated and reduces manual effort. By combining the N-Gram algorithm with the recursive field matching algorithm, this thesis implements the detection of approximately duplicated records and then cleans them to improve data quality.
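
The pipeline described in the abstract (N-Gram key, sorting, fixed sliding window, recursive field matching, priority-based cleaning) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the abstract does not say how the key, the similarity score, or the priority are computed, so the sorted-bigram key, the prefix-based token match, the averaging over fields, and the "more filled-in fields wins" priority are all assumptions, and every function name here is hypothetical.

```python
def tokens(value):
    """Lowercase a field value, split on whitespace, strip simple punctuation."""
    return [t for t in (w.strip(".,").lower() for w in value.split()) if t]

def ngrams(token, n=2):
    """Character n-grams of one token; short tokens are kept whole."""
    return [token[i:i + n] for i in range(len(token) - n + 1)] or [token]

def sort_key(record, n=2):
    """Assumed key: all n-grams of all tokens, sorted and joined.
    Sorting the n-grams makes the key independent of token order."""
    grams = [g for v in record.values() for t in tokens(v) for g in ngrams(t, n)]
    return "".join(sorted(grams))

def token_match(x, y):
    """Base case: tokens match if equal or one is a prefix (abbreviation) of the other."""
    return 1.0 if x.startswith(y) or y.startswith(x) else 0.0

def directed(ta, tb):
    """Average, over tokens of ta, of the best match found in tb."""
    return sum(max(token_match(x, y) for y in tb) for x in ta) / len(ta)

def field_similarity(a, b):
    """Field-level score; None means one side is missing and the field is skipped."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return None
    return (directed(ta, tb) + directed(tb, ta)) / 2

def record_similarity(r1, r2):
    """Average of field scores over the fields that both records fill in."""
    scores = []
    for field in r1:
        s = field_similarity(r1[field], r2.get(field, ""))
        if s is not None:
            scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0

def find_duplicates(records, window=3, threshold=0.8):
    """Sort by key, then compare each record only with its window neighbours."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if record_similarity(records[i], records[j]) > threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

def priority(record):
    """Assumed priority: number of non-empty fields."""
    return sum(1 for v in record.values() if tokens(v))

def indices_to_drop(records, pairs):
    """For each duplicate pair, drop the lower-priority record (ties keep the first)."""
    drop = set()
    for i, j in pairs:
        drop.add(j if priority(records[i]) >= priority(records[j]) else i)
    return drop

records = [
    {"name": "John Smith", "city": "New York"},
    {"name": "Smith John", "city": "New York"},   # reordered terms
    {"name": "J. Smith", "city": ""},             # abbreviation, missing field
    {"name": "Alice Brown", "city": "Boston"},
]
pairs = find_duplicates(records)
print(pairs)                                      # [(0, 1), (0, 2), (1, 2)]
print(sorted(indices_to_drop(records, pairs)))    # [1, 2]
```

With n records and a window of size w, this compares roughly n·(w−1) pairs instead of n·(n−1)/2, which is the efficiency gain the abstract attributes to the sliding window. The prefix rule in `token_match` is deliberately loose and would need tightening on real data.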

Keywords/Search Tags:

data cleaning, approximately duplicated records, N-Gram, sliding window, priority

Related items

1	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
2	Research On Detection Of Approximate Duplicate Records For Massive Data
3	Some Main Technology's Research Of Data Cleaning
4	Research Of Data Cleaning Method Based On Data Warehouse
5	Research And Implementation Of Data Cleaning System Based On Pre-Processing Techniques
6	Research On The Method Of Approximately Duplicated Records Detection For Text Data In Big Data Environment
7	Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning
8	Research On Data Cleaning Method Based On Optimal Feature Selection
9	Study And Application Of The Data Cleansing Technology In ETL
10	Research On Key Technologies For Data Extracting In Data Warehousing