
Research On Related Algorithms For Chinese Repeated Record Cleaning

Posted on: 2019-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
GTID: 2438330566990186
Subject: Computer technology

Abstract/Summary:
In this era of information explosion, data problems multiply. One of the most critical is the duplication of data caused by merging multiple data sources. When enterprises analyze erroneous "dirty data", the results are biased; in serious cases this leads to wrong judgments and ultimately wrong decisions, and poor data quality can become a prominent problem restricting the development of many industries. How to clean duplicate data has therefore become a research hotspot. At present, most duplicate-record cleaning techniques target English text; processing techniques for Chinese text are fewer, and domestic research results on this topic are limited. This thesis takes Chinese text data as the research object and cleans its duplicate records. The main work is as follows:

(1) The edit distance algorithm used for detecting similar duplicate Chinese records is studied, its shortcomings on Chinese records are analyzed, and the algorithm is improved accordingly. Experimental verification shows that the improved algorithm raises the accuracy of matching similar duplicate records.

(2) The shortcomings of the transitive closure method used when merging duplicate records are analyzed, and a new merging method based on the maximum relevance degree in a graph is proposed. This method avoids the erroneous merging of dissimilar records caused by transitive propagation. Experimental results show that the new method clearly improves the accuracy of merging similar duplicate records.

(3) Since the above merging method requires that similar records be effectively clustered, the SNM algorithm used when cleaning the record set is improved: the window size is determined by the number of records in each cluster, replacing the original fixed-size sliding window. Experimental analysis shows that the improved algorithm reduces time consumption while improving cleaning accuracy.
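As a concrete baseline for the detection step in (1), the classic edit (Levenshtein) distance can be computed per character, which works directly on Chinese strings since each Chinese character is one Unicode code point. This is only the unimproved baseline; the thesis's specific improvements for Chinese records are not detailed in this abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming over characters.
    Chinese strings work unchanged: Python iterates one character at a time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the previous row of the DP table
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))
```

Two records are then flagged as similar duplicates when this score exceeds a chosen threshold.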
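To illustrate the problem (2) addresses: transitive closure merges the chain a~b~c into one group even when a and c are dissimilar. A plausible sketch of relevance-based merging, assuming "relevance degree" means a record's average similarity to an existing cluster's members (the thesis's exact definition is not given in this abstract):

```python
def merge_by_relevance(pairs, threshold=0.8):
    """pairs: {(i, j): similarity} over record ids.
    A record joins the cluster with the highest average similarity
    (relevance degree) to its members, and only when that average
    clears the threshold; otherwise it starts a new cluster."""
    sim, nodes = {}, set()
    for (i, j), s in pairs.items():
        sim[(i, j)] = sim[(j, i)] = s
        nodes.update((i, j))
    clusters = []  # list of sets of record ids
    for v in sorted(nodes):
        best, best_rel = None, threshold
        for idx, c in enumerate(clusters):
            rel = sum(sim.get((v, u), 0.0) for u in c) / len(c)
            if rel >= best_rel:
                best, best_rel = idx, rel
        if best is None:
            clusters.append({v})
        else:
            clusters[best].add(v)
    return clusters
```

On the chain case — sim(a,b)=0.9, sim(b,c)=0.9, sim(a,c)=0.1 — transitive closure yields one group, while this sketch keeps c separate because its average relevance to {a, b} is only 0.5.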
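For context on (3), the standard SNM (Sorted Neighborhood Method) sorts records by a blocking key and compares each record only with its predecessors inside a sliding window. The sketch below shows the fixed-window baseline; the thesis's contribution is to derive the window size from the number of records per cluster instead, which is not reproduced here.

```python
def snm(records, key, is_match, window=5):
    """Sorted Neighborhood Method with a fixed-size sliding window.
    Sort by the blocking key, then compare each record against the
    `window - 1` records immediately before it in sorted order."""
    srt = sorted(records, key=key)
    matches = []
    for i, rec in enumerate(srt):
        for j in range(max(0, i - window + 1), i):
            if is_match(srt[j], rec):
                matches.append((srt[j], rec))
    return matches
```

A fixed window either misses duplicates (window too small for a large duplicate cluster) or wastes comparisons (window too large), which motivates sizing it per cluster as the thesis proposes.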
Keywords/Search Tags:Data cleaning, Chinese duplicate record, Edit distance algorithm, SNM algorithm