
Research On Related Algorithms For Chinese Repeated Record Cleaning

Posted on: 2019-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
GTID: 2438330566990186
Subject: Computer technology

Abstract/Summary:
In this era of information explosion, data problems multiply. One of the most critical is the duplication of data caused by merging multiple data sources. When enterprises analyze erroneous "dirty data", the results are biased; in serious cases this leads to wrong judgments and ultimately wrong decisions, and poor data quality can become a prominent problem restricting the development of many industries. How to clean duplicate data has therefore become a research hotspot. At present, most duplicate-record cleaning techniques target English text; processing techniques for Chinese text are fewer, and domestic research results on this topic are limited. This thesis takes Chinese text data as the research object and cleans its duplicate records. The main work is as follows:

(1) The edit distance algorithm used for detecting similar duplicate Chinese records is studied, its shortcomings on Chinese records are analyzed, and the algorithm is improved accordingly. Experimental verification shows that the improved algorithm raises the accuracy of matching similar duplicate records.

(2) The shortcomings of the transitive closure method used when merging duplicate records are analyzed, and a new merging method based on the maximum relevance degree in a graph is proposed. This method avoids the erroneous merging of dissimilar records caused by transitive propagation. Experimental results show that the new method clearly improves the accuracy of merging similar duplicate records.

(3) Since the above merging method requires that similar records be effectively clustered, the SNM algorithm used when cleaning the record set is improved: the window size is determined by the number of records in each cluster, replacing the original fixed-size sliding window. Experimental analysis shows that the improved algorithm reduces time consumption while improving cleaning accuracy.
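As a concrete baseline for the detection step in (1), the classic edit (Levenshtein) distance can be computed per character, which works directly on Chinese strings since each Chinese character is one Unicode code point. This is only the unimproved baseline; the thesis's specific improvements for Chinese records are not detailed in this abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming over characters.
    Chinese strings work unchanged: Python iterates one character at a time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the previous row of the DP table
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))
```

Two records are then flagged as similar duplicates when this score exceeds a chosen threshold.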
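To illustrate the problem (2) addresses: transitive closure merges the chain a~b~c into one group even when a and c are dissimilar. A plausible sketch of relevance-based merging, assuming "relevance degree" means a record's average similarity to an existing cluster's members (the thesis's exact definition is not given in this abstract):

```python
def merge_by_relevance(pairs, threshold=0.8):
    """pairs: {(i, j): similarity} over record ids.
    A record joins the cluster with the highest average similarity
    (relevance degree) to its members, and only when that average
    clears the threshold; otherwise it starts a new cluster."""
    sim, nodes = {}, set()
    for (i, j), s in pairs.items():
        sim[(i, j)] = sim[(j, i)] = s
        nodes.update((i, j))
    clusters = []  # list of sets of record ids
    for v in sorted(nodes):
        best, best_rel = None, threshold
        for idx, c in enumerate(clusters):
            rel = sum(sim.get((v, u), 0.0) for u in c) / len(c)
            if rel >= best_rel:
                best, best_rel = idx, rel
        if best is None:
            clusters.append({v})
        else:
            clusters[best].add(v)
    return clusters
```

On the chain case — sim(a,b)=0.9, sim(b,c)=0.9, sim(a,c)=0.1 — transitive closure yields one group, while this sketch keeps c separate because its average relevance to {a, b} is only 0.5.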
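For context on (3), the standard SNM (Sorted Neighborhood Method) sorts records by a blocking key and compares each record only with its predecessors inside a sliding window. The sketch below shows the fixed-window baseline; the thesis's contribution is to derive the window size from the number of records per cluster instead, which is not reproduced here.

```python
def snm(records, key, is_match, window=5):
    """Sorted Neighborhood Method with a fixed-size sliding window.
    Sort by the blocking key, then compare each record against the
    `window - 1` records immediately before it in sorted order."""
    srt = sorted(records, key=key)
    matches = []
    for i, rec in enumerate(srt):
        for j in range(max(0, i - window + 1), i):
            if is_match(srt[j], rec):
                matches.append((srt[j], rec))
    return matches
```

A fixed window either misses duplicates (window too small for a large duplicate cluster) or wastes comparisons (window too large), which motivates sizing it per cluster as the thesis proposes.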
Keywords/Search Tags:Data cleaning, Chinese duplicate record, Edit distance algorithm, SNM algorithm