
Research On The Method Of Approximately Duplicated Records Detection For Text Data In Big Data Environment

Posted on: 2019-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: T Yu
Full Text: PDF
GTID: 2348330545992079
Subject: Computer Science and Technology
Abstract/Summary:
The effective detection of similar duplicate records is key to ensuring data quality and an important guarantee for obtaining reliable decision-making knowledge. However, as industrial systems become more intelligent, the scale of data has expanded exponentially, and most similar duplicate data in this rapidly growing large-scale data exists as text. Detecting similar duplicate records in text data under a big data environment is therefore of great significance for improving data quality.

This paper converts the detection of similar duplicate records in text data into the detection of similarity between their binary strings. The traditional Simhash algorithm provides this capability, but the conversion accuracy between text data and Simhash fingerprints (binary strings) is low, and similarity matching among Simhash fingerprints is inefficient. Therefore, this paper first introduces a missing-data filling method based on the Neville interpolation algorithm to fill in missing values in the original data. The Institute of Computing Technology Chinese Lexical Analysis System is then used to extract keywords from the filled data records, and Term Frequency-Inverse Document Frequency is used to calculate the weights of those keywords. In this way, the conversion accuracy between text data and Simhash fingerprints is improved. Secondly, a fingerprint classification strategy based on graph cluster analysis is designed, and the Hamming distance is introduced to address the inefficiency of fingerprint similarity matching. Finally, a Similar Record Detection Algorithm (SRDA) based on the improved Simhash is proposed to convert text data into Simhash fingerprints and thereby detect similar duplicate records in text data.

The large scale of text data in the big data environment makes it difficult for single-machine computing resources to meet its processing requirements. Therefore, for the problem of detecting similar duplicate records in large-scale text data, a detection method based on the MapReduce model is proposed. Firstly, a Simhash fingerprint inverted indexing algorithm based on Dirichlet's drawer (pigeonhole) principle is designed, and the SRDA is optimized with this algorithm to avoid bit-by-bit sequential comparison among Simhash fingerprints. Finally, a new parallel algorithm based on the MapReduce model and the optimized SRDA is designed, realizing parallel detection of similar duplicate records in large-scale text data.

Text data in the big data environment is generated at high speed, so a high-response processing technology is needed. Compared with Spark, MapReduce has the advantage of high throughput, but its execution speed is relatively slow. For this reason, aiming at the fast detection of similar duplicate records in text data, a detection method based on Spark is proposed. Firstly, in view of the advantages of Spark's in-memory computing, a Simhash fingerprint search strategy based on graph theory is designed. Then, a similar duplicate records detection algorithm is designed based on the SRDA, and the algorithm is implemented on the Spark platform to complete the fast detection of similar duplicate records in text data.

The proposed methods are tested against related algorithms using data from UCI. The experimental results show that the proposed methods can accurately and objectively detect similar duplicate records in text data under a big data environment, with high detection accuracy, recall, and execution efficiency. They can provide a reference for future data cleaning research.
Keywords/Search Tags: Text big data, Similar duplicate records detection, MapReduce, Spark, Graph theory analysis