Research On Duplicate Detection Of Data Quality In Big Data

Posted on: 2018-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: K Hu
Full Text: PDF
GTID: 2428330572450786
Subject: Computer application technology
Abstract/Summary:
In the era of big data, data have become a valuable corporate asset. Rational analysis and mining of enterprise data assets can provide a sound basis for management, control, and scientific decision-making, and can reduce or eliminate risks in enterprise economic activities. Enterprises need highly accurate data in order to make good decisions, yet real-world data sources frequently contain similar duplicate records and other "dirty data"; such dirty data lead to wrong analysis results, which in turn distort decision-making. This thesis studies duplicate detection as a core problem of data quality and makes three contributions.

(1) To reduce detection cost and improve running efficiency, this thesis presents a similar-duplicate-record detection algorithm built on traditional windowing and blocking techniques. The algorithm sorts and partitions the data set by key fields and uses a sliding window to limit comparisons between blocks (a minimal sketch of the windowed comparison follows this abstract). On this basis, an improved multi-sort-field variant is designed that clusters records on different fields; the improvement reduces the number of records compared during detection and lessens the influence of the field choice on the algorithm's speed. Theoretical analysis and experimental results show that the algorithm effectively improves both the accuracy and the time efficiency of similar duplicate record detection.

(2) For similar duplicate record detection over massive data sources, the algorithm is parallelized under the MapReduce model: the data set is cut into slices that are processed in parallel, giving the algorithm high speed (a sketch of the MapReduce data flow also follows this abstract). Theoretical analysis and experimental results show that the parallel version speeds up duplicate detection without reducing the recall or the precision of the original algorithm.

(3) Based on an understanding of the duplicate detection process and an analysis of common data problems, a data-uniqueness quality inspection tool was designed and developed. The tool performs uniqueness detection and data quality analysis to help enterprises understand their data, and supports capability maturity evaluation of business application systems. It can provide effective help in uncovering the problems and deficiencies of application systems and in forecasting the operating status and key points of the future operation of the national grid.
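To make the windowed comparison in contribution (1) concrete, here is a minimal Python sketch of the sliding-window step. The key-field choice, the similarity measure (difflib's SequenceMatcher as a stand-in for a field-weighted measure), the window size, and the threshold are all illustrative assumptions, not the thesis's actual parameters.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for the thesis's measure."""
    return SequenceMatcher(None, a, b).ratio()

def detect_duplicates(records, key, window=5, threshold=0.85):
    """Sort records on a key field, then compare each record only with
    the records inside a fixed-size sliding window."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # Limit comparisons to the window that follows the current record.
        for j in range(i + 1, min(i + window, len(ordered))):
            if similarity(key(rec), key(ordered[j])) >= threshold:
                pairs.append((rec, ordered[j]))
    return pairs

# Example: two spellings of the same customer are flagged as a likely duplicate.
customers = [
    {"name": "Zhang Wei", "city": "Beijing"},
    {"name": "Zhang We",  "city": "Beijing"},
    {"name": "Li Na",     "city": "Shanghai"},
]
print(detect_duplicates(customers, key=lambda r: r["name"]))
```

Sorting first means only near neighbors in key order are ever compared, which is what keeps the detection cost far below an all-pairs scan.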
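The following sequential sketch illustrates, under the same assumptions, how the MapReduce variant in contribution (2) splits the work: a map step emits a blocking key per record, a shuffle groups records into slices, and a reduce step runs the windowed detection on each slice independently. It reuses detect_duplicates from the previous sketch; the blocking key here is hypothetical, and a real deployment would run the map and reduce functions on a cluster framework such as Hadoop rather than in-process.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (blocking key, record). The key here is a hypothetical
    choice (first letter of the name field), not the thesis's sort key."""
    return record["name"][:1].lower(), record

def shuffle(mapped):
    """Shuffle: group mapped records by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for k, rec in mapped:
        groups[k].append(rec)
    return groups

def reduce_phase(block):
    """Reduce: run the windowed detection on one data slice independently,
    reusing detect_duplicates() from the previous sketch."""
    return detect_duplicates(block, key=lambda r: r["name"])

def mapreduce_detect(records):
    groups = shuffle(map(map_phase, records))
    return [pair for block in groups.values() for pair in reduce_phase(block)]
```

Because each slice is reduced independently, recall and precision match the single-machine algorithm as long as true duplicates share a blocking key, which is why the abstract reports no loss on either indicator.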
Keywords/Search Tags:Duplicate detection, Data quality, Capability maturity model, Data uniqueness