
Research On Signature-based Data Integrity Evaluation Technology

Posted on: 2021-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: A M Wu
GTID: 2438330602998338
Subject: Computer technology
Abstract/Summary:
As data volumes grow, the usability of big data has continued to decline, and data quality issues have become increasingly prominent. Low-quality data often leads to misleading analysis results and biased decisions, and in turn to loss of revenue, reputation, and customers. To make data meet the needs of users under different operations, research on data quality has become an important task in the field of data management. Completeness is one of the core criteria for measuring data quality. Data completeness refers to how completely the data covers the part of the objective world it describes. Data sets with high completeness can help companies conduct reputation assessment, result analysis, and decision-making, so assessing completeness is essential for identifying high-quality data sets.

However, in a big data integration environment, evaluating the completeness of a data set faces several challenges. First, an accurate assessment of record completeness requires access to the records of all data sources, which at big data scale incurs a time cost that is unrealistic. Second, in practice many data sets do not have uniquely identifying record IDs. In addition, records describing the same entity may be described inconsistently in different data sets, which makes it difficult to determine whether two records refer to the same real-world entity. Third, many studies lack a unified data completeness evaluation model, so no record completeness score can be given for a data set.

This paper studies the record completeness of a target data set from the perspective of multiple data sources. Drawing on theories and techniques of data compression, it considers completeness evaluation based on record IDs and on record matching rules. The main research results are as follows:

1. This paper studies record completeness evaluation based on record IDs. A record completeness evaluation of a data set from the perspective of multiple data sources is proposed, and a randomized algorithm that uses data compression to construct signatures over the data sources is designed. The algorithm evaluates the record completeness of the target data set without directly accessing the record IDs, and its effectiveness and efficiency are analyzed. Experimental results on real and synthetic data sets show that the proposed randomized algorithm can effectively evaluate the completeness of the target data set and is significantly more efficient than the exact algorithm used for comparison.

2. This paper studies record completeness evaluation based on record matching rules. When a data set has no unique identifier, record matching rules are used to identify records that refer to the same entity. First, a data completeness evaluation model based on record matching rules is defined. Second, a two-level randomized algorithm is designed: it uses the record matching rules to construct first-layer signatures at the record level, and then constructs a second-layer signature at the data set level from the resulting set of first-layer signatures. Experimental results on real data sets show that the two-level randomized algorithm scales well and can effectively evaluate the completeness score of the target data set based on record matching rules.
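The abstract does not say how the signatures are built, so the following is only an illustrative sketch of the ID-based setting, not the thesis's algorithm. It assumes a bottom-k (k-minimum-values) sketch as the signature, and it assumes that completeness means the share of distinct record IDs across all data sources that also appear in the target data set. The names `h64`, `signature`, `merge`, and `estimate_completeness` are hypothetical.

```python
import hashlib
import heapq


def h64(record_id):
    """Map a record ID to a 64-bit integer with a fixed hash (assumed choice: SHA-256)."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(record_ids, k=256):
    """Bottom-k signature: the k smallest distinct hash values of the record IDs."""
    return set(heapq.nsmallest(k, {h64(r) for r in record_ids}))


def merge(signatures, k=256):
    """Signature of the union of several data sets, computed from their signatures only."""
    return set(heapq.nsmallest(k, set().union(*signatures)))


def estimate_completeness(target_sig, source_sigs, k=256):
    """Estimate |target| / |union of target and sources| from signatures alone.

    A hash among the k smallest of the union that belongs to the target is
    necessarily among the k smallest of the target, so intersecting the two
    signatures counts exactly the sampled records that the target covers.
    """
    union_sig = merge([target_sig, *source_sigs], k)
    if not union_sig:
        return 0.0
    return len(union_sig & target_sig) / len(union_sig)


if __name__ == "__main__":
    # Toy data: the target holds 6 of the 10 distinct IDs known to the sources.
    target = signature(range(1, 7), k=64)
    sources = [signature(range(1, 9), k=64), signature(range(5, 11), k=64)]
    print(estimate_completeness(target, sources, k=64))
```

Only the fixed-size signatures are exchanged, so the raw record IDs never leave their data source. When the union holds fewer than `k` distinct IDs the estimate is exact; otherwise it is a sample-based estimate whose error shrinks as `k` grows.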
Keywords/Search Tags:Data quality, Completeness, Record matching rule, Signature