
Research On Signature-based Data Consistency Evaluation Technology

Posted on: 2021-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: M Huang
Full Text: PDF
GTID: 2438330602498311
Subject: Software engineering
Abstract/Summary:
As the information society develops, information has become a strategic resource and a factor of production: it underpins the normal operation of society and is the lifeline of enterprises. Yet the current level of data quality is far from satisfactory. Large sums are spent worldwide every year on ensuring data quality or on fixing problems caused by erroneous or inaccurate data. Data quality has therefore drawn growing attention from government departments, research institutions, and enterprises. Decisions based on wrong data easily lead to negative results. Inconsistency in particular lowers data quality, which leads to poor decisions and significant losses, so the consistency of data must be evaluated in order to support correct decisions.

The mainstream approach to consistency evaluation is to assess a data set through semantic consistency. With data volumes exploding, however, many challenges remain. The biggest is that an accurate assessment requires visiting the relevant data sources one by one; because the number of data sources is very large, accessing all of them is costly. To address this, we propose evaluating the consistency of a data set without directly accessing the data sources. To make the evaluation both effective and efficient, this thesis studies the consistency of a target data set from the perspective of multiple data sources. The main results are as follows:

1. This thesis designs a framework for measuring and evaluating data consistency. It studies consistency assessment under multiple data sources, proposes consistency measures for that setting, and gives a precise algorithm for consistency assessment.

2. This thesis proposes a consistency evaluation algorithm based on functional dependencies (FDs). First, we use the min-hash technique to design an effective signature that is much smaller than the original data source. Next, building on the proposed evaluation framework, we use the generated signatures to evaluate the consistency of the data set. Finally, we run comparative experiments on two real data sets; the results confirm the correctness and effectiveness of the algorithm.

3. This thesis proposes a consistency evaluation algorithm based on matching dependencies (MDs). Because matching dependencies involve fuzzy matching, we design a two-layer min-hash-based signature to evaluate the consistency of a data set against multiple sources under matching dependencies. We first construct the first-layer signature, then apply the min-hash technique to the first-layer signature to compress the data again, so that the resulting signatures can be used to assess the consistency of the data sources efficiently. Finally, we run experiments on two real-world data sets; the results show that the algorithm is fast and effective.
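To make the signature idea concrete, here is a minimal, generic min-hash sketch. It is not the thesis's algorithm: the function names, the use of `hashlib.blake2b` as the hash family, the default of 256 hash functions, and the encoding of records as `"key->value"` strings are all assumptions made for illustration. It only shows how two sources can be compared through small fixed-size signatures instead of their full contents.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of `value`, varied by `seed` (illustrative choice)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=256):
    """Return a min-hash signature: one minimum hash value per seeded hash function."""
    items = set(values)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(v, seed) for v in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to sanity-check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical sources: each record is encoded as "zip->city".
    source_a = {f"{10000 + i}->city{i}" for i in range(0, 200)}
    source_b = {f"{10000 + i}->city{i}" for i in range(100, 300)}
    sig_a = minhash_signature(source_a)
    sig_b = minhash_signature(source_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(source_a, source_b))
```

Each signature holds `num_hashes` integers regardless of how large the source is, which is the property that lets a comparison avoid reading every source in full. More hash functions reduce the variance of the estimate at the cost of a longer signature.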
Keywords/Search Tags:Data quality, completeness, record matching rule, signature