Research On Detection And Repair Of Structured Data Availability Violation

Posted on: 2019-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Zhu
Full Text: PDF
GTID: 2428330566998086
Subject: Computer Science and Technology
Abstract/Summary:
The IT industry is developing rapidly, and large volumes of data have accumulated in many fields. Structured relational data accounts for most of this volume because the relational model was proposed early and is easy to understand. As data volume has grown, low-quality data has grown with it, seriously reducing data availability and causing many adverse consequences. The availability of big data has therefore been widely studied in academia and industry in recent years.

Data consistency is an important sub-property of data availability. Improper data model design and the integration of multiple data sources can both lead to data inconsistencies. Conditional functional dependencies are a mechanism for expressing data consistency through semantic rules, and they are of great significance for detecting and repairing consistency violations. This thesis first points out where conditional functional dependencies fall short in expressing data consistency. It then proposes a semantic extension of conditional functional dependencies that strengthens their ability to express rule constraints and, building on the error detection and error correction strategies of existing theory, proposes corresponding SQL queries and a procedure-based detection and repair program. Experiments show that the detection and repair algorithm achieves high error detection and error correction rates and is practically feasible. The enhanced conditional functional dependencies can also express existing conditional functional dependency and extended conditional functional dependency rules, so they are compatible with existing theory. In addition, high-quality data is defined in terms of functional dependencies, which integrates high-quality data into the theoretical framework of conditional functional dependencies and defines its guiding role in data error detection and repair.

Data integrity is another important sub-property of data availability. Manual input errors, missing null constraints, and attribute identification in semi-structured data can all result in a loss of data integrity. An improved k-NN algorithm is used to detect and repair data integrity violations. For the distance metric, normalized distances are calculated separately for numerical, categorical, and text data types so that the distance between tuples is measured reasonably. For the choice of k, the mean value of k is evaluated and a method for dynamically selecting the k parameter is adopted. Experimental results show that this method fills missing values well.

To detect and repair data errors in big data in a distributed parallel computing environment, this thesis proposes algorithms in the MapReduce programming framework: detection and repair algorithms for data consistency violations based on enhanced conditional functional dependencies, Map Join and Reduce Join algorithms based on master data, and k-NN-based detection and repair algorithms for integrity violations. A series of experiments was designed and carried out to demonstrate the effectiveness of these algorithms.
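
The k-NN repair of integrity violations described in the abstract relies on a per-type normalized distance between tuples. The following is a minimal Python sketch of that idea, not the thesis's implementation. The schema, the sample tuples, the fixed k = 2 (the thesis selects k dynamically by a method the abstract does not specify), the range normalization for numerical values, and the use of `difflib.SequenceMatcher` as the text metric are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical schema: attribute name -> kind ("num", "cat" or "text").
SCHEMA = {"age": "num", "city": "cat", "job": "text", "salary": "num"}


def numeric_ranges(rows, schema):
    """Return {attr: (min, max)} over the non-missing numeric values."""
    ranges = {}
    for attr, kind in schema.items():
        if kind == "num":
            vals = [r[attr] for r in rows if r[attr] is not None]
            ranges[attr] = (min(vals), max(vals))
    return ranges


def attr_distance(a, b, kind, value_range=None):
    """Distance in [0, 1] between two non-missing values of one attribute."""
    if kind == "num":
        lo, hi = value_range
        return abs(a - b) / (hi - lo) if hi > lo else 0.0
    if kind == "cat":
        return 0.0 if a == b else 1.0
    # Text: 1 - similarity ratio (an assumed stand-in for the thesis's metric).
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def tuple_distance(r1, r2, schema, ranges, skip):
    """Mean per-attribute distance over attributes present in both tuples."""
    dists = [
        attr_distance(r1[a], r2[a], kind, ranges.get(a))
        for a, kind in schema.items()
        if a != skip and r1[a] is not None and r2[a] is not None
    ]
    return sum(dists) / len(dists) if dists else 1.0


def impute(rows, schema, target, k=2):
    """Return copies of rows with missing `target` filled from k nearest donors."""
    ranges = numeric_ranges(rows, schema)
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[target] is None and donors:
            nearest = sorted(
                donors,
                key=lambda d: tuple_distance(row, d, schema, ranges, target),
            )[:k]
            values = [d[target] for d in nearest]
            if schema[target] == "num":
                row[target] = sum(values) / len(values)  # mean of neighbors
            else:
                row[target] = Counter(values).most_common(1)[0][0]  # majority
        filled.append(row)
    return filled


rows = [
    {"age": 25, "city": "Harbin", "job": "software engineer", "salary": 9000},
    {"age": 27, "city": "Harbin", "job": "software developer", "salary": 10000},
    {"age": 52, "city": "Beijing", "job": "professor", "salary": 20000},
    {"age": 26, "city": "Harbin", "job": "software engineer", "salary": None},
]
result = impute(rows, SCHEMA, "salary", k=2)
print(result[3]["salary"])  # prints 9500.0
```

Keeping every per-attribute distance in [0, 1] stops a wide-ranging attribute such as salary from dominating the tuple distance, which is the stated purpose of normalizing by data type.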
Keywords/Search Tags: data quality, data availability, data consistency, data integrity, conditional functional dependency