
Research On The Key Technologies Of Distributed Big Data Consistency Management

Posted on: 2018-05-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W B Li
Full Text: PDF
GTID: 1368330563995798
Subject: Computer Science and Technology
Abstract/Summary:
The continuous development of new technologies and the rapid growth of data have accelerated the arrival of the big data era. Statistics show that the amount of data generated every day has reached the exabyte level, and the scale keeps increasing. Such volumes are too large for traditional data management technology to analyze and process, so the data must be processed in parallel on large-scale server clusters. Big data does not only mean that the volume of data is large; it also means that the value of the data is great. For example, fully exploiting and mining big data has an important impact on enterprise decision-making and future development. To realize this value, big data must meet certain data quality requirements: high quality is the basis and premise for big data to be effective. The dimensions of big data quality include consistency, accuracy, and others, among which consistency is an important part. To characterize data inconsistency, implicit constraint rules such as functional dependencies must be discovered from the data; to improve data quality, inconsistencies must be detected, that is, the data violating the constraint rules must be found.

This dissertation mainly studies problems related to the consistency dimension of big data quality, focusing on constraint rule discovery from big data, inconsistency detection in big data, and automatic data repairing. Through the study of the consistency problem, we aim to find the constraint rules and the data that violates functional dependencies, which provides the basis for repairing big data. Data repairing based on statistical learning theory is also studied, and an automatic data repairing approach is proposed. The main research contents and contributions are summarized as follows:

(1) Distributed big data functional
dependency discovery methods. We pinpoint the challenges of functional dependency discovery and the shortcomings of existing discovery methods, and propose discovery approaches suitable for horizontally and vertically partitioned distributed big data. A search policy for candidate functional dependencies is presented, together with a response-time cost model for the discovery process. The load distribution problem is formulated as an integer program, and an approximately optimal solution is presented. Pruning policies for the discovery problem are proposed: the local verification results for candidate functional dependencies at each site are used to pass messages and prune the candidate set, which improves the efficiency of discovery. Experimental results on real and synthetic datasets show that the proposed approaches scale well in the number of sites, the volume of data, and the number of attributes.

(2) Distributed big data approximate functional dependency discovery method. We point out the applications of approximate functional dependencies, the challenges of discovering them, and the state and shortcomings of current research. An approach for discovering approximate functional dependencies from horizontally partitioned distributed big data is proposed, along with the corresponding search policies. To improve efficiency, pruning policies are proposed: intermediate results are used to prune the candidate set, and the pruning effect is analyzed quantitatively. As the task assignment problem is NP-hard, an approximately optimal task allocation method was
proposed. Experimental results show that the proposed method for discovering approximate functional dependencies from distributed big data scales better in the volume of data and the number of sites than a centralized method.

(3) Inconsistency detection methods for distributed big data. We point out that existing detection methods are suitable only for centralized data and have low efficiency. To improve the efficiency of inconsistency detection on big data, approaches are proposed for detecting violations of a single functional dependency and of multiple functional dependencies in distributed data. The correctness of the results and the parallel execution of the algorithms are ensured by redistributing the data with a hash function. Because inconsistency detection in distributed data is NP-hard, an approximately optimal solution is proposed. For detecting violations of multiple functional dependencies, the dependencies are grouped according to their structural features and checked in parallel in batches, and the grouping optimization problem is studied. A universal algorithm based on equivalence classes is proposed for detecting violations of multiple functional dependencies in distributed data, together with a response-time cost model and an approximately optimal task allocation algorithm. The dynamic load balancing problem is formulated as a quadratic program, and the Lagrange multiplier method is used to obtain an approximately optimal solution. Experimental results show that the proposed methods scale well in the volume of data, the number of sites, and the number of functional dependencies, and have clear advantages in reducing response time.

(4) Statistical learning based automatic data cleaning method. We analyze the limitations of
existing data cleaning methods and the challenges of data cleaning, and propose an automatic data cleaning method based on statistical learning and probabilistic inference. The method is suitable for cleaning data without pre-existing data quality patterns or rules and without the involvement of human experts. A data model is learned from the data (or a sample of it) and transformed into first-order logic formulas; the weights of the formulas are calculated, and the formulas are then translated into inference rules on DeepDive. The inference results of these rules on the DeepDive platform are used to repair erroneous data. Experimental results on real and synthetic datasets show that the proposed method outperforms an existing Bayesian data cleaning method in precision, recall, and F-measure.
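To make the core notion of contribution (1) concrete: verifying whether a candidate functional dependency X → Y holds reduces to checking that no two tuples agree on X while disagreeing on Y, i.e., every equivalence class over X maps to a single Y value. A minimal single-site sketch (not the dissertation's distributed algorithm; attribute names and the dict-of-rows representation are illustrative assumptions):

```python
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds:
    every group of rows agreeing on lhs must also agree on rhs."""
    seen = {}  # lhs-value tuple -> first rhs-value tuple observed
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False
        seen.setdefault(key, val)
    return True

def violations(rows, lhs, rhs):
    """Return the lhs equivalence classes that violate lhs -> rhs,
    i.e., classes with more than one distinct rhs value."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].add(tuple(row[a] for a in rhs))
    return {k: v for k, v in groups.items() if len(v) > 1}
```

In a distributed setting, each site would run such a check locally on candidate dependencies and exchange only the verification outcomes, which is what enables the message-passing pruning described above.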
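For contribution (2), approximate functional dependencies are commonly judged by an error measure such as g3: the minimum fraction of tuples that must be removed for the dependency to hold exactly, with the dependency accepted when the error is below a threshold. A sketch under that assumption (the dissertation's own error measure and distributed computation may differ):

```python
from collections import Counter, defaultdict

def g3_error(rows, lhs, rhs):
    """g3 error of lhs -> rhs: the minimum fraction of tuples to delete
    so the dependency holds exactly. Within each lhs equivalence class,
    keep the most frequent rhs value and count the rest as removals."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][tuple(row[a] for a in rhs)] += 1
    removed = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return removed / len(rows) if rows else 0.0
```

An approximate dependency would then be reported when, for example, `g3_error(rows, X, Y) <= 0.05`; the per-class counters are also the kind of partial statistic that sites can compute locally and merge.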
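The hash-based redistribution in contribution (3) can be illustrated as follows: if every tuple is routed to a site determined by a hash of its left-hand-side values, all tuples that could jointly violate X → Y land on the same site, so each site can detect violations locally and in parallel with no cross-site comparisons. A simplified sketch (the actual cost-model-driven allocation in the dissertation is more involved):

```python
import hashlib

def assign_site(row, lhs, num_sites):
    """Route a tuple by hashing its lhs values, so that all tuples
    sharing an lhs value are sent to the same site."""
    key = "|".join(str(row[a]) for a in lhs).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_sites

def redistribute(rows, lhs, num_sites):
    """Partition the tuples into per-site buckets; each bucket can then
    be checked for violations independently."""
    buckets = [[] for _ in range(num_sites)]
    for row in rows:
        buckets[assign_site(row, lhs, num_sites)].append(row)
    return buckets
```

This is why correctness is preserved under parallel execution: a violation is always witnessed by two tuples with equal lhs values, and those tuples are never separated by the hash partitioning.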
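Contribution (4) repairs data by weighting first-order rules and running probabilistic inference on DeepDive; that pipeline cannot be reproduced in a few lines, but its effect on a single functional dependency can be caricatured as a weighted vote within each violating equivalence class. In this deliberately crude stand-in, the "weight" is just the log-count of supporting tuples rather than a learned formula weight:

```python
import math
from collections import Counter, defaultdict

def repair(rows, lhs, rhs):
    """For each lhs equivalence class violating lhs -> rhs, overwrite
    minority rhs values with the highest-weighted candidate. log1p of
    the support count is a toy surrogate for learned rule weights."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][tuple(row[a] for a in rhs)] += 1
    repaired = []
    for row in rows:
        counts = groups[tuple(row[a] for a in lhs)]
        best = max(counts, key=lambda v: math.log1p(counts[v]))
        fixed = dict(row)
        for attr, value in zip(rhs, best):
            fixed[attr] = value
        repaired.append(fixed)
    return repaired
```

The actual method infers a marginal probability for each candidate value from many jointly weighted formulas, which is what lets it repair errors that no single rule decides on its own.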
Keywords/Search Tags: Distributed data, Big data, Inconsistency, Violation detection, Knowledge discovery, Data cleaning, Data quality