Research On Data Evolution And Provenance For Big Data

Posted on: 2019-10-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J X Xue
Full Text: PDF
GTID: 1488306338979329
Subject: Computer software and theory
Abstract/Summary:
There is still no agreement on the defining characteristics of "big data"; however, the "3V" model (Volume, Velocity, and Variety) is the most widely accepted. It implies not only massive volume but also greater variety and frequent updates. In the big data setting, data is continually created, collected, and transferred, so the dynamic evolution of data has become a basic characteristic of big data. Data evolution is the process of data generation and derivation over time. It is not unique to big data; it also arises in small web databases. Depending on granularity, data evolution can be divided into schema-level evolution and instance-level evolution. Schema-level evolution modifies the current model to meet user requirements, but it raises a series of problems, such as data migration, reconstruction of applications built on old versions, and data exchange. Schema-level evolution focuses mainly on the efficiency of evolution and ignores data cleaning and intermediate processing. Intermediate data sets have many levels and differ in origin, quality, and structure. It is therefore particularly important to analyze the processing information of data production and evolution, to evaluate the quality and accuracy of data, and then to correct the resulting data.

Data provenance describes and records the whole process by which data is produced and evolves. As a fine-grained form of data evolution, it has a variety of applications, including tracing evolution processes between different data sources and within the same source. Moreover, because uncertainty is inevitable as data evolves, data provenance can track both the origins and the evolution of that uncertainty. However, the high cost of provenance limits its application, especially because efficient provenance computation is the basis of data provenance. Aiming at the problems in schema evolution and data provenance, this dissertation studies the following aspects.

First, to automate schema evolution, a bidirectional on-demand schema evolution method driven by user queries is proposed, including: (1) using a large number of queries to determine the target schema. A schema lattice model is created to build schema trees from the properties of the lattice and the queries, and a greedy algorithm selects the optimal target schema from all candidate schemas; (2) a multi-version schema evolution method, namely an on-demand bidirectional schema evolution method that supports mapping inversion and composition. This method effectively supports applications built on old versions and reduces the storage cost of multi-version data.

Second, to address consistency maintenance across multiple versions when data is updated, insertion update feedback and deletion update feedback are analyzed separately, and a method for solving update feedback problems is proposed. The main contributions are: (1) applying an estimation method to determine the size of the virtual version, and thus whether insertion feedback has side effects, and extending the estimation method to the case with projection operations; (2) constructing derivation graphs from the virtual version to the current version with the help of semiring provenance. Based on this graph, the update feedback problem is reduced to a path cut-off problem on the graph, which yields a way to verify whether deletion update feedback has side effects and to find the optimal deletion strategy.

Third, to support the co-evolution of data and schema, a method based on pivot-dependency is proposed. The main contributions include: (1) proposing a more general dependency relationship, called pivot-dependency, which extends functional dependency by incorporating schema information. A set of inference rules is also proposed, so that implicit pivot-dependencies can be inferred from known ones, and a theoretical proof is provided; (2) analyzing the relationship between schema evolution and pivot-dependency. In addition, a set of inference rules for pivot-dependency during schema evolution is proposed, which effectively carries the pivot-dependencies over and preserves semantic relationships under schema evolution; (3) introducing an optimization method for schema evolution based on the algebraic properties of pivot-dependency, which effectively shortens the evolution path and reduces the computation cost.

Fourth, to address the computation and storage problems of semiring provenance, a semiring provenance method based on magic-test is proposed, including: (1) a semiring provenance optimization method based on magic-test, derived by analyzing the features of semiring computation, together with a dynamic multidimensional histogram estimation method that optimizes the magic method; (2) a semiring provenance computation optimization method based on the derivation tree, which effectively optimizes semiring provenance computation for complicated data evolution; (3) to solve the problems caused by recursive queries, an approximate semiring provenance computation method with essential annotations. This method satisfies most application requirements of annotated databases and effectively improves the efficiency of semiring provenance computation.

Fifth, computing semiring provenance requires accessing the data multiple times, which leads to high cost. To resolve this problem, an effective semiring provenance computation method is presented that avoids repeatedly accessing the same data. It mainly includes: (1) an approximate iterative method based on the Kleene sequence. This method improves the computational efficiency of semiring provenance by converting query processing into solving semiring polynomial equations, which avoids a large number of repeated accesses to the database; (2) based on an analysis of the complexity of semiring provenance computation, a new Newton-like iterative method, which optimizes semiring provenance computation by reducing the number of iterations. This approach further accelerates the computation of semiring provenance.
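
The idea of replacing repeated database access with the solution of semiring polynomial equations can be illustrated with a small sketch. It is not the dissertation's algorithm, only a generic Kleene fixed-point iteration under stated assumptions: the names `Semiring`, `kleene_solve`, and `path_system` are hypothetical, and the example query is a recursive path query over four edges a->b, b->c, a->c, c->b, whose provenance from source a satisfies P_ab = e_ab + e_cb * P_ac and P_ac = e_ac + e_bc * P_ab.

```python
import math
from typing import Any, Callable, Dict, List, NamedTuple, Tuple


class Semiring(NamedTuple):
    """A commutative semiring (K, add, mul, zero, one). Hypothetical helper type."""
    zero: Any                        # additive identity
    one: Any                         # multiplicative identity (kept for completeness)
    add: Callable[[Any, Any], Any]   # combines alternative derivations
    mul: Callable[[Any, Any], Any]   # combines jointly used inputs


# A monomial is (coefficient, [variable names]); an equation is a sum of monomials.
Monomial = Tuple[Any, List[str]]
System = Dict[str, List[Monomial]]


def kleene_solve(system: System, sr: Semiring, max_rounds: int = 50) -> Dict[str, Any]:
    """Iterate X_{k+1} = F(X_k) from X_0 = zero until the values stop changing."""
    current = {var: sr.zero for var in system}
    for _ in range(max_rounds):
        nxt = {}
        for var, monomials in system.items():
            total = sr.zero
            for coeff, factors in monomials:
                term = coeff
                for name in factors:
                    term = sr.mul(term, current[name])
                total = sr.add(total, term)
            nxt[var] = total
        if nxt == current:  # fixed point reached
            return current
        current = nxt
    raise RuntimeError("no fixed point within max_rounds")


TROPICAL = Semiring(zero=math.inf, one=0.0, add=min, mul=lambda a, b: a + b)
BOOLEAN = Semiring(zero=False, one=True,
                   add=lambda a, b: a or b, mul=lambda a, b: a and b)
COUNTING = Semiring(zero=0, one=1,
                    add=lambda a, b: a + b, mul=lambda a, b: a * b)


def path_system(e_ab: Any, e_bc: Any, e_ac: Any, e_cb: Any) -> System:
    """Provenance equations for paths from a, given the four edge annotations."""
    return {
        "P_ab": [(e_ab, []), (e_cb, ["P_ac"])],
        "P_ac": [(e_ac, []), (e_bc, ["P_ab"])],
    }


# Cheapest derivation cost of each path tuple (edge annotations are costs).
costs = kleene_solve(path_system(1.0, 2.0, 5.0, 1.0), TROPICAL)
# Plain reachability (edge annotations are truth values).
reachable = kleene_solve(path_system(True, True, True, True), BOOLEAN)
```

Here `costs` is `{'P_ab': 1.0, 'P_ac': 3.0}` and `reachable` maps both variables to `True`; the edge annotations are read once to build the equations, and every later round works only on the equations. In the `COUNTING` semiring the same cyclic system never stabilizes, because the cycle between b and c yields infinitely many derivations, so `kleene_solve` raises `RuntimeError`. This is one way to see why recursive queries call for approximate or accelerated iteration schemes.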
Keywords/Search Tags: schema evolution, data evolution, update maintenance, schema mapping, data provenance