Study On Methods In Provenance Data Reduction

Posted on: 2018-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: H J Mi
Full Text: PDF
GTID: 2348330512991045
Subject: Computer Science and Technology
Abstract/Summary:
Data provenance records and reproduces the process by which original data arrived at its current state. It offers distinct advantages in tracking migrated data, reconstructing data, and evaluating the trustworthiness of data, and has therefore been increasingly adopted in database systems, scientific workflows, and other fields. It is also promising for big data applications and information security.

However, the scale of provenance data has become a bottleneck in its application. In many provenance-aware systems, provenance data tends to occupy far more space than the target data in order to guarantee that the target data remains traceable. The problem is more prominent when data provenance is applied in big data engineering. Massive provenance data not only severely reduces query efficiency and makes storage, computation, and management costly, but also makes the results of lineage queries harder to understand, because the correlations in the data are complicated and detailed. All of this greatly reduces the quality of provenance data and directly hinders its popularization and application.

Current research on provenance data reduction, both domestic and overseas, is mainly based on compressing redundant data and filtering noise, neither of which can fundamentally decrease the scale of provenance data. This thesis studies methods for separating out the "cold data" and "intermediate data" that are rarely or never used, and studies how to coarsen provenance data based on the characteristics of nodes and the dependencies between nodes, on the premise of preserving the traceability of the provenance data.

The main work of this thesis covers three aspects. First, it studies hierarchical reduction of provenance data based on type: the provenance data is divided into multiple tiers according to the type of the target object, the tiers containing rarely or never used "cold data" are stripped, and the transitivity of dependency is then used to rebuild the dependencies between the retained data items. Second, it studies provenance data reduction based on centrality differentials: the start and/or end of a semantically meaningful task is identified from the differentials of the nodes' centrality, and the more influential data within the same task is then extracted. Third, it studies provenance data reduction based on clustering: the data is classified by clustering according to the correlation between data nodes and groups, and nodes that have no direct dependencies with other groups are then stripped.

The main innovative contributions of this thesis are as follows. First, it proposes a type-based hierarchical reduction method, which divides the provenance data into layers depending on the type of the target object and strips the layers containing "cold data". Second, it adopts a reduction method based on centrality differentials, which identifies the boundaries of a task from centrality differentials and then reduces the provenance data by extracting the boundary data, which is more influential and serves as the key provenance. Third, it proposes a clustering-based reduction method, which classifies the data by clustering according to the correlation of the provenance data and then strips the data that describes only the internal correlation of a group.

Finally, the proposed methods are evaluated experimentally on a standard provenance data set collected by PASSv2, which was designed and implemented at Harvard University. The experimental results demonstrate the effectiveness of these methods.
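As an illustration of the first method only, the following is a minimal Python sketch of stripping a "cold" type tier and rebuilding dependencies through transitivity. It is not the thesis's implementation: the function name `strip_cold_types`, the edge representation (`(child, parent)` pairs meaning "child depends on parent"), the type labels, and the example node names are all assumptions made for this sketch, and the provenance graph is assumed to be acyclic.

```python
from collections import defaultdict


def strip_cold_types(edges, node_type, cold_types):
    """Remove every node whose type is in cold_types and reconnect the
    retained nodes through the removed ones (transitivity of dependency).

    edges      -- iterable of (child, parent): child depends on parent
    node_type  -- dict mapping each node to its type label
    cold_types -- set of type labels to strip (hypothetical "cold" tiers)
    """
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    kept = {n for n in nodes if node_type[n] not in cold_types}

    def nearest_kept_ancestors(node):
        # Walk upward through stripped nodes only; stop at the first
        # retained node on each path.
        found, seen = set(), set()
        stack = list(parents[node])
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            if cur in kept:
                found.add(cur)
            else:
                stack.extend(parents[cur])
        return found

    return {(n, a) for n in kept for a in nearest_kept_ancestors(n)}


# Hypothetical example: out.txt <- proc2 <- pipe1 <- proc1 <- in.txt
edges = [
    ("out.txt", "proc2"),
    ("proc2", "pipe1"),
    ("pipe1", "proc1"),
    ("proc1", "in.txt"),
]
node_type = {
    "out.txt": "file",
    "proc2": "process",
    "pipe1": "pipe",
    "proc1": "process",
    "in.txt": "file",
}

# pipe1 is stripped; proc2 now depends directly on proc1.
reduced = strip_cold_types(edges, node_type, cold_types={"pipe"})
```

In this sketch a retained node is linked to its nearest retained ancestors, so any lineage path between two retained nodes in the original graph still exists in the reduced graph, which is the traceability property the abstract requires.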
Keywords/Search Tags: Data Provenance, Data Reduction, Centrality Analysis, Graph Clustering