Research On Key Technologies Of Big Data Provenance For Data Security Supervision

Posted on: 2022-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Gao
GTID: 1488306521457984
Subject: Computer Science and Technology
Abstract/Summary:
As a new and energetic realm of economic development, an innovative engine of social development, and a strategic tool for shaping national competitiveness, big data significantly affects people's lives. However, while big data is developing vigorously, it faces increasingly serious security threats. Big data security incidents have occurred frequently in recent years, and the security supervision capacity for big data does not match its important role. Data provenance, which describes the origins of a data object and the operations and processing by which it arrived at its current state, is an effective approach to data security supervision. However, owing to the characteristics of big data and big data systems, such as large scale, variety, distribution, and multiple users, the application of provenance in big data supervision faces several technical challenges, including provenance model construction, provenance tracing, integration, and quality analysis, which urgently need further research. Focusing on these challenges, this dissertation studies the key technologies of supervision-oriented big data provenance, so as to provide theory, technology, and data support for big data security supervision. The research results are as follows:

1. A big data system generally integrates heterogeneous data from different data sources and provides diverse data storage and processing frameworks. To support the security supervision of the various data objects and their operation and processing procedures in the big data system, a provenance model that can effectively represent the provenance information of various data types and diverse data storage and processing modes must be built in advance. To address the problem that existing provenance models do not adapt well to big data scenarios, a big data provenance model (BDPM) for data supervision is proposed. Firstly, by analyzing the characteristics of big data, the components of a typical big data system technology framework, and the requirements of data security supervision, the requirements for building a big data provenance model are proposed. Then, BDPM is proposed by extending the widely used provenance model PROV-DM via subtyping and new relation definitions. Representing provenance data as a directed acyclic graph, and following the main data types of big data and the main components of the big data system, BDPM refines the provenance node types and expands the provenance relationship types of PROV-DM to improve the representation capacity and supervision efficiency of provenance, and it is extensible so that it can adapt to the continuously evolving big data system. Finally, BDPM is evaluated against the proposed model building requirements. The results show that BDPM can effectively represent the entire procedure by which various multi-layer and multi-granularity data objects flow and evolve under the collective influence of multiple data storage, processing, and communication components in the big data system.

2. In the big data system, the provenance information required for data security supervision usually involves multiple users, applications, and working nodes. Currently, only the multi-log analysis-based provenance tracing method can obtain the complete provenance information required to represent the entire operation and processing history of a data object in the big data system. However, the provenance information that can actually be obtained is limited by the information the logs contain. The theoretical feasibility of this method, that is, whether the required provenance information can be completely obtained from the existing logs, needs to be proved before such a method is constructed. Considering the variety of provenance and log types and the complexity of data operation and processing procedures, a dedicated feasibility proof method is proposed. Firstly, a formal definition of provenance completeness and a method for proving it are proposed. Then, taking the Hadoop-based big data system as the research object, in order to prove the feasibility of Hadoop provenance tracing based on multi-log analysis, the required provenance information is specified according to BDPM and the Hadoop data supervision requirements, and 21 related Hadoop logs, as well as the log of Progger, an operating system-level provenance tracking tool, are investigated. Finally, by applying the proposed provenance completeness proof method, it is proved that, for the given provenance types, complete provenance information can be obtained from the above-mentioned logs. This result serves as the foundation for constructing a multi-log analysis-based Hadoop provenance generation method to promote effective data security supervision.

3. For the real-time generation of big data provenance based on multi-log analysis in a multi-user, multi-application, and distributed environment, a conjoint analysis method for heterogeneous logs from multiple sources, based on auxiliary data structures and multithreading, is proposed. Firstly, 10 logs are adopted and analyzed in parallel to obtain the provenance information required for Hadoop data security supervision. Secondly, 4 auxiliary data structures and 2 auxiliary files are constructed, and 4 child thread creation scenarios are proposed to improve the efficiency and ensure the correctness of log analysis. Then, on the basis of this log analysis framework, analysis methods are proposed for the various file operations recorded in different logs under different operation execution modes, parameters, and final statuses, different file types, sizes, and quantities, and different operation executor types, together with a method for coordinating the individual log analysis processes. Furthermore, a provenance invariant-based method for detecting the abnormal behaviors of MapReduce task workers is proposed and integrated into the log analysis procedure. Finally, the efficiency and correctness of the proposed methods, and their effectiveness in anomaly detection, are experimentally evaluated. The results show that the log analysis rate of the proposed methods is higher than the highest log generation rate, and that their accuracy can reach 100% when the time thresholds used to determine the operation type, object, etc., in the log analysis process are set correctly. The methods can therefore support the correct generation of provenance information in near real time, providing a strong data foundation for efficient detection of data security threats and accurate control of the data security situation. The proposed anomaly detection method can effectively detect the abnormal operations executed by MapReduce task workers.

4. Owing to the complexity of the provenance tracing environment and the provenance generation method, the obtained provenance data inevitably contain conflicts or contradictions in their description of the data state evolution process. That is, provenance inconsistencies exist in the obtained data, which reduces the effectiveness of provenance for data supervision. For the consistency checking of provenance data obtained in the distributed, multi-log scenario, a consistency checking method based on provenance graph query and ordered-sequence analysis of provenance nodes and dependencies is proposed. Firstly, 17 consistency rules on structure and attributes that a valid provenance graph should satisfy are proposed on the basis of BDPM. Then, using the Neo4j graph database to store the obtained provenance data, two consistency checking methods based on provenance graph query are proposed. The first method transforms violations of the consistency rules into database query conditions and checks the rules directly via provenance graph query. When provenance graph query alone cannot determine whether the provenance data violate the consistency rules, the second method first outputs the provenance nodes or relationships to be checked as an ordered sequence through provenance graph query, and then applies multi-dimensional attribute comparison of the sequence records for further checking. Experiments on public and generated provenance datasets demonstrate that the proposed method can effectively detect structure and attribute inconsistencies in a provenance graph and is efficient and scalable, which ensures the provenance efficiency for data supervision.
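
As a rough illustration of the two kinds of consistency checking described in item 4, the following Python sketch checks one structural rule (the provenance graph must be acyclic) and one attribute rule (an ordered sequence of dependencies is scanned and timestamps are compared). It is a minimal sketch under stated assumptions, not the dissertation's method: the node fields, the sample data, both rules, and the function names `has_cycle` and `time_violations` are hypothetical, and plain Python dictionaries stand in for the Neo4j graph database.

```python
from collections import defaultdict

# Hypothetical provenance records; field names are illustrative only.
nodes = {
    "f1": {"type": "entity", "time": 1},
    "job": {"type": "activity", "time": 2},
    "f2": {"type": "entity", "time": 3},
}
# Edge (src, dst) means "src depends on dst", i.e. dst existed first.
edges = [("job", "f1"), ("f2", "job")]


def has_cycle(nodes, edges):
    """Structure check (assumed rule): a provenance graph must be acyclic."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def visit(n):
        if state.get(n) == "open":
            return True  # reached a node already on the stack: cycle
        if state.get(n) == "done":
            return False
        state[n] = "open"
        found = any(visit(m) for m in succ[n])
        state[n] = "done"
        return found

    return any(visit(n) for n in nodes)


def time_violations(nodes, edges):
    """Attribute check (assumed rule): a node must not be older than
    the node it depends on. Dependencies are scanned as an ordered sequence."""
    ordered = sorted(edges, key=lambda e: nodes[e[1]]["time"])
    return [(s, d) for s, d in ordered if nodes[s]["time"] < nodes[d]["time"]]


# Usage: the sample graph is consistent; adding ("f1", "f2") breaks both rules.
print(has_cycle(nodes, edges), time_violations(nodes, edges))
bad_edges = edges + [("f1", "f2")]
print(has_cycle(nodes, bad_edges), time_violations(nodes, bad_edges))
```

In a graph database setting, the structural check would typically be expressed as a query that matches the violating pattern, while the second check mirrors the idea of exporting an ordered sequence and comparing attributes record by record.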
Keywords/Search Tags:Big data security, Data provenance, Data security supervision, Provenance model construction, Provenance tracing, Multi-log analysis, Provenance consistency check, Hadoop