
A Big Data Provenance System With Trustworthiness Evaluation And Collective Anomaly Detection Mechanism

Posted on: 2019-04-15    Degree: Master    Type: Thesis
Country: China    Candidate: R Y Wang    Full Text: PDF
GTID: 2428330590967482    Subject: Software engineering
Abstract/Summary:
Recently the era of big data has arrived: global data networks process enormous amounts of data every second. By 2017 the digital world held about 2.7 ZB of data in total, and this volume roughly doubles every year. People are becoming increasingly aware of the insights and hidden value in these numbers, and data mining, machine learning and related fields have been advancing at a pace we could never have imagined. Data pipelines are being adopted ever more broadly by scientists and engineers across domains; for big data analytics in particular, implementing and improving analytic pipelines and scientific workflows has become the central task of data engineers and data scientists.

To support this development process, the concept of "provenance" was proposed: the recorded history of a data workflow, covering both its operations and its results. Such history records help analysts understand pipeline details, enable finer-grained analysis of pipelines and workflows, and thereby ease pipeline modification and debugging. However, tools and platforms for big data provenance management have not yet been well studied or developed. Most state-of-the-art tools and frameworks are built on graph databases and suit only relatively rigid pipeline structures and semantics, while modern pipelines and workflows with complicated structures are modified and debugged ever more frequently. Moreover, pipeline scale is growing vastly, producing the problems known as "technical debt". The dramatic expansion of data volume also makes it hard for users to decide when and where intermediate results should be stored, and switching to a different storage strategy incurs unbearable cost. Furthermore, pipeline evaluation metrics targeting both data and operations have not received enough attention: there is no universal standard or solution for evaluating the processing quality and capability of a given pipeline. This is one of the significant challenges that big data analytics confronts, since human misconduct and malicious attacks can both compromise data quality and erode trust in operations.

This thesis addresses the collection and evaluation of big data provenance in two main respects: 1) the design and implementation of a stream-based provenance system, LogProv; and 2) the design and implementation of workflow evaluation mechanisms on top of that system.

LogProv is built on Apache Pig and Hadoop and consists of four functional parts: 1) a distributed computation cluster; 2) a distributed storage cluster; 3) a log warehouse; and 4) a central server that schedules the other three parts and provides the system's service interfaces. The computation cluster, scheduled mainly by the Apache Pig engine, performs the basic computational tasks, produces intermediate and final results, and emits workflow semantic information as it runs. This semantic information is captured by Pig User Defined Functions (UDFs) and sent to the central server; the same UDFs also notify the storage cluster to persist intermediate or final results. On receiving the semantic information, the central server stores it in the log warehouse and assigns a globally unique identifier to the corresponding intermediate or final result. After a pipeline completes, the central server can answer query requests issued by users by searching the log and data warehouses, reconstructing workflow semantics or returning stored data. Currently, the query syntax is the same as SQL.
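The abstract does not include the UDF code itself; the following is a minimal sketch, under assumptions, of how a pass-through Apache Pig EvalFunc could capture an operation's semantic information for a provenance log. The class name ProvenanceCapture, the operation-name constructor argument, and the reportToCentralServer helper (reduced here to a local log line) are hypothetical illustrations, not LogProv's actual interfaces.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

/**
 * Sketch of a provenance-capturing Pig UDF: it passes tuples through
 * unchanged while reporting the operation name and the record it saw,
 * so that a central server could store the semantics in a log warehouse
 * and assign a globally unique identifier to the produced data set.
 * The reporting mechanism below is a placeholder, not LogProv's own.
 */
public class ProvenanceCapture extends EvalFunc<Tuple> {

    private final String operationName;

    // Pig allows constructor arguments supplied via DEFINE in Pig Latin.
    public ProvenanceCapture(String operationName) {
        this.operationName = operationName;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return input;
        }
        reportToCentralServer(operationName, input.toString());
        return input;  // data flows on unchanged
    }

    private void reportToCentralServer(String op, String record) {
        // Placeholder for an HTTP/RPC call to the central server;
        // logging locally keeps this sketch self-contained.
        System.err.println("[provenance] op=" + op + " record=" + record);
    }
}
```

Such a UDF would typically be registered with REGISTER and bound to a name with DEFINE in the Pig Latin script, then wrapped around the relations whose lineage should be recorded.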
Using a simple demonstration workflow, LogProv computed a ranking of Wi-Fi hotspots in Geelong, Melbourne. The system stored semantic information according to the user's customised requirements and correctly stored both intermediate and final results, with an overall overhead of no more than 10%. LogProv also responded quickly to search requests from end users: request latency, including network transmission, was on the order of milliseconds.

LogProv integrates an evaluation mechanism based on the Elo algorithm. The algorithm treats every path leading to the same output as a competitor; each path receives a positive or negative score according to feedback from an oracle, and an optimal path is selected according to the final scores held by every node. Experiments showed that the algorithm can clearly distinguish operation nodes lying on different paths.

For evaluating individual data nodes, this thesis studies an anomaly detection mechanism based on statistical distance. The mechanism analyses similarities among data collections by learning the distributional features of the data sets represented by the same data node, and then reduces the task to one-dimensional point anomaly detection problems. The point anomaly problems obtained through this statistical-distance reduction exhibit better mathematical properties, so detection becomes easier and more precise. Experiments with this technique used Taobao online transaction records to detect "click farming" behaviour. The results showed that the algorithm yields effective classifiers whose sensitivity is finer than the magnitude of typical cheating behaviour and that adapt dynamically to changing sales patterns, demonstrating that the mechanism is highly practical for real-world problems.
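The abstract does not spell out the exact Elo update used for the path evaluation above, so the following is a minimal sketch of a standard Elo-style rating update in which two paths to the same output compete and the oracle's preference drives the score changes. The baseline rating of 1500 and the K-factor of 32 are conventional Elo defaults chosen for illustration, not values taken from the thesis.

```java
/**
 * Sketch of an Elo-style rating update for competing pipeline paths.
 * After an oracle judges which path produced the better output, the
 * preferred path gains rating and the other loses rating, in proportion
 * to how surprising the outcome was given the current ratings.
 */
public final class EloPathRating {

    private static final double K = 32.0;  // update step size (illustrative)

    /** Expected score of the first competitor against the second. */
    static double expectedScore(double ratingA, double ratingB) {
        return 1.0 / (1.0 + Math.pow(10.0, (ratingB - ratingA) / 400.0));
    }

    /**
     * Returns {newRatingA, newRatingB}; scoreA is 1.0 if the oracle
     * preferred path A and 0.0 if it preferred path B.
     */
    static double[] update(double ratingA, double ratingB, double scoreA) {
        double expectedA = expectedScore(ratingA, ratingB);
        double expectedB = expectedScore(ratingB, ratingA);
        double newA = ratingA + K * (scoreA - expectedA);
        double newB = ratingB + K * ((1.0 - scoreA) - expectedB);
        return new double[] {newA, newB};
    }

    public static void main(String[] args) {
        double[] r = update(1500.0, 1500.0, 1.0);  // oracle preferred path A
        System.out.printf("path A: %.1f, path B: %.1f%n", r[0], r[1]);
    }
}
```

After enough comparisons, the path whose operation nodes accumulate the highest ratings would be reported as the optimal path.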
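Likewise, the abstract names statistical distance as the basis of the data-node evaluation without fixing a particular distance measure or threshold. The sketch below shows one plausible reading of the reduction: summarise each data batch by its distance to a reference distribution, then run one-dimensional point anomaly detection over those distances. The symmetrised KL divergence over histograms and the 3-sigma threshold are assumptions made for illustration, not the thesis's exact method.

```java
import java.util.Arrays;

/**
 * Sketch of statistical-distance-based collective anomaly detection:
 * each batch of data flowing through a data node is summarised by its
 * distance to a reference distribution, which reduces the problem to
 * one-dimensional point anomaly detection over those distances.
 */
public final class DistanceAnomalyDetector {

    /** Symmetrised KL divergence between two discrete distributions. */
    static double symmetricKl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            double pi = Math.max(p[i], 1e-12);  // avoid log(0)
            double qi = Math.max(q[i], 1e-12);
            d += pi * Math.log(pi / qi) + qi * Math.log(qi / pi);
        }
        return d;
    }

    /** Flags batches whose distance to the reference is a 1-D point anomaly. */
    static boolean[] detect(double[][] batchHistograms, double[] referenceHistogram) {
        int n = batchHistograms.length;
        double[] distances = new double[n];
        for (int i = 0; i < n; i++) {
            distances[i] = symmetricKl(batchHistograms[i], referenceHistogram);
        }
        double mean = Arrays.stream(distances).average().orElse(0.0);
        double variance = Arrays.stream(distances)
                .map(x -> (x - mean) * (x - mean)).average().orElse(0.0);
        double threshold = mean + 3.0 * Math.sqrt(variance);  // 3-sigma rule
        boolean[] anomalous = new boolean[n];
        for (int i = 0; i < n; i++) {
            anomalous[i] = distances[i] > threshold;
        }
        return anomalous;
    }
}
```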
Keywords/Search Tags: Big Data, Provenance, Trustworthiness, Anomaly Detection, Distributed