
Research And Application Of Data Lineage Management Technology For ECRM

Posted on: 2016-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: X N Jiao
GTID: 2358330464963498
Subject: Engineering
Abstract/Summary:
The big data era is marked by the emergence of mass data with the 4V characteristics. With the rapid development of the Internet, the volume of data generated during network access (log data, personal data, communication records and video data), together with the data produced by a variety of intelligent terminals, has increased sharply. Faced with data sets of this size, most enterprises and customers find that the technologies in use today cannot manage and process the data effectively. Data provenance technology built on HDFS, HBase and Hive on a Hadoop cluster provides a feasible solution to this problem.

Data provenance is an integrated body of information that contains the source data, the target data evolved from the source data, and the data describing the evolution process; that is, it is the complete set of procedural information covering the initial data and the macro data produced by reprocessing. The purpose of data provenance management technology is to manage and trace data bidirectionally, so that the entire data set can be supervised through traceability.

The Hadoop cluster system, characterized by high reliability, high scalability, high efficiency, high fault tolerance and low cost, is a complete ecosystem that hosts many kinds of software, including data storage software, data mining software, distributed and parallel programming software, and distributed collaborative software. Among them, MapReduce, a programming model for distributed parallel computing, is mainly suited to batch processing of files, while Hive can be used as a data warehouse tool for data statistics and data queries. Integrating HBase and Hive can therefore greatly reduce the difficulty of data queries and statistical processing.
Taking full advantage of the high scalability and low cost of the Hadoop cluster therefore makes it possible not only to store huge amounts of structured, semi-structured and unstructured data on a cluster of commodity servers, but also to implement functions such as statistics and analysis.

Data provenance management technology is gradually coming into wide use abroad. Although domestic research has made some progress in both theory and practice, its application is still at an early stage. How to apply data provenance management technology, built on HBase and Hive on a Hadoop cluster, effectively and conveniently in government and enterprises is therefore a key research direction for the effective management and application of big data.

Based on an in-depth study of data provenance, data traceability models and methods, and the query rewriting technology of Perm, we focus on the Hadoop ecosystem architecture and its associated storage and data warehouse systems, and develop a data provenance management system for eCRM. The main contents cover the following two aspects.

1. Based on the Hadoop ecosystem, data provenance management technology is used to process and analyze internal data from the same source, implementing data query, statistical analysis and management of provenance data. Relying on provenance information, the system achieves audit and recovery functions by tracing the associated input data upward from the result data. In addition, by rewriting query statements with the query rewriting technique of Perm, it realizes the query and management of provenance information.

2. After describing the architecture of the data provenance management system, defining the provenance management operations, and integrating HBase and Hive, we developed a data provenance management system for eCRM and deployed it successfully on a Hadoop cluster.

Experimental results show that the data provenance management system for eCRM not only takes full advantage of the Hadoop ecosystem but also meets the demands of big data management well.
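
The bidirectional tracing described in the abstract can be illustrated with a minimal in-memory sketch. It is not the thesis's implementation, which is built on HBase and Hive; the class `LineageGraph`, its methods, and the data set names (`orders_raw`, `customers_raw`, and so on) are hypothetical and chosen only for illustration. The sketch records, for each derived data item, its source items and the processing step, and then walks those records upward (from result data to inputs) or downward (from a source to everything derived from it).

```python
from collections import defaultdict


class LineageGraph:
    """Toy provenance store (hypothetical; not the system from the thesis)."""

    def __init__(self):
        self.parents = defaultdict(set)   # derived item -> direct source items
        self.children = defaultdict(set)  # source item -> directly derived items
        self.process = {}                 # (source, derived) -> processing step

    def record(self, sources, derived, step):
        """Register that `derived` was produced from `sources` by `step`."""
        for src in sources:
            self.parents[derived].add(src)
            self.children[src].add(derived)
            self.process[(src, derived)] = step

    def _walk(self, start, edges):
        """Collect every item reachable from `start` along `edges`."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def trace_back(self, item):
        """Upward tracing: all inputs that contributed to `item`."""
        return self._walk(item, self.parents)

    def trace_forward(self, item):
        """Downward tracing: all items derived from `item`."""
        return self._walk(item, self.children)


# Hypothetical eCRM-style pipeline: two raw tables are joined, then aggregated.
graph = LineageGraph()
graph.record(["orders_raw", "customers_raw"], "orders_joined", "join")
graph.record(["orders_joined"], "sales_by_region", "aggregate")

print(sorted(graph.trace_back("sales_by_region")))
# ['customers_raw', 'orders_joined', 'orders_raw']
print(sorted(graph.trace_forward("customers_raw")))
# ['orders_joined', 'sales_by_region']
```

Upward tracing is what an audit or recovery function would rely on, since it identifies the inputs needed to check or rebuild a result; downward tracing identifies which results are affected when a source changes.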
Keywords/Search Tags: Hadoop Ecosystem, HBase, Hive, MapReduce, Data Provenance