
Design And Implementation Of A Metadata-driven Incremental ETL System In Distributed Environment

Posted on: 2018-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yang
Full Text: PDF
GTID: 2348330542451658
Subject: Engineering
Abstract/Summary:
Modern enterprises produce large volumes of heterogeneous logs every day, so there is an urgent need to establish a unified data warehouse model, whose core task is the design of ETL processes. Because logs from different business systems differ in form and content, practical application is often inconvenient, and a universally applicable audit platform is needed. This thesis proposes a universal storage model for metadata and, on that basis, develops a highly portable ETL system. The system can execute ETL processes on big data from different business systems; it standardizes the data and provides standardized dimensions for subsequent data analysis. The main research and work of this thesis are as follows:
1) Establish a data warehouse with a snowflake model to standardize the dimensions. Simplify the organization of metadata management by dividing metadata, according to function, into resource metadata, mapping metadata, and data warehouse metadata.
2) Design a data dictionary to establish a dimension management system; give a complete classification and expression of data mappings; summarize the storage objects of the data warehouse; and, based on this work, establish a universal model for metadata. The model offers several advantages, including unified columnar storage, support for slowly changing dimensions, decoupling of entity relationships, and operational storage support.
3) Study various open-source distributed software, including Flume-ng, Kafka, HBase, Phoenix, and Kettle, and on that basis design the "Metadata-Driven Incremental ETL System". The system is highly portable and can be applied to different log systems. In it, we develop a dimension data import policy to ensure that entity relationships can be decoupled efficiently, and implement automatic incremental data extraction with the Quartz scheduling framework. In addition, we establish a caching framework to improve the efficiency of system access during log conversion.
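
As a rough illustration of what metadata-driven incremental extraction can look like, the following Python sketch filters log records by a stored watermark and renames fields according to a mapping table. It is not the thesis's implementation, which the abstract says is built on Quartz and the listed distributed components. The names `MAPPING`, `extract_increment`, the `ts` field, and the sample records are all hypothetical.

```python
# Hypothetical mapping metadata: source log field -> warehouse column.
MAPPING = {"uid": "user_id", "act": "action", "ts": "event_time"}


def extract_increment(records, mapping, watermark, ts_field="ts"):
    """Return (rows, new_watermark) for records strictly newer than watermark.

    Only fields named in the mapping are kept, renamed to warehouse columns.
    """
    fresh = [r for r in records if r[ts_field] > watermark]
    rows = [
        {mapping[k]: v for k, v in r.items() if k in mapping}
        for r in fresh
    ]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((r[ts_field] for r in fresh), default=watermark)
    return rows, new_watermark


# First run: everything is newer than the initial watermark of 0.
logs = [
    {"uid": 1, "act": "login", "ts": 100},
    {"uid": 2, "act": "view", "ts": 105, "extra": "x"},
]
rows, wm = extract_increment(logs, MAPPING, watermark=0)

# Second run: only the record appended since the last watermark is extracted.
logs.append({"uid": 1, "act": "logout", "ts": 110})
rows2, wm2 = extract_increment(logs, MAPPING, watermark=wm)
```

In this sketch, changing the source system only requires a different mapping table, not different extraction code, which is one way to read the portability claim above.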
Keywords/Search Tags: User Behavior Analysis, Data Warehouse, Metadata, ETL