Design and Implementation of a Distributed ETL Solution for Industrial Big Data

Posted on: 2018-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: M G Cai
Full Text: PDF
GTID: 2348330536966521
Subject: Computer software and theory
Abstract/Summary:
With the arrival of Industry 4.0, Internet and computer technologies have become closely integrated with industrial systems, and industrial big data analytics has become a key source of competitive advantage in the global market. As the Internet and cyber-physical systems develop, more data can be collected, analyzed, and used to support decisions. Industrial data analysis depends on getting data from many sources, including real-time data from individual sensors, into the analysis system. This is the job of ETL (Extract-Transform-Load) tools. Many traditional ETL tools run in parallel on a single machine, so their processing speed and capacity cannot meet the requirements of industrial data analysis. Commercial ETL products perform well, but they are expensive and place high demands on hardware.

To address these problems, this thesis designs and implements a low-cost, high-performance distributed ETL system made up of three modules: data extraction, data transformation, and data loading. The extraction phase covers the data capture scheme, the data synchronization scheme, and the Pub/Sub communication model. The transformation phase is split by required processing speed into a batch processing layer and an acceleration layer. The batch processing layer, built mainly on Hadoop, handles historical data that does not need fast processing; the acceleration layer, built mainly on Spark Streaming, handles real-time data. The loading phase consists mainly of Sqoop and an HDFS client: Sqoop handles unstructured data, and the HDFS client handles structured data.

Finally, functional and performance tests are carried out on the distributed ETL system. The experimental results show that the system performs well when processing large volumes of industrial data. The design is of practical significance for the informatization of industrial data.
Keywords/Search Tags:Industrial Big Data, ETL, Distributed, Real-time, Hadoop, Spark
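
The three-stage flow described in the abstract (Pub/Sub extraction, a batch layer and an acceleration layer for transformation, then loading) can be illustrated with a minimal in-memory Python sketch. This is not the thesis implementation: it uses none of Hadoop, Spark Streaming, Sqoop, or HDFS, and every name in it (`Broker`, `SpeedLayer`, `BatchLayer`, the topic names, the volt-to-millivolt `transform`) is a hypothetical stand-in chosen only to show how records might be routed by latency requirement.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class Broker:
    """Tiny in-memory publish/subscribe hub (assumed stand-in for a real message system)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Record], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Record], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Record) -> None:
        # Deliver the record to every handler registered for this topic.
        for handler in self._subscribers[topic]:
            handler(record)


def transform(record: Record) -> Record:
    """Hypothetical transformation: convert an integer volt reading to millivolts."""
    return {"sensor": record["sensor"], "value_mv": record["value_v"] * 1000}


class SpeedLayer:
    """Transforms and loads each record as soon as it arrives (real-time path)."""

    def __init__(self, sink: List[Record]) -> None:
        self.sink = sink

    def on_record(self, record: Record) -> None:
        self.sink.append(transform(record))


class BatchLayer:
    """Buffers records and transforms them together later (historical path)."""

    def __init__(self, sink: List[Record]) -> None:
        self.sink = sink
        self.buffer: List[Record] = []

    def on_record(self, record: Record) -> None:
        self.buffer.append(record)

    def run_batch(self) -> int:
        # Transform and load everything buffered so far, then empty the buffer.
        count = len(self.buffer)
        self.sink.extend(transform(r) for r in self.buffer)
        self.buffer.clear()
        return count


def run_demo():
    broker = Broker()
    realtime_store: List[Record] = []    # stands in for the real-time load target
    historical_store: List[Record] = []  # stands in for the batch load target
    speed = SpeedLayer(realtime_store)
    batch = BatchLayer(historical_store)
    broker.subscribe("realtime", speed.on_record)
    broker.subscribe("historical", batch.on_record)

    # Extraction: sources publish records to a topic based on latency needs.
    broker.publish("realtime", {"sensor": "s1", "value_v": 3})
    broker.publish("historical", {"sensor": "s2", "value_v": 5})
    broker.publish("historical", {"sensor": "s3", "value_v": 7})

    pending = len(batch.buffer)  # historical records wait until the batch runs
    batch.run_batch()
    return realtime_store, historical_store, pending


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the real-time record reaches its store immediately on publish, while the two historical records stay buffered until `run_batch()` is called, which mirrors the split between low-latency and throughput-oriented processing that the abstract describes.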