Design And Implementation Of Distributed ETL Solution For Industrial Big Data

Posted on:2018-01-30

Degree:Master

Type:Thesis

Country:China

Candidate:M G Cai

Full Text:PDF

GTID:2348330536966521

Subject:Computer software and theory

Abstract/Summary:

Since the advent of Industry 4.0, the Internet and computer technology have become closely integrated with industrial systems, and industrial big data analytics has become a key source of competitive advantage in the global market. With the development of the Internet and cyber-physical systems, more data can be collected, analyzed, and used to support decisions. In industrial data analysis, moving data from heterogeneous sources, and real-time data from individual sensors, into the analysis system is the foundation for all subsequent analysis. This requires ETL (Extract-Transform-Load) data processing tools. Many traditional ETL tools run in parallel only on a single machine, so their processing speed and capacity cannot meet the requirements of industrial data analysis. Commercial ETL tools perform well, but they are expensive and place high demands on hardware.

To address this, this thesis designs and implements a low-cost, high-performance distributed ETL system consisting of three modules: data extraction, data transformation, and data loading. The extraction phase covers the data capture scheme, the data synchronization scheme, and a Pub/Sub communication model. The transformation phase is divided, according to the required processing speed, into a batch layer and a speed layer. The batch layer, implemented mainly on Hadoop, processes historical data that has no strict latency requirement. The speed layer, implemented mainly on Spark Streaming, processes real-time data. The loading phase consists mainly of Sqoop and an HDFS client: Sqoop handles structured data, while the HDFS client handles unstructured data.

Finally, functional and performance tests are carried out on the distributed ETL system. The experimental results show that the system performs well when processing large volumes of industrial data. The design is of practical significance for the informatization of industrial data.
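
The extraction and transformation flow described in the abstract can be illustrated with a minimal in-memory sketch. It is not the thesis's implementation: the `Broker`, `BatchLayer`, `SpeedLayer`, and `transform` names, the topic names, and the record fields are all hypothetical. A plain Python list stands in for Hadoop batch storage, and an immediate function call stands in for Spark Streaming. It only shows how a Pub/Sub channel could route historical records to a batch layer and real-time records to a speed layer.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def transform(record: Record) -> Record:
    """Hypothetical cleaning step: normalize the sensor name, cast the value."""
    return {
        "sensor": record["sensor"].strip().lower(),
        "value": float(record["value"]),
    }


class Broker:
    """Tiny in-memory publish/subscribe broker (stand-in for a real message queue)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Record], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Record], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Record) -> None:
        # Deliver the record to every handler subscribed to this topic.
        for handler in self._subscribers[topic]:
            handler(record)


class BatchLayer:
    """Buffers historical records and transforms them together on demand."""

    def __init__(self) -> None:
        self.buffer: List[Record] = []

    def accept(self, record: Record) -> None:
        self.buffer.append(record)

    def run(self) -> List[Record]:
        # Transform everything buffered so far, then empty the buffer.
        result = [transform(r) for r in self.buffer]
        self.buffer.clear()
        return result


class SpeedLayer:
    """Transforms each real-time record as soon as it arrives."""

    def __init__(self) -> None:
        self.output: List[Record] = []

    def accept(self, record: Record) -> None:
        self.output.append(transform(record))


# Usage: wire both layers to the broker and publish one record per topic.
broker, batch, speed = Broker(), BatchLayer(), SpeedLayer()
broker.subscribe("historical", batch.accept)
broker.subscribe("realtime", speed.accept)
broker.publish("realtime", {"sensor": " Temp-1 ", "value": "21.5"})
broker.publish("historical", {"sensor": "TEMP-2", "value": 3})
batch_result = batch.run()
```

In this sketch the speed layer pays the transformation cost per record to keep latency low, while the batch layer defers the work until `run()` is called, which mirrors the speed-based split between the two layers that the abstract describes.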

Keywords/Search Tags:

Industrial Big Data, ETL, Distributed, Real-time, Hadoop, Spark

Related items

1	Application Research Of Real-time Data Analysis Based On Spark Computing
2	Implementation Of Industrial Big Data Monitoring And Analysis Platform Technology Based On Hadoop
3	Research And Implementation Of Spark Real-Time Recommendation System
4	Design And Implementation Of A Distributed And Real Time Video Stream Data Processing Platform Based On Spark
5	The Design And Implementation Of Log Real-time Analysis System Based On ELK Stack And Spark
6	Design And Implementation Of Data Real-time Analysis And Processing System Based On Spark
7	Design And Implementation Of Log Analysis System Of SIM Card Management Platform Based On Hadoop
8	The Design And Implementation Of A Real-time Query System For Massive Data Based On Spark
9	The Performances Of Distributed Big Data Processing Modes In High-speed Traffic Network
10	Design And Implementation Of Real-time Recommendation System Based On Spark