With the growth of Internet data, traditional extract, transform, and load (ETL) tools can no longer meet the requirements of massive data processing. Distributed computing has become the main solution for processing data at this scale, so a distributed ETL tool built on a distributed computing framework has important practical significance.

The distributed ETL tool is logically divided into two parts: the execution engine and the data interface layer. The execution engine offers flexible ETL task processing and can make full use of Spark's various data processing models, including Spark Streaming and Spark SQL. It supports multiple types of data processing, user-defined ETL processes, and custom transform functions written in Java source code. The data interface layer, which consists of a heterogeneous data adaptation layer and a data processing layer, shields the differences between data sources and provides a unified data interface to the process nodes.

Another problem with massive data is extraction: when the data volume is huge, a full extraction inevitably takes a long time and imposes excessive overhead on the system. A real-time incremental migration process based on Spark Streaming is therefore designed to optimize incremental data extraction and combined loading. Because the records read from the log file of the source database system are correlated along the time line, the sequential read module of the incremental extraction node reads records from the data source in order and pushes each record into the Kafka cluster with an auto-increment sequence number.
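The sequential read step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Kafka producer is replaced by a plain callback (`producer_send`), and the record schema is invented for the example.

```python
import itertools
import json

def read_log_sequentially(log_records, producer_send):
    """Read change records from the source database log in order and
    push each one onward with an auto-increment sequence number.
    `producer_send` is a stand-in for a Kafka producer's send() call."""
    seq = itertools.count(1)  # auto-increment sequence number, starting at 1
    for record in log_records:
        message = {"seq": next(seq), "record": record}
        producer_send(json.dumps(message))

# In-memory stand-in for the Kafka topic (a real deployment would use a
# Kafka producer client instead of a Python list).
topic = []
log = [
    {"op": "INSERT", "table": "orders", "pk": 101, "values": {"amount": 5}},
    {"op": "UPDATE", "table": "orders", "pk": 101, "values": {"amount": 7}},
]
read_log_sequentially(log, topic.append)
```

Because the log is read strictly in order and each message carries a monotonically increasing sequence number, downstream consumers can restore the original time-line order even after parallel reads.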
The system uses a Spark cluster to read data from the Kafka cluster in parallel and transform the original records into JSON strings. Since changes to different records are independent of one another, different records can be loaded in parallel. During loading, the data in the Spark cluster is hashed according to database_name.table_name.primary_key, so that records with the same primary key are placed in the same RDD, and multiple SQL operations on the same key are combined according to their operation sequence to reduce the connection overhead of the destination database.

Finally, the distributed ETL tool based on the Spark framework is tested. The experimental results show that the system performs ETL operations on massive data correctly, extracts data from the database incrementally, and scales well.
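The hashing and operation-combining step can be sketched in plain Python. This is an illustrative sketch under stated assumptions: a dict of buckets stands in for Spark's keyed RDD partitioning, and the merge rules in `combine_ops` (fold UPDATE into a pending INSERT, cancel INSERT+DELETE) are plausible examples, not the paper's exact algorithm.

```python
from collections import defaultdict

def hash_key(rec):
    # Partition key in the form database_name.table_name.primary_key,
    # so all changes to one row land in the same bucket (RDD partition).
    return f"{rec['db']}.{rec['table']}.{rec['pk']}"

def combine_ops(ops):
    """Collapse a time-ordered list of SQL operations on one row into a
    single equivalent operation (assumed merge rules for illustration)."""
    result = None
    for op in ops:
        kind = op["op"]
        if kind == "DELETE":
            # An INSERT later deleted in the same batch cancels out.
            result = None if result and result["op"] == "INSERT" else op
        elif kind == "UPDATE" and result is not None:
            # Fold the update's values into the pending INSERT/UPDATE.
            result = {**result,
                      "values": {**result.get("values", {}), **op["values"]}}
        else:
            result = op
    return result

changes = [
    {"db": "shop", "table": "orders", "pk": 1, "op": "INSERT", "values": {"amount": 5}},
    {"db": "shop", "table": "orders", "pk": 1, "op": "UPDATE", "values": {"amount": 7}},
    {"db": "shop", "table": "orders", "pk": 2, "op": "INSERT", "values": {"amount": 3}},
    {"db": "shop", "table": "orders", "pk": 2, "op": "DELETE"},
]

# Group by hash key (stand-in for hashing records into the same RDD),
# then combine each row's operations into one SQL statement.
buckets = defaultdict(list)
for rec in changes:
    buckets[hash_key(rec)].append(rec)
combined = {key: combine_ops(ops) for key, ops in buckets.items()}
```

Combining operations this way means the destination database sees at most one statement per row per batch, which is what reduces the connection and execution overhead.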