With the growth of Internet data, traditional extract, transform, and load (ETL) tools can no longer meet the requirements of massive data processing. Distributed computing has become the main solution for processing data at this scale, so a distributed ETL tool built on a distributed computing framework has important practical significance.

The distributed ETL tool is logically divided into two parts: the execution engine and the data interface layer. The execution engine offers flexible ETL task processing and can make full use of Spark's various data processing models, including Spark Streaming and Spark SQL. It supports multiple types of data processing, user-defined ETL processes, and custom transform functions written in Java source code. The data interface layer, which consists of a heterogeneous data adaptation layer and a data processing layer, shields the differences between data sources and provides a unified data interface to the process nodes.

Another problem with massive data is extraction: when the data volume is huge, a full extraction inevitably takes a long time and imposes excessive overhead on the system. A real-time incremental migration process based on Spark Streaming is therefore designed to optimize incremental data extraction and combined loading. Because the records read from the log file of the source database system are correlated along the time line, the sequential read module of the incremental extraction node reads records from the data source in order and pushes each record into the Kafka cluster with an auto-increment sequence number.
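The sequential read step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Kafka producer is replaced by a plain callback (`producer_send`), and the record schema is invented for the example.

```python
import itertools
import json

def read_log_sequentially(log_records, producer_send):
    """Read change records from the source database log in order and
    push each one onward with an auto-increment sequence number.
    `producer_send` is a stand-in for a Kafka producer's send() call."""
    seq = itertools.count(1)  # auto-increment sequence number, starting at 1
    for record in log_records:
        message = {"seq": next(seq), "record": record}
        producer_send(json.dumps(message))

# In-memory stand-in for the Kafka topic (a real deployment would use a
# Kafka producer client instead of a Python list).
topic = []
log = [
    {"op": "INSERT", "table": "orders", "pk": 101, "values": {"amount": 5}},
    {"op": "UPDATE", "table": "orders", "pk": 101, "values": {"amount": 7}},
]
read_log_sequentially(log, topic.append)
```

Because the log is read strictly in order and each message carries a monotonically increasing sequence number, downstream consumers can restore the original time-line order even after parallel reads.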
The system uses a Spark cluster to read data from the Kafka cluster in parallel and transform the original records into JSON strings. Since changes to different records are independent of one another, different records can be loaded in parallel. During loading, the data in the Spark cluster is hashed according to database_name.table_name.primary_key, so that records with the same primary key are placed in the same RDD, and multiple SQL operations on the same key are combined according to their operation sequence to reduce the connection overhead of the destination database.

Finally, the distributed ETL tool based on the Spark framework is tested. The experimental results show that the system performs ETL operations on massive data correctly, extracts data from the database incrementally, and scales well.
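The hashing and operation-combining step can be sketched in plain Python. This is an illustrative sketch under stated assumptions: a dict of buckets stands in for Spark's keyed RDD partitioning, and the merge rules in `combine_ops` (fold UPDATE into a pending INSERT, cancel INSERT+DELETE) are plausible examples, not the paper's exact algorithm.

```python
from collections import defaultdict

def hash_key(rec):
    # Partition key in the form database_name.table_name.primary_key,
    # so all changes to one row land in the same bucket (RDD partition).
    return f"{rec['db']}.{rec['table']}.{rec['pk']}"

def combine_ops(ops):
    """Collapse a time-ordered list of SQL operations on one row into a
    single equivalent operation (assumed merge rules for illustration)."""
    result = None
    for op in ops:
        kind = op["op"]
        if kind == "DELETE":
            # An INSERT later deleted in the same batch cancels out.
            result = None if result and result["op"] == "INSERT" else op
        elif kind == "UPDATE" and result is not None:
            # Fold the update's values into the pending INSERT/UPDATE.
            result = {**result,
                      "values": {**result.get("values", {}), **op["values"]}}
        else:
            result = op
    return result

changes = [
    {"db": "shop", "table": "orders", "pk": 1, "op": "INSERT", "values": {"amount": 5}},
    {"db": "shop", "table": "orders", "pk": 1, "op": "UPDATE", "values": {"amount": 7}},
    {"db": "shop", "table": "orders", "pk": 2, "op": "INSERT", "values": {"amount": 3}},
    {"db": "shop", "table": "orders", "pk": 2, "op": "DELETE"},
]

# Group by hash key (stand-in for hashing records into the same RDD),
# then combine each row's operations into one SQL statement.
buckets = defaultdict(list)
for rec in changes:
    buckets[hash_key(rec)].append(rec)
combined = {key: combine_ops(ops) for key, ops in buckets.items()}
```

Combining operations this way means the destination database sees at most one statement per row per batch, which is what reduces the connection and execution overhead.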