
Design And Implementation Of The Enterprise Distributed Data Integration

Posted on: 2020-05-11    Degree: Master    Type: Thesis
Country: China    Candidate: X P He    Full Text: PDF
GTID: 2428330602950536    Subject: Engineering
Abstract/Summary:
In modern enterprises, data analysis and processing often require large amounts of data extraction, transformation, and loading (ETL), and ETL is a major solution for data integration. Because the enterprise's existing ETL system runs on a single machine and is prone to downtime, it can no longer meet the company's current business requirements. To address the downtime and usability problems in ETL operations, this thesis designs and implements an ETL data integration system that integrates various application data within the enterprise for data analysis; the results of that analysis are also provided to external systems through interfaces exposed by this system. Facing the problems of data source diversity, irregular data, and system stability during task execution in ETL data processing, this thesis applies a distributed service design concept to divide the system into three basic services. Using the big data processing capability of the Spark platform, the asynchronous decoupling capability of Kafka, and the data query capability of a search engine, it decouples ETL operations asynchronously, optimizes the data extraction schemes, resolves the data processing challenges, and implements an extensible distributed ETL data integration platform. The primary work is as follows:

(1) Requirement analysis and architecture design: this part details the requirement analysis of ETL operations and the distributed architecture design. Because the service life cycles are inconsistent, the platform is decoupled into three basic services; the decoupling principle and the relationships among the three services are described in detail.

(2) Detailed design of the three basic services: this part explains the design and implementation of the three basic services, namely the task scheduler, the execution engine, and the monitoring system. The task scheduler manages ETL tasks, including parsing each task into a DAG; the execution engine caches the Jobs parsed from the DAG graph; the monitoring system monitors the data sources and target sources during ETL execution and makes intelligent decisions. A minimal illustrative sketch of DAG-based task ordering is given after this abstract.

(3) Engineering tests: this part covers deployment of a test environment, unit tests, integration tests, distributed deployment tests, and algorithm tests.

The project adopts an agile development model and has completed the overall architecture design and two development iterations. The execution engine has completed the multi-threaded channel mode and scheduled tasks for common tasks using Spark cluster mode; these have been tested and entered the gray-release stage. For the monitoring part, this thesis designed and implemented database-related indicator monitoring and an intelligent decision algorithm for the first time and applied them to the whole project. In the next phase, the plan is to add machine learning algorithms for other channels, other types of templates, and some data processing.
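The abstract describes the task scheduler as parsing each ETL task into a DAG whose Jobs the execution engine then runs. The Python sketch below is a minimal illustration of ordering such a task graph before execution; it is not the thesis's actual implementation, and all stage names and the function itself are assumptions made for illustration only.

    # Hypothetical sketch: topological ordering of an ETL task DAG, roughly
    # analogous to how a task scheduler could parse a task into stages before
    # handing Jobs to an execution engine. Names are illustrative only.
    from collections import defaultdict, deque

    def topological_order(edges):
        """Return ETL stages in dependency order; edges maps stage -> downstream stages."""
        indegree = defaultdict(int)
        nodes = set(edges)
        for src, dsts in edges.items():
            for dst in dsts:
                indegree[dst] += 1
                nodes.add(dst)
        queue = deque(n for n in nodes if indegree[n] == 0)
        order = []
        while queue:
            node = queue.popleft()
            order.append(node)
            for dst in edges.get(node, []):
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    queue.append(dst)
        if len(order) != len(nodes):
            raise ValueError("ETL task graph contains a cycle")
        return order

    # Example: extract from two hypothetical sources, transform, then load to a target.
    dag = {
        "extract_mysql": ["transform"],
        "extract_kafka": ["transform"],
        "transform": ["load_search_engine"],
    }
    print(topological_order(dag))
    # e.g. ['extract_mysql', 'extract_kafka', 'transform', 'load_search_engine']

In the actual platform described above, the ordered Jobs would be cached by the execution engine and run on Spark, with Kafka used to decouple the stages asynchronously; the sketch only shows the dependency-ordering step.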
Keywords/Search Tags:Data Integration, DAG, ETL, Task Scheduler, Engine, Intelligent Monitoring