
Design And Implementation Of A Heterogeneous Data Source Exchange System Based On Spark

Posted on: 2020-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: J X Ren
Full Text: PDF
GTID: 2428330578954180
Subject: Software engineering
Abstract/Summary:
The era of big data means that more and more data must be stored and used. To serve a variety of business scenarios, the same data often has to be stored in different database tables or in several different data sources. Choosing the right data source for reading and writing according to the type of business can fundamentally improve the overall performance of the system. Traditional data exchange between heterogeneous data sources relies mainly on connections made through data source drivers, with multithreading used to read data from one data source and write it to another. This combination of driver-based connections and multithreading cannot cope with exchange tasks that involve large data volumes. In particular, as distributed storage technologies such as the distributed file system HDFS become widely used in enterprises, traditional exchange technologies no longer meet business requirements in either exchange speed or coverage of data source types.

Therefore, this thesis designs a heterogeneous data source exchange system built on the Spark distributed computing framework. The work focuses on four aspects. First, Spark is used to build the execution engine for data exchange. The data source connectors provided by Spark read data from different data sources; the data is then cleaned, filtered, and written to another data source, and Spark performs the entire process from reading to writing in parallel. Second, a high-performance, highly available, and scalable scheduling system is built for data exchange tasks. The scheduling system is deployed across multiple nodes, each of which manages a portion of the data exchange tasks. Each data exchange task runs as an application on the resource manager Yarn; the scheduling node submits the task to Yarn for scheduling and execution and updates the task's execution status. Third, an interactive page is built for the parameters used by data exchange tasks. The environment configuration, task parameters, cluster configuration, and other information needed to execute a task are all set through parameters received from the page, which makes the whole system more flexible. Fourth, based on the detailed design of each functional module and of the database tables used for data storage, a working data exchange system is implemented, and the performance of its modules is tested.
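The scheduling flow described in the abstract (page parameters in, one Yarn application per task, each scheduling node owning a share of the tasks) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `ExchangeTask` fields, the `owner_node` hashing scheme, and the positional job arguments are all hypothetical. Only the `spark-submit` options shown (`--master`, `--deploy-mode`, `--name`, `--class`, `--num-executors`, `--executor-memory`) are standard Spark options.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ExchangeTask:
    """Parameters a user might enter on the configuration page (hypothetical fields)."""
    task_id: str
    source_format: str           # e.g. "jdbc"
    target_format: str           # e.g. "parquet"
    app_jar: str                 # jar that contains the Spark exchange job
    main_class: str              # entry point of the exchange job
    num_executors: int = 2
    executor_memory: str = "2g"
    status: str = "PENDING"      # updated by the scheduling node as the job runs


def owner_node(task_id: str, nodes: Sequence[str]) -> str:
    """Assign a task to one scheduling node with a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def spark_submit_args(task: ExchangeTask) -> List[str]:
    """Build the command that would run the task as its own application on Yarn."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", "exchange-" + task.task_id,
        "--class", task.main_class,
        "--num-executors", str(task.num_executors),
        "--executor-memory", task.executor_memory,
        task.app_jar,
        # Positional arguments read by the (hypothetical) exchange job.
        task.source_format,
        task.target_format,
    ]
```

A scheduling node could pass the resulting list to `subprocess.run` and set `task.status` from the outcome. Hashing the task ID keeps the assignment stable across restarts, although a real deployment would also need to reassign tasks when a node fails.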
Keywords/Search Tags:Data exchange, heterogeneous data source, Spark, task scheduling, distributed computing