Design And Implementation Of A Heterogeneous Data Source Exchange System Based On Spark

Posted on:2020-04-27

Degree:Master

Type:Thesis

Country:China

Candidate:J X Ren

Full Text:PDF

GTID:2428330578954180

Subject:Software engineering

Abstract/Summary:

The advent of the big data era means that more and more data must be stored and used. To serve a variety of business scenarios, the same data often has to be stored in different database tables or in several different data sources. Choosing the right data source for reads and writes according to the type of business can fundamentally improve the overall performance of the system. Traditional data exchange between heterogeneous data sources relies mainly on data source driver connections, combined with multithreading to read data from one data source and transfer it to another. When dealing with large-scale data, this driver-plus-multithreading approach cannot handle exchange tasks over large data volumes. In particular, as distributed storage technologies such as the distributed file storage system HDFS become widely used in enterprises, traditional heterogeneous data exchange technologies cannot meet business requirements in terms of exchange speed and coverage of data source types. This thesis therefore proposes a heterogeneous data source exchange system built on the Spark distributed computing framework, focusing on the following four aspects.

First, the Spark distributed computing framework is used to build an execution engine for data exchange. The data source connectors provided by Spark read data from different data sources. The data is then cleaned, filtered, and written to another data source, and the entire process from reading to writing is executed by Spark in parallel.

Second, a high-performance, highly available, and scalable scheduling system is built for data exchange tasks. The scheduling system adopts a multi-node deployment mode in which each node manages a subset of the data exchange tasks. Each data exchange task runs as an application on the resource manager Yarn. The scheduling node submits the task to Yarn for scheduling and execution and updates the task's execution status.

Third, an interactive page is built for the parameters used by data exchange tasks. The environment configuration, task parameters, cluster configuration, and other information required to execute a data exchange task are all set through parameters received from the page, which increases the flexibility of the whole system.

Fourth, based on the detailed design of each functional module and the database table design used for data storage, a working data exchange system is implemented, and the performance of the system's modules is tested.
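
The multi-node scheduling idea in the second aspect can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how tasks are divided among scheduling nodes, so the stable-hash assignment, the status names, and every identifier here (ExchangeTask, owner_node, partition_tasks, run_node, fake_submit) are assumptions made for illustration. In particular, fake_submit is a hypothetical stand-in for submitting an application to Yarn.

```python
import zlib
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    # Assumed task lifecycle; the abstract only says the status is updated.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class ExchangeTask:
    task_id: str
    source: str  # illustrative label such as "mysql"
    target: str  # illustrative label such as "hdfs"
    status: Status = Status.PENDING


def owner_node(task_id: str, nodes: List[str]) -> str:
    """Pick the scheduling node for a task using a stable hash (assumed policy)."""
    return nodes[zlib.crc32(task_id.encode("utf-8")) % len(nodes)]


def partition_tasks(
    tasks: List[ExchangeTask], nodes: List[str]
) -> Dict[str, List[ExchangeTask]]:
    """Give every scheduling node its own subset of the tasks."""
    assignment: Dict[str, List[ExchangeTask]] = {node: [] for node in nodes}
    for task in tasks:
        assignment[owner_node(task.task_id, nodes)].append(task)
    return assignment


def run_node(
    tasks: List[ExchangeTask], submit: Callable[[ExchangeTask], bool]
) -> None:
    """Submit each task owned by one node and record its execution status."""
    for task in tasks:
        task.status = Status.RUNNING
        try:
            ok = submit(task)
        except Exception:
            ok = False  # a submission error is recorded as a failed task
        task.status = Status.SUCCEEDED if ok else Status.FAILED


def fake_submit(task: ExchangeTask) -> bool:
    """Hypothetical stand-in for a Yarn submission; rejects source == target."""
    return task.source != task.target
```

Because the hash depends only on the task identifier, a given task always maps to the same node while the node list is unchanged. A real deployment would also have to handle node failure and reassignment, which this sketch leaves out.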

Keywords/Search Tags:

Data exchange, heterogeneous data source, Spark, task scheduling, distributed computing

Related items

1	Research Of Task Scheduling Strategy For Heterogeneous Cluster In Spark Computing Environment
2	The Elastic Resource Allocation And Task Scheduling Of Spark
3	Research On Spark Task Scheduling Technology Based On Execution Time Prediction
4	Researches On Heterogeneous Computing System And Task Scheduling Algorithm Based On CPU-GPU-FPGA
5	Research And Application Of Energy Efficiency Model And Task Scheduling Based On Heterogeneous Spark Cluster
6	Research Of Task Partition And Resource Allocation Algorithms For Load Balance In Spark Computing Environment
7	Spark Task Scheduling With Data Skew And Deadline Constraints
8	Task Scheduling For Spark Application With Data Affinity In Heterogeneous Cluster
9	Heterogeneous Data Source Data Exchange Engine Design And Implementation
10	A Study Of Key Technologies About Task Scheduling On Distributed Stream Computing Platform