Design And Implementation Of Distributed Stream Computing System Based On Spark

Posted on:2020-07-31

Degree:Master

Type:Thesis

Country:China

Candidate:J C Wang

Full Text:PDF

GTID:2518306104996159

Subject:Software engineering

Abstract/Summary:

The growing number of real-time scenarios, such as real-time ETL, complex event processing (CEP), and real-time statistical analysis, has laid the foundation for the development of stream computing. Most data processing systems of the previous-generation Hadoop ecosystem target offline computing scenarios, and their designs cannot cope with real-time computing workloads. This thesis therefore presents and implements a Spark-based distributed stream computing system built specifically for real-time scenarios; it can satisfy most such scenarios and has been applied in multiple production environments.

Building on Spark's task scheduling engine and task execution engine, the system develops operators for real-time ETL and real-time statistical analysis in the middle layer: a stream-to-stream Join operator, a Global Lookup Join operator for joining streams with large tables, a Map Join operator for joining streams with small tables, a Group By operator, and an Order By operator. The system also provides a CEP operator dedicated to processing complex events. To ensure that these operators run continuously (7×24 hours) in a distributed environment, a distributed fault-tolerance system based on distributed snapshots was developed. In a cluster that maintains a large number of stream tasks, the cluster state is critical to the operation of the cluster. To expose the internal state of the system to third-party monitoring and visualization systems or to Studio (the application development and monitoring tool of the stream computing system), a task metrics and status monitoring system was developed. It provides metrics and status in three ways: an Akka API, a RESTful API, and Report.

This thesis describes the design and development of the above modules in detail from several aspects: business scenario, system design goals, system architecture design, and detailed implementation. Rigorous functional and performance testing shows that the system functions correctly and that its performance meets daily business needs, in line with the original design goals. With this system, customers can develop stream computing applications using only SQL, which simplifies the development of stream computing tasks and improves development efficiency.
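
As an illustration only, the Map Join idea mentioned above (joining a stream with a small table) can be sketched as an in-memory hash lookup. The function name `map_join`, the dictionary record layout, and the inner-join semantics are assumptions made for this sketch; the abstract does not describe how the thesis actually implements the operator.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def map_join(stream_batch: Iterable[Row],
             small_table: Iterable[Row],
             key: str) -> Iterator[Row]:
    """Inner-join each stream record against a small in-memory table.

    Hypothetical sketch: the small table is indexed once by `key`, so each
    stream record is matched with a hash lookup and no shuffle is needed.
    """
    # Build the hash index over the small table (one list of rows per key value).
    index: Dict[Any, List[Row]] = {}
    for row in small_table:
        index.setdefault(row[key], []).append(row)

    # Probe the index for every incoming stream record.
    for event in stream_batch:
        for match in index.get(event.get(key), []):
            # Stream fields override table fields on a name clash.
            yield {**match, **event}


if __name__ == "__main__":
    users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
    events = [{"user_id": 1, "amount": 10}, {"user_id": 3, "amount": 5}]
    for joined in map_join(events, users, "user_id"):
        print(joined)
```

Stream records with no matching key are dropped here because the sketch assumes an inner join; a left join would instead emit the unmatched event unchanged.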

Keywords/Search Tags:

Stream computing, Scheduling engine, Distributed snapshot, Watermark, Complex event processing

Related items

1	Distributed Complex Event Stream Processing Engine Research
2	Research And Implementation Of Distributed Complex Event Processing System
3	The Design And Implementation Of Complex Event Processing Engine Based On Distributed Event Communication
4	Research On Complex Event Detection Technology Over Event Stream
5	Research On Key Technologies Of Distributed Rank-aware Query Processing
6	The Design And Implementation Of An Event Stream Processing System For Wireless Sensor Network
7	Design And Implementation Of Self-adjusting Dynamic Stream Processing Engine
8	Research On Complex Event Processing Techniques For Temporal Uncertainty Model
9	Financial Transactions Risk Early Warning Applications Based On Complex Event Processing
10	The Design And Implementation Of The D-Stream Stream Processing System Which Supports Dynamic Task Topology And Load Shedding