
Design And Implementation Of Distributed Stream Computing System Based On Spark

Posted on: 2020-07-31; Degree: Master; Type: Thesis
Country: China; Candidate: J C Wang; Full Text: PDF
GTID: 2518306104996159; Subject: Software engineering
Abstract/Summary:
The growing number of real-time scenarios, such as real-time ETL, complex event processing (CEP), and real-time statistical analysis, has laid the foundation for the development of stream computing. Most data processing systems of the previous-generation Hadoop ecosystem target offline computing scenarios, and their design cannot cope with real-time computing services. This thesis therefore designs and implements a Spark-based distributed stream computing system specifically for real-time scenarios; it satisfies most real-time scenarios and has been applied in multiple production environments.

Building on Spark's task scheduling engine and task execution engine, the system develops middle-layer operators for real-time ETL and real-time statistical analysis: a stream-stream JOIN operator, a JOIN operator for streams and large tables (Global Lookup Join), a JOIN operator for streams and small tables (Map Join), a Group By operator, and an Order By operator. The system also provides a CEP operator dedicated to processing complex events. To ensure that these operators run continuously, 24 hours a day and 7 days a week, in a distributed environment, a distributed fault-tolerance system based on distributed snapshots was developed. In a cluster that maintains a large number of stream tasks, the cluster state is critical to the operation of the cluster. To expose the internal state of the system to third-party monitoring and visualization systems or to Studio (the application development and monitoring tool of the stream computing system), a task indicator and status monitoring system was developed; it provides indicators and status in three ways: an Akka API, a RESTful API, and Report.

This thesis describes the design and development of the above modules in detail in terms of business scenarios, system design goals, system architecture design, and detailed implementation. Functional and performance testing shows that the system functions correctly and that its performance meets daily business needs, in line with the original design goals. With this system, customers can develop stream computing applications using only SQL, which simplifies the development of stream computing tasks and improves development efficiency.
Keywords/Search Tags:Stream computing, Scheduling engine, Distributed snapshot, Watermark, Complex event processing