
Design And Implementation Of Kafka-based Full-Link Stream Data Processing Platform

Posted on: 2019-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Xu
Full Text: PDF
GTID: 2428330545453704
Subject: Computer technology
Abstract/Summary:
With the arrival of the Web 4.0 era and the globalization of IT services, data volume is growing at a phenomenal rate. Such large, rapidly arriving data has forced changes in data storage and processing models: the traditional database approach accumulates, processes, and stores data in daily batches, with computing cycles of hours or even days. Stream computing emerged to address this. As a real-time computation model for streaming data, it shortens data latency, evaluates logic in real time, and parallelizes computation, so that the business need for real-time processing of big data can be met effectively. On the other hand, because the upstream and downstream data of a stream computation may come from different enterprises or IT systems, neither homogeneous storage media nor a consistent data format can be guaranteed. We therefore introduce Apache Kafka, a real-time message queue, and build a "data highway" on top of it so that users can easily access all types of data and achieve end-to-end stream data integration.

The full-link stream data processing platform designed in this thesis adopts an architecture based on stream processing and containers to meet the requirements of a wide range of application scenarios in a unified way. It provides heterogeneous database replication, real-time synchronization, exchange and integration, ETL, and other features covering a variety of data exchange, synchronization, and integration scenarios. The platform consists of three layers: a physical layer that hosts the big data platform, a service layer that processes the data, and an application layer that serves multiple big data businesses. This thesis mainly concerns the implementation of the service layer and the deployment of the physical layer. The service layer comprises data integration, data cleaning, data processing, and data storage:

1. Data Integration: Kafka is introduced as a data bus. It supports pluggable connectors; source connectors attach heterogeneous storage media (such as MySQL, MongoDB, and HDFS) and provide the data foundation for stream computation.

2. Data Cleaning: data ingested directly from sources often cannot be used for processing or analysis as-is. This thesis therefore designs a data cleaning module that performs ETL operations on the data (including filter, union, add, and sum) to complete the preparation for data analysis.

3. Data Processing: the Spark Streaming framework incrementally consumes data from Kafka and returns the processed results to Kafka for temporary storage.

4. Data Storage: according to the storage methods offered by the different downstream IT systems, sink connectors deliver the real-time data to each business system.

The platform uses Docker containers as the underlying platform for stream computing and Kubernetes as the container management and scheduling system, enabling rapid deployment, easy migration, and maximized resource utilization. Since the overall workload is large, this thesis focuses on my main contributions, as follows:

1. Design and implement Kafka-centric source and sink connectors to attach different data sources. Kafka Connect has two forms of connector: source connectors, which import data from other systems, and sink connectors, which export data. I manage Kafka Connect and implement a unified schema to parse data sources intelligently.

2. Data processing based on Spark Streaming. With Kafka as the real-time messenger, Spark Streaming processes the streaming data, applying different algorithms to different business needs, and the intermediate results are returned to Kafka for temporary storage.

3. Deploy the platform. The full-link stream data processing platform is implemented on Kubernetes and deployed in Docker containers.

4. Simulate a real-time order-trading scenario to test the availability and efficiency of the platform.
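To illustrate the data integration step, the following is a minimal sketch of how a source connector might be registered with Kafka Connect's REST API. It assumes Confluent's JDBC source connector is installed on the Connect worker; the connector name, connection URL, and table are illustrative, not from the thesis.

```json
{
  "name": "mysql-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://mysql:3306/shop",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "shop-",
    "tasks.max": "1"
  }
}
```

Posting this JSON to the worker's `/connectors` endpoint would start streaming new rows of the `orders` table into the Kafka topic `shop-orders`, which downstream consumers such as Spark Streaming can then read.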
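The deployment step can be sketched as a Kubernetes Deployment that runs Connect workers in Docker containers; the name, image tag, and environment variable values below are assumptions for illustration, not the thesis's actual manifests.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-connect-worker        # hypothetical name
spec:
  replicas: 2                       # Kubernetes scales and reschedules workers
  selector:
    matchLabels:
      app: kafka-connect
  template:
    metadata:
      labels:
        app: kafka-connect
    spec:
      containers:
        - name: connect-worker
          image: confluentinc/cp-kafka-connect:7.6.0   # example image/tag
          ports:
            - containerPort: 8083   # Connect REST API
          env:
            - name: CONNECT_BOOTSTRAP_SERVERS
              value: "kafka:9092"   # assumes an in-cluster Kafka service
```

Running the workers as a Deployment lets Kubernetes handle the rapid deployment, migration, and resource scheduling that the platform relies on.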
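The cleaning module's ETL primitives (filter, union, sum) can be sketched in plain Python over in-memory records; the field names `user` and `amount` are hypothetical, chosen only to mirror the order-trading scenario.

```python
# Sketch of the cleaning stage's ETL primitives applied to in-memory records.
# Field names ("user", "amount") are illustrative, not from the thesis.

def etl_filter(records, predicate):
    """Keep only records matching the predicate (the 'filter' operation)."""
    return [r for r in records if predicate(r)]

def etl_union(*streams):
    """Merge several record streams into one (the 'union' operation)."""
    merged = []
    for stream in streams:
        merged.extend(stream)
    return merged

def etl_sum(records, key_field, value_field):
    """Group records by a key and sum a numeric field (the 'sum' operation)."""
    totals = {}
    for r in records:
        totals[r[key_field]] = totals.get(r[key_field], 0) + r[value_field]
    return totals

if __name__ == "__main__":
    a = [{"user": "u1", "amount": 10}, {"user": "u2", "amount": 5}]
    b = [{"user": "u1", "amount": 7}]
    merged = etl_union(a, b)
    large = etl_filter(merged, lambda r: r["amount"] >= 5)
    print(etl_sum(large, "user", "amount"))  # {'u1': 17, 'u2': 5}
```

In the real platform these operations would run inside the streaming job over micro-batches rather than over Python lists, but the per-record semantics are the same.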
Keywords/Search Tags:Spark Streaming, Kafka, Full-link Stream Data Process, Kubernetes, Docker
PDF Full Text Request
Related items