
Design And Implementation Of Kafka-based Full-Link Stream Data Processing Platform

Posted on: 2019-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Xu
Full Text: PDF
GTID: 2428330545453704
Subject: Computer technology
Abstract/Summary:
With the arrival of the Web 4.0 era and the globalization of IT services, data volume is growing at a phenomenal rate. Such large, rapidly arriving data has forced changes in data storage and processing models: the traditional database approach accumulates, processes, and stores data in daily batches, with computing cycles of hours or even days. Stream computing emerged to address this. As a real-time computation model for streaming data, it shortens data latency, evaluates logic in real time, and parallelizes computation, so that the business need for real-time processing of big data can be met effectively. On the other hand, because the upstream and downstream data of a stream computation may come from different enterprises or IT systems, neither homogeneous storage media nor a consistent data format can be guaranteed. We therefore introduce Apache Kafka, a real-time message queue, and build a "data highway" on top of it so that users can easily access all types of data and achieve end-to-end stream data integration.

The full-link stream data processing platform designed in this thesis adopts an architecture based on stream processing and containers to meet the requirements of a wide range of application scenarios in a unified way. It provides heterogeneous database replication, real-time synchronization, exchange and integration, ETL, and other features covering a variety of data exchange, synchronization, and integration scenarios. The platform consists of three layers: a physical layer that hosts the big data platform, a service layer that processes the data, and an application layer that serves multiple big data businesses. This thesis mainly concerns the implementation of the service layer and the deployment of the physical layer. The service layer comprises data integration, data cleaning, data processing, and data storage:

1. Data Integration: Kafka is introduced as a data bus. It supports pluggable connectors; source connectors attach heterogeneous storage media (such as MySQL, MongoDB, and HDFS) and provide the data foundation for stream computation.

2. Data Cleaning: data ingested directly from sources often cannot be used for processing or analysis as-is. This thesis therefore designs a data cleaning module that performs ETL operations on the data (including filter, union, add, and sum) to complete the preparation for data analysis.

3. Data Processing: the Spark Streaming framework incrementally consumes data from Kafka and returns the processed results to Kafka for temporary storage.

4. Data Storage: according to the storage methods offered by the different downstream IT systems, sink connectors deliver the real-time data to each business system.

The platform uses Docker containers as the underlying platform for stream computing and Kubernetes as the container management and scheduling system, enabling rapid deployment, easy migration, and maximized resource utilization. Since the overall workload is large, this thesis focuses on my main contributions, as follows:

1. Design and implement Kafka-centric source and sink connectors to attach different data sources. Kafka Connect has two forms of connector: source connectors, which import data from other systems, and sink connectors, which export data. I manage Kafka Connect and implement a unified schema to parse data sources intelligently.

2. Data processing based on Spark Streaming. With Kafka as the real-time messenger, Spark Streaming processes the streaming data, applying different algorithms to different business needs, and the intermediate results are returned to Kafka for temporary storage.

3. Deploy the platform. The full-link stream data processing platform is implemented on Kubernetes and deployed in Docker containers.

4. Simulate a real-time order-trading scenario to test the availability and efficiency of the platform.
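To illustrate the data integration step, the following is a minimal sketch of how a source connector might be registered with Kafka Connect's REST API. It assumes Confluent's JDBC source connector is installed on the Connect worker; the connector name, connection URL, and table are illustrative, not from the thesis.

```json
{
  "name": "mysql-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://mysql:3306/shop",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "shop-",
    "tasks.max": "1"
  }
}
```

Posting this JSON to the worker's `/connectors` endpoint would start streaming new rows of the `orders` table into the Kafka topic `shop-orders`, which downstream consumers such as Spark Streaming can then read.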
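The deployment step can be sketched as a Kubernetes Deployment that runs Connect workers in Docker containers; the name, image tag, and environment variable values below are assumptions for illustration, not the thesis's actual manifests.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-connect-worker        # hypothetical name
spec:
  replicas: 2                       # Kubernetes scales and reschedules workers
  selector:
    matchLabels:
      app: kafka-connect
  template:
    metadata:
      labels:
        app: kafka-connect
    spec:
      containers:
        - name: connect-worker
          image: confluentinc/cp-kafka-connect:7.6.0   # example image/tag
          ports:
            - containerPort: 8083   # Connect REST API
          env:
            - name: CONNECT_BOOTSTRAP_SERVERS
              value: "kafka:9092"   # assumes an in-cluster Kafka service
```

Running the workers as a Deployment lets Kubernetes handle the rapid deployment, migration, and resource scheduling that the platform relies on.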
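The cleaning module's ETL primitives (filter, union, sum) can be sketched in plain Python over in-memory records; the field names `user` and `amount` are hypothetical, chosen only to mirror the order-trading scenario.

```python
# Sketch of the cleaning stage's ETL primitives applied to in-memory records.
# Field names ("user", "amount") are illustrative, not from the thesis.

def etl_filter(records, predicate):
    """Keep only records matching the predicate (the 'filter' operation)."""
    return [r for r in records if predicate(r)]

def etl_union(*streams):
    """Merge several record streams into one (the 'union' operation)."""
    merged = []
    for stream in streams:
        merged.extend(stream)
    return merged

def etl_sum(records, key_field, value_field):
    """Group records by a key and sum a numeric field (the 'sum' operation)."""
    totals = {}
    for r in records:
        totals[r[key_field]] = totals.get(r[key_field], 0) + r[value_field]
    return totals

if __name__ == "__main__":
    a = [{"user": "u1", "amount": 10}, {"user": "u2", "amount": 5}]
    b = [{"user": "u1", "amount": 7}]
    merged = etl_union(a, b)
    large = etl_filter(merged, lambda r: r["amount"] >= 5)
    print(etl_sum(large, "user", "amount"))  # {'u1': 17, 'u2': 5}
```

In the real platform these operations would run inside the streaming job over micro-batches rather than over Python lists, but the per-record semantics are the same.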
Keywords/Search Tags:Spark Streaming, Kafka, Full-link Stream Data Process, Kubernetes, Docker
PDF Full Text Request
Related items