
A Distributed Cache And Analysis Platform For Large Scale Streaming Data Based On Kafka

Posted on: 2017-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: M Niu
Full Text: PDF
GTID: 2348330512954812
Subject: Software engineering
Abstract/Summary:
In recent years, the continuous development of information technology and Internet applications has driven explosive growth in the global scale of data, ushering in the era of big data. This will not only bring major changes to scientific and technological research but also profoundly affect every aspect of daily life.

In big data analysis and computing, distributed computer clusters are increasingly widely used because of their low cost, high capacity, and good scalability. At the same time, the structure of the data analyzed by such clusters is becoming more and more diverse. With the growth of e-commerce, the Internet of Things, and Internet finance, almost any distributed cluster may have to process dynamic streaming data, such as monitoring and transmission data, alongside the runtime log files generated by the system. Data with different structural characteristics calls for different analysis and computation methods: dynamic streaming data places higher demands on real-time processing and on handling diverse data, whereas batch jobs place higher demands on system throughput and resource utilization. Most mainstream distributed cluster systems, such as Hadoop, Storm, and S4, are each suited to analyzing only one specific kind of data and cannot adapt to a setting in which several types of data structure coexist.

This thesis presents a distributed cache and analysis platform for massive streaming data, built on the distributed messaging system Kafka. The design goal of the platform is to cache and organize the massive streaming data fed into the system. It provides an online stream processing unit and an offline batch processing unit, and selects the appropriate one according to the type of the data.
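The selection between the online and offline units can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Record` and `Dispatcher` names, the `kind` tag, and the rule that only records tagged `"stream"` go to the online unit are all assumptions made for illustration, and no Kafka client is involved.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Record:
    kind: str      # hypothetical type tag, e.g. "stream" or "log"
    payload: str


@dataclass
class Dispatcher:
    """Routes each record to an online handler or an offline batch buffer."""
    online: Callable[[Record], None]                 # called immediately for streaming data
    offline_buffer: List[Record] = field(default_factory=list)
    stream_kinds: frozenset = frozenset({"stream"})  # assumed routing rule

    def dispatch(self, record: Record) -> str:
        # Streaming data is handled right away; everything else waits for a batch job.
        if record.kind in self.stream_kinds:
            self.online(record)
            return "online"
        self.offline_buffer.append(record)
        return "offline"


if __name__ == "__main__":
    handled: List[str] = []
    d = Dispatcher(online=lambda r: handled.append(r.payload))
    print(d.dispatch(Record("stream", "sensor reading")))
    print(d.dispatch(Record("log", "runtime log line")))
    print(handled, len(d.offline_buffer))
```

Keeping the routing rule in one place means a new data type only requires a change to `stream_kinds`, not to either processing unit.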
The main advantages of the cache and analysis platform are as follows: (1) A distributed messaging system caches the large-scale streaming data, which improves the platform's ability to adapt to sudden changes in input data volume. (2) Online real-time processing units and an offline batch processing unit are designed and implemented for data with different characteristics in the cluster, so as to meet the differing requirements of each data type for real-time computation and system throughput. (3) The platform uses centralized management: the information and status of nodes in the different modules are synchronized to the management module to keep node information consistent across the platform.

This thesis describes the high-level design of the platform architecture in detail. The platform is divided into three main functional modules: a cache subscription module, an online real-time processing module, and a system management module. Based on this design, we implement a prototype of the cache and analysis platform on top of Kafka. Finally, we verify the usability, scalability, and efficiency of the platform. Through the design and implementation of the platform, this thesis hopes to offer new methods for building distributed computing clusters and for processing massive streaming data. The platform model is to be improved in future research.
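The centralized management in advantage (3) can be sketched as a single registry that every node reports to. This is an illustrative assumption rather than the thesis's design: the `ManagementModule` class, its `report` and `live_nodes` methods, and the timeout used to treat silent nodes as stale are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeStatus:
    node_id: str
    module: str          # e.g. "cache" or "online" (hypothetical module names)
    state: str
    reported_at: float   # seconds, from the clock passed to report()


class ManagementModule:
    """Holds the single authoritative view of node status for the platform."""

    def __init__(self, timeout_s: float = 10.0) -> None:
        self._nodes: Dict[str, NodeStatus] = {}
        self._timeout_s = timeout_s  # assumed staleness threshold

    def report(self, node_id: str, module: str, state: str,
               now: Optional[float] = None) -> None:
        # A later report from the same node overwrites the earlier one.
        ts = time.time() if now is None else now
        self._nodes[node_id] = NodeStatus(node_id, module, state, ts)

    def live_nodes(self, module: str, now: Optional[float] = None) -> List[str]:
        # A node counts as live if it reported within the timeout window.
        ts = time.time() if now is None else now
        return sorted(
            n.node_id for n in self._nodes.values()
            if n.module == module and ts - n.reported_at <= self._timeout_s
        )


if __name__ == "__main__":
    mgmt = ManagementModule(timeout_s=10.0)
    mgmt.report("cache-1", "cache", "running", now=0.0)
    mgmt.report("cache-2", "cache", "running", now=5.0)
    print(mgmt.live_nodes("cache", now=12.0))
```

Because all modules read node information from the same registry, they cannot disagree about which nodes exist, which is the consistency property the abstract attributes to centralized management.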
Keywords/Search Tags: Distributed, Streaming Data, Online Processing, Kafka, Hadoop