
Research And Implementation Of Test Data Processing System Based On Spark Streaming

Posted on: 2016-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: T X Li
Full Text: PDF
GTID: 2348330488974399
Subject: Software engineering
Abstract/Summary:
With the rapid development of technology, the amount of data generated in aerospace, a field at the forefront of science and technology, is growing exponentially, and processing speed is being challenged accordingly. In the early stage of this research, hundreds of GB of binary test data were parsed and computed on multiple machines with multiple threads using the MapReduce parallel computing framework, which performs better than the traditional single-machine multi-threaded approach. MapReduce largely resolved the high-latency problem, but several issues remain. First, uploading the whole package of binary data to the distributed file system before processing is time-consuming. Second, the calculation results cannot be displayed in real time. Third, for computations over large-scale tabular data, the system still reads data from HDFS with multiple threads on a single machine and then caches the computed results, which severely limits performance.

To solve these problems, this thesis first proposes a new framework based on a distributed architecture of Kafka, Spark Streaming and Redis, which are responsible for real-time data acquisition, analytic calculation and caching, respectively. Data collection is divided into three parts: the data producers, the message queue and the data consumers. The producers are the data collection points distributed across the different test fields. The message queue is Kafka, which collects large volumes of data from the subsystems at high throughput and low latency and reduces the complexity of the network. The consumer is Spark Streaming, which is responsible for the real-time analytic calculation.
The real-time analytic calculation engine takes the binary data from the different test fields as its input. It divides the data stream into consecutive data segments, one batch every 2 seconds, and converts each segment into a Resilient Distributed Dataset (RDD) that the engine can process. Real-time analytic calculation on the data stream is thereby transformed into Spark operations on RDDs. After the calculation, the system saves the computed data to Redis, a non-relational in-memory database. Redis provides a fast cache for the calculation results, avoids writing the data to hard disk, and makes real-time display of the results possible.

Second, based on the new framework, we analyzed and improved the performance of data collection and analytic calculation. On the message queue and data transmission side, we improve performance by dividing each Topic into multiple partitions and by caching and compressing the data to be transmitted. We also balance the size of Spark Streaming's receive window against the data receiving rate, use a Redis connection pool, and apply similar measures to optimize data consumption, analytic calculation and data caching.

Finally, the system is deployed and tested in an experimental environment. The tests validate that the architecture avoids the time-consuming step of uploading data and solves the problem that calculated results could not be displayed in real time. The test results show that the new system based on the stream computing framework performs much better than the MapReduce framework used in the early stage of the research, and that it achieves the intended purpose.
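
The 2-second batching step described above can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation and does not use Spark, Kafka or Redis: the record layout (a timestamp paired with a 2-byte big-endian value), the function names and the plain dictionary standing in for the Redis cache are all assumptions made for illustration.

```python
import struct
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BATCH_SECONDS = 2.0  # batch interval stated in the abstract


def micro_batches(
    records: Iterable[Tuple[float, bytes]],
    interval: float = BATCH_SECONDS,
) -> List[Tuple[int, List[bytes]]]:
    """Group (timestamp, payload) records into fixed-length time windows.

    Returns (window_index, payloads) pairs in time order. Windows that
    received no data are simply absent from the result.
    """
    buckets: Dict[int, List[bytes]] = defaultdict(list)
    for timestamp, payload in records:
        buckets[int(timestamp // interval)].append(payload)
    return [(index, buckets[index]) for index in sorted(buckets)]


def parse_batch(payloads: List[bytes]) -> List[int]:
    """Decode each payload as one unsigned 16-bit big-endian integer.

    The binary layout is hypothetical; the real test data format is not
    given in the abstract.
    """
    return [struct.unpack(">H", payload)[0] for payload in payloads]


def process_stream(records: Iterable[Tuple[float, bytes]]) -> Dict[str, List[int]]:
    """Parse every batch and store the results in an in-memory cache.

    The dict is a stand-in for the Redis cache described in the abstract.
    """
    cache: Dict[str, List[int]] = {}
    for index, payloads in micro_batches(records):
        cache[f"batch:{index}"] = parse_batch(payloads)
    return cache


if __name__ == "__main__":
    sample = [
        (0.1, b"\x00\x01"),
        (1.9, b"\x00\x02"),
        (2.0, b"\x00\x03"),
        (5.5, b"\x01\x00"),
    ]
    print(process_stream(sample))
```

In this sketch each window is processed as an independent unit, which mirrors the idea of turning stream operations into per-batch operations; in the described system each such batch would be an RDD processed in parallel across the cluster.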
Keywords/Search Tags: Binary Data, Spark Streaming, Kafka MQ, Redis Cache, Distributed System