
Research And Implementation For Distributed Data Replay System

Posted on: 2019-02-27    Degree: Master    Type: Thesis
Country: China    Candidate: Y F Zhou    Full Text: PDF
GTID: 2348330545975248    Subject: Computer Science and Technology
Abstract/Summary:
In big data applications in particular areas, such as the financial and securities industry, there are many demands for data replay services over large-scale static historical data. In a replay application, data are first queried and loaded from underlying storage systems and then processed with operations specified by users. Finally, the results are transformed into a live data stream and replayed to the upper-layer applications. For example, stock securities trading and online e-business services need data replay services to conduct system testing or ex-post review and analysis. However, such data replay systems are lacking in these areas.

As discussed above, a historical data replay system is a special kind of system: it needs not only complex query abilities but also stream-out processing abilities, which distinguishes it from both stream processing systems and database systems. Thus, existing systems of both kinds lack the data replay ability. Stream processing systems are internally designed for processing dynamic data streams with stream-in semantics, and thus cannot perform complex replay jobs over static historical data. Database systems support easy-to-use complex queries but lack the stream processing abilities required in replay services. In conclusion, it is necessary to research and implement a data replay system that provides replay services over static historical data.

Based on the background and requirements discussed above, this paper designed a general data replay model and framework, and, on top of this model and framework, designed and implemented an efficient distributed data replay system. The main work and contributions of this paper are as follows:

(1) To support flexible semantics and logic in data replay processing, this paper proposed a general data replay model and framework. The model and framework combine stream processing abilities with complex query abilities. Built-in replay operators are also designed to express the flexible and complex operations of upper-layer replay applications.

(2) Based on the above replay model and framework, this paper designed and implemented an efficient distributed data replay system called Penguin. Penguin allows developers to build high-throughput replay applications that query and replay large-scale static historical data from multiple underlying data sources. Besides, Penguin provides high and stable QoS for replay services with tunable replay speeds.

(3) To further improve data replay performance, this paper proposed two system-level optimizations: caching the results of loading tasks, and loading a single replay data stream in parallel.

(4) Experimental results on replaying millions of records demonstrate that Penguin achieves up to 2.5x and 47x speedup in data preparation, and up to 8x and 7x speedup in replay speed, compared with Apache Phoenix and Apache Hive, respectively.

(5) As a case study, Penguin has been deployed in the production environment of Huatai Securities to provide online historical stock data replay services to a large number of stock market users.

In summary, this paper proposed a big data replay computing model and designed and implemented an efficient distributed data replay system. The system has been deployed and used in real-world industrial applications in the securities area. Due to its novelty, a research paper based on this system has been accepted by the TPDS journal (a CCF A journal).
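The pipeline the abstract describes, loading partitions of a static historical stream in parallel and then replaying the merged records at a tunable speed, can be illustrated with a minimal sketch. This is an assumption-laden toy, not Penguin's actual API: the record format, `load_partition`, `replay`, and the sample data are all hypothetical, and inter-record pacing is done with simple sleeps scaled by the replay speed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record format: (timestamp_seconds, payload).
HISTORY = [
    (0.00, "trade A"),
    (0.05, "trade B"),
    (0.20, "trade C"),
]

def load_partition(partition):
    # Stand-in for querying one partition of historical data from an
    # underlying storage system; here it just sorts records by timestamp.
    return sorted(partition, key=lambda rec: rec[0])

def replay(records, sink, speed=1.0):
    """Emit records as a live stream, reproducing the original
    inter-record gaps divided by a tunable replay speed."""
    prev_ts = None
    for ts, payload in records:
        if prev_ts is not None:
            time.sleep((ts - prev_ts) / speed)
        prev_ts = ts
        sink(payload)

# Load two partitions of a single replay stream in parallel,
# then merge them by timestamp before streaming out.
parts = [HISTORY[:2], HISTORY[2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    loaded = list(pool.map(load_partition, parts))
merged = sorted((rec for part in loaded for rec in part), key=lambda r: r[0])

out = []
replay(merged, out.append, speed=10.0)  # replay 10x faster than real time
```

A real system would replace the in-memory partitions with query results from the underlying data sources and the `sink` callback with a stream producer; the point here is only how parallel loading and speed-scaled pacing fit together.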
Keywords/Search Tags:Data replay model, Data query, Stream processing, Data replay system, Distributed processing