
Research And Implementation For Distributed Data Replay System

Posted on: 2019-02-27    Degree: Master    Type: Thesis
Country: China    Candidate: Y F Zhou    Full Text: PDF
GTID: 2348330545975248    Subject: Computer Science and Technology
Abstract/Summary:
In big data applications in particular areas, such as the financial and securities industry, there are many demands for data replay services over large-scale static historical data. In a replay application, data are first queried and loaded from underlying storage systems and then processed with operations specified by users. Finally, the results are transformed into a live data stream and replayed to the upper-layer applications. For example, stock securities trading and online e-business services need data replay services to conduct system testing or ex-post review and analysis. However, such data replay systems are lacking in these areas.

As discussed above, a historical data replay system is a special kind of system: it needs not only complex query abilities but also stream-out processing abilities, which distinguishes it from both stream processing systems and database systems. Thus, existing systems of both kinds lack the data replay ability. Stream processing systems are internally designed for processing dynamic data streams with stream-in semantics, and thus cannot perform complex replay jobs over static historical data. Database systems support easy-to-use complex queries but lack the stream processing abilities required in replay services. In conclusion, it is necessary to research and implement a data replay system that provides replay services over static historical data.

Based on the background and requirements discussed above, this paper designed a general data replay model and framework, and, on top of this model and framework, designed and implemented an efficient distributed data replay system. The main work and contributions of this paper are as follows:

(1) To support flexible semantics and logic in data replay processing, this paper proposed a general data replay model and framework. The model and framework combine stream processing abilities with complex query abilities. Built-in replay operators are also designed to express the flexible and complex operations of upper-layer replay applications.

(2) Based on the above replay model and framework, this paper designed and implemented an efficient distributed data replay system called Penguin. Penguin allows developers to build high-throughput replay applications that query and replay large-scale static historical data from multiple underlying data sources. Besides, Penguin provides high and stable QoS for replay services with tunable replay speeds.

(3) To further improve data replay performance, this paper proposed two system-level optimizations: caching the results of loading tasks, and loading a single replay data stream in parallel.

(4) Experimental results on replaying millions of records demonstrate that Penguin achieves up to 2.5x and 47x speedup in data preparation, and up to 8x and 7x speedup in replay speed, compared with Apache Phoenix and Apache Hive, respectively.

(5) As a case study, Penguin has been deployed in the production environment of Huatai Securities to provide online historical stock data replay services to a large number of stock market users.

In summary, this paper proposed a big data replay computing model and designed and implemented an efficient distributed data replay system. The system has been deployed and used in real-world industrial applications in the securities area. Due to its novelty, a research paper based on this system has been accepted by the TPDS journal (a CCF A journal).
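The pipeline the abstract describes, loading partitions of a static historical stream in parallel and then replaying the merged records at a tunable speed, can be illustrated with a minimal sketch. This is an assumption-laden toy, not Penguin's actual API: the record format, `load_partition`, `replay`, and the sample data are all hypothetical, and inter-record pacing is done with simple sleeps scaled by the replay speed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record format: (timestamp_seconds, payload).
HISTORY = [
    (0.00, "trade A"),
    (0.05, "trade B"),
    (0.20, "trade C"),
]

def load_partition(partition):
    # Stand-in for querying one partition of historical data from an
    # underlying storage system; here it just sorts records by timestamp.
    return sorted(partition, key=lambda rec: rec[0])

def replay(records, sink, speed=1.0):
    """Emit records as a live stream, reproducing the original
    inter-record gaps divided by a tunable replay speed."""
    prev_ts = None
    for ts, payload in records:
        if prev_ts is not None:
            time.sleep((ts - prev_ts) / speed)
        prev_ts = ts
        sink(payload)

# Load two partitions of a single replay stream in parallel,
# then merge them by timestamp before streaming out.
parts = [HISTORY[:2], HISTORY[2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    loaded = list(pool.map(load_partition, parts))
merged = sorted((rec for part in loaded for rec in part), key=lambda r: r[0])

out = []
replay(merged, out.append, speed=10.0)  # replay 10x faster than real time
```

A real system would replace the in-memory partitions with query results from the underlying data sources and the `sink` callback with a stream producer; the point here is only how parallel loading and speed-scaled pacing fit together.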
Keywords/Search Tags:Data replay model, Data query, Stream processing, Data replay system, Distributed processing