
Design And Implementation Of Data Replication Center

Posted on: 2021-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: H D Ren
Full Text: PDF
GTID: 2518306503999489
Subject: Software engineering

Abstract/Summary:
The Data Replication Center aims to solve data synchronization problems, such as replicating online data into data warehouses for analysis. The well-known change data capture (CDC) pattern can be applied to subscribe to real-time updates in order to refresh a cache or a search engine, or to trigger an asynchronous business process. To support fault tolerance at the data center level, data must be transferred between different instances. Existing solutions in this area have drawbacks: some are only command-line tools that lack high availability; some support a specific data source but cannot be extended easily; most provide the same consistency level as the source, while only a few, if any, explicitly define their consistency guarantee. Moreover, in both cases, they fail to optimize for the replication scenario.

This paper introduces the design and implementation of a data replication center. In the design, system requirements are first derived from use cases. Second, a replicated-state-machine-based cluster solution for bare metal and a cloud-native application for Kubernetes clusters are presented; both eliminate single points of failure and provide high availability and scalability. For extensibility, a plugin-based execution engine makes it easy to support different kinds of data sources. Finally, as the core of the execution engine, a row-level consistency semantics and the corresponding concurrency control algorithm suited to the replication scenario are presented. Compared with conventional solutions, this semantics sacrifices some consistency requirements for better throughput, and batching is leveraged to increase throughput further.

In the implementation, a unified message structure provides a uniform abstraction over different data sources, which reduces coupling and enhances extensibility. To reduce the cost of scanning a large table, we present a general primary key paging algorithm, a sampled-segment parallel read algorithm, and the abandonment of the globally consistent read. The system makes cyclic replication possible by labelling the traffic. It also stores a log of metadata to provide higher availability in edge cases: most data sources do not provide historical table structures, which makes restarting the replication system error prone when metadata changes continuously. Finally, we mention three additional efforts for the observability of the system: chaos engineering is used to expose bugs in advance; inspired by streaming systems, a real-time data checking framework is presented; and watermarks are continuously monitored.

The system has been deployed in production for one and a half years. Some tasks have transferred thousands of terabytes of data; others perform real-time syncing of several billion rows per day across hundreds of data source instances. The 95th percentile of end-to-end latency is less than 100 ms, including 95 ms of read and write latency through the data sources. When syncing from MySQL to MySQL, throughput reaches 50 thousand rows per second in a 4-CPU Docker container, which is much faster than MySQL's native replication.
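The general primary key paging idea mentioned above can be illustrated with a minimal sketch: instead of scanning a large table with offsets or a long-running consistent read, each batch resumes from the last primary key seen. The table name, the single numeric key column `id`, and the JDBC wiring below are assumptions for illustration only, not the thesis' actual implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Minimal sketch of primary-key paging over a large table,
 * assuming a single numeric primary key column named "id".
 */
public class PrimaryKeyPagingScan {

    public static void scan(Connection conn, String table, int batchSize) throws SQLException {
        long lastId = Long.MIN_VALUE; // lower bound for the next page
        String sql = "SELECT * FROM " + table + " WHERE id > ? ORDER BY id LIMIT ?";
        while (true) {
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, lastId);
                ps.setInt(2, batchSize);
                try (ResultSet rs = ps.executeQuery()) {
                    int rows = 0;
                    while (rs.next()) {
                        rows++;
                        lastId = rs.getLong("id"); // remember the last key seen
                        // hand the row to the replication pipeline here
                    }
                    if (rows < batchSize) {
                        return; // fewer rows than requested: end of table reached
                    }
                }
            }
        }
    }
}
```

Because each page is keyed on the primary key rather than a global snapshot, the scan can be restarted or parallelized per key range, which matches the abstract's point about abandoning the globally consistent read in favor of throughput.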
Keywords/Search Tags: Change Data Capture, Consistency, Batch Optimization, Distributed System, Cloud Native