
Design And Implementation Of Data Replication Center

Posted on: 2021-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: H D Ren
Full Text: PDF
GTID: 2518306503999489
Subject: Software engineering

Abstract/Summary:
The Data Replication Center aims to solve data synchronization problems, such as replicating online data into data warehouses for analysis. The well-known change data capture (CDC) pattern can be applied to subscribe to real-time updates in order to refresh a cache or a search engine, or to trigger an asynchronous business process. To support fault tolerance at the data center level, data must be transferred between different instances. Existing solutions in this area have drawbacks: some are only command-line tools that lack high availability; some support a specific data source but cannot be extended easily; most provide the same consistency level as the source, while only a few, if any, explicitly define their consistency guarantee. Moreover, in both cases, they fail to optimize for the replication scenario.

This paper introduces the design and implementation of a data replication center. In the design, system requirements are first derived from use cases. Second, a replicated-state-machine-based cluster solution for bare metal and a cloud-native application for Kubernetes clusters are presented; both eliminate single points of failure and provide high availability and scalability. For extensibility, a plugin-based execution engine makes it easy to support different kinds of data sources. Finally, as the core of the execution engine, a row-level consistency semantics and the corresponding concurrency control algorithm suited to the replication scenario are presented. Compared with conventional solutions, this semantics sacrifices some consistency requirements for better throughput, and batching is leveraged to increase throughput further.

In the implementation, a unified message structure provides a uniform abstraction over different data sources, which reduces coupling and enhances extensibility. To reduce the cost of scanning a large table, we present a general primary key paging algorithm, a sampled-segment parallel read algorithm, and the abandonment of the globally consistent read. The system makes cyclic replication possible by labelling the traffic. It also stores a log of metadata to provide higher availability in edge cases: most data sources do not provide historical table structures, which makes restarting the replication system error prone when metadata changes continuously. Finally, we mention three additional efforts for the observability of the system: chaos engineering is used to expose bugs in advance; inspired by streaming systems, a real-time data checking framework is presented; and watermarks are continuously monitored.

The system has been deployed in production for one and a half years. Some tasks have transferred thousands of terabytes of data; others perform real-time syncing of several billion rows per day across hundreds of data source instances. The 95th percentile of end-to-end latency is less than 100 ms, including 95 ms of read and write latency through the data sources. When syncing from MySQL to MySQL, throughput reaches 50 thousand rows per second in a 4-CPU Docker container, which is much faster than MySQL's native replication.
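The general primary key paging idea mentioned above can be illustrated with a minimal sketch: instead of scanning a large table with offsets or a long-running consistent read, each batch resumes from the last primary key seen. The table name, the single numeric key column `id`, and the JDBC wiring below are assumptions for illustration only, not the thesis' actual implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Minimal sketch of primary-key paging over a large table,
 * assuming a single numeric primary key column named "id".
 */
public class PrimaryKeyPagingScan {

    public static void scan(Connection conn, String table, int batchSize) throws SQLException {
        long lastId = Long.MIN_VALUE; // lower bound for the next page
        String sql = "SELECT * FROM " + table + " WHERE id > ? ORDER BY id LIMIT ?";
        while (true) {
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, lastId);
                ps.setInt(2, batchSize);
                try (ResultSet rs = ps.executeQuery()) {
                    int rows = 0;
                    while (rs.next()) {
                        rows++;
                        lastId = rs.getLong("id"); // remember the last key seen
                        // hand the row to the replication pipeline here
                    }
                    if (rows < batchSize) {
                        return; // fewer rows than requested: end of table reached
                    }
                }
            }
        }
    }
}
```

Because each page is keyed on the primary key rather than a global snapshot, the scan can be restarted or parallelized per key range, which matches the abstract's point about abandoning the globally consistent read in favor of throughput.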
Keywords/Search Tags: Change Data Capture, Consistency, Batch Optimization, Distributed System, Cloud Native