

Enterprise Data Replication Patterns Overview



July 2024
Data Engineering

Introduction

This article summarizes the primary enterprise data replication patterns used by data engineers in data integration, data lake, data hub, data warehouse, and real-time messaging scenarios. Data engineering tasks have become increasingly complex over the years as the need for interoperability has grown with the proliferation of SaaS systems. As companies continue to adopt best-in-class, purpose-built SaaS solutions (such as payment automation, AP automation, CRM, ERP, and many more), the corporate data landscape is fragmenting into a series of largely disconnected data islands. While SaaS platforms give organizations the business-centric capabilities they need to compete, the need to push data to and pull data from these platforms keeps growing, for reasons both tactical and strategic.

To address this level of complexity, and to provide a more systematic approach to enterprise data integration that can be applied regardless of the technologies in scope, the data engineering team at Enzo Unified has established integration patterns that fall into three categories: Stateless, Stateful, and Composite. These patterns are the result of years of successful implementations and lessons learned across industries, technologies, and companies of all sizes and enterprise maturity levels.

Notes and Convention

This article describes patterns that are technology agnostic and apply to a variety of platforms, both as sources and targets, including relational databases, NoSQL databases, HTTP/S endpoints, files, and messaging platforms. Unless noted otherwise, these patterns work on all system types.

Data Processing Architecture

The patterns below support batch, mini-batch (near-time), and real-time data integration processing. Individually or in combination, they support virtually any enterprise integration architecture, such as the Remote Site Replication architecture; combining mini-batch and real-time streaming capture also enables other well-known architectures such as the Lambda and Kappa architectures.

Data Pipelines

Data pipelines are a core component of any data replication topology; every company that moves data, in or out, implements a pipeline mechanism of some kind, even if it is a manual process or simply an FTP site used as the transport medium. Most companies, however, use a technology platform that provides the necessary features to implement these patterns.

In some systems, data pipelines may be referred to as data flows, change streams, ETL or ELT pipelines, or even packages or jobs. These pipelines typically offer a number of capabilities that vary based on the platform used, such as data enrichment, filtering, schema management, and more. While important, these implementation-specific capabilities are not listed in the patterns below in order to keep this article focused on enterprise patterns.

Conventions

The following convention is used in the diagrams below. A small database icon is added to the diagram to indicate that state management is needed to implement the pattern, such as keeping high watermark values, change capture state, or a copy of the source or target data for comparison.

Stateless Patterns

These patterns push data from one system to another without tracking whether the data was previously sent (that is, without state management). In other words, these patterns push data blindly, on demand or on schedule.

  • Snapshot

    This pattern captures the entire data set from the source system, with or without a filter, and is typically used to initialize target systems. If the data set is modified on the fly by adding a timestamp, this pattern can also be used as a point-in-time capture for historical analysis (see the sketch after this list).

  • Hook

    This pattern forwards data received by a listener exposed by the data pipeline platform, or through a messaging consumer mechanism, in order to insert, update, or delete records in the target system.
    This pattern typically applies to messaging and HTTP/S webhooks at the source; it is the only pattern where data is injected into the replication system instead of being extracted from the source.

  • CDC Stream

    This pattern forwards the data provided by a native CDC engine; this data normally represents a change stream. The source system is responsible for capturing and identifying which records were added, modified, and deleted. In some cases, keeping a high watermark on the CDC stream is necessary; if so, the CDC Stream + Watermark composite pattern is used.
    This pattern typically applies to certain relational databases at the source, such as SQL Server with Change Tracking enabled.
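To make the Snapshot pattern concrete, here is a minimal Python sketch that reads an entire table (optionally filtered) and stamps each row with the capture time so the extract can double as a point-in-time copy. The SQLite source, CSV target, and "customers" table are assumptions for illustration; the same shape applies to any source/target pair.

    import csv
    import sqlite3
    from datetime import datetime, timezone

    def snapshot(source_db, table, out_path, where=None):
        """Capture the entire data set (optionally filtered), stamping each
        row with the capture time for point-in-time historical analysis."""
        captured_at = datetime.now(timezone.utc).isoformat()
        query = f"SELECT * FROM {table}" + (f" WHERE {where}" if where else "")
        with sqlite3.connect(source_db) as conn:
            cursor = conn.execute(query)
            columns = [col[0] for col in cursor.description]
            with open(out_path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(columns + ["_captured_at"])  # point-in-time marker
                for row in cursor:
                    writer.writerow(list(row) + [captured_at])

    # Example: full snapshot of a hypothetical "customers" table
    # snapshot("source.db", "customers", "customers_snapshot.csv")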

Stateful Patterns

These patterns push data from one system to another along with a tracking mechanism that determines whether the data was previously sent. In other words, these patterns push data selectively, using a watermark or a synthetic change capture mechanism, on demand or on schedule.

  • Watermark

    This pattern is a forward-only read mechanism that keeps track of a high watermark value, used as a filter against the source data, so that only the most recent records are captured. The watermark is usually a timestamp, a date/time, or a numeric value.

  • Change Data Capture (CDC)

    This pattern is a synthetic operation that reads all available source records and leverages an intermediate state table to identify which records were added, updated, or deleted. The implementation usually involves identifying a unique key (or composite key) and hashing both the key(s) and the row of data (see the sketch after this list).

  • Window Capture

    This pattern is a read operation that goes back in time over a fixed or sliding window to recapture previously captured data in case some records were modified. It is used when the source system does not provide a way to filter the data based on a high watermark, and it is usually combined with the CDC pattern to identify changes.
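To illustrate the synthetic CDC pattern, the sketch below hashes each source row, compares the key-to-hash map against the previously persisted state table, and classifies rows as inserts, updates, or deletes. This is a minimal Python sketch; the in-memory state store, the single "id" key, and the row shape are assumptions, and a real implementation would persist the state table between runs.

    import hashlib
    import json

    def row_hash(row):
        """Stable hash of the whole row; any column change alters the hash."""
        return hashlib.sha256(
            json.dumps(row, sort_keys=True, default=str).encode()
        ).hexdigest()

    def synthetic_cdc(source_rows, key, state):
        """Classify rows as inserted/updated/deleted by comparing key->hash
        pairs against the previously persisted state table."""
        inserts, updates = [], []
        seen = {}
        for row in source_rows:
            k = str(row[key])
            h = row_hash(row)
            seen[k] = h
            if k not in state:
                inserts.append(row)
            elif state[k] != h:
                updates.append(row)
        deletes = [k for k in state if k not in seen]  # keys gone from the source
        return inserts, updates, deletes, seen  # persist `seen` as the new state

    # Two successive reads of a hypothetical table
    state = {}
    ins, upd, dels, state = synthetic_cdc(
        [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}], "id", state)  # 2 inserts
    ins, upd, dels, state = synthetic_cdc(
        [{"id": 1, "name": "Ada L."}], "id", state)  # 1 update; id 2 reported deleted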

Composite Patterns

The patterns in this section combine two or more of the patterns defined above, possibly mixing stateless and stateful patterns to achieve the desired integration outcome.
The following patterns are some of the most commonly used, but please note that this list is not comprehensive.

  • Watermark + CDC

    A forward-only read mechanism that keeps track of a high watermark value and filters out records that were previously captured. This pattern may be necessary when the source system does not honor the supplied high watermark with the same precision as the mark itself and, as a result, may return duplicate records from time to time.

  • One-Way Sync

    An integration strategy that keeps one or more target systems synchronized with a source system, where the source system is the system of record. This pattern doesn't require the target system to be read-only; however, it assumes that the source system contains the latest data at all times. Data engineers can also choose to exclude certain fields from being replicated to the target system. The target system may also be treated as a source when deleted records need to be identified.

  • Two-Way Sync

    An integration strategy that keeps two or more systems synchronized when data can be updated in any of them. This pattern introduces the possibility of change conflicts and may require the data engineer to apply an agreed-upon heuristic to determine which system wins a conflict (see the sketch after this list). This pattern usually requires the use of CDC at a minimum, and sometimes the CDC Stream pattern.

  • Aggregation

    Centralizing data from multiple sources is usually needed when companies manage multiple remote sites or, through mergers and acquisitions, operate a number of disparate systems that provide similar capabilities.
    The Remote Site Replication architecture implements this pattern.
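As an example of the conflict heuristic mentioned under Two-Way Sync, the sketch below applies a last-writer-wins rule with an agreed system priority as the tie-breaker. The record shape, the updated_at field, and the priority order are assumptions for illustration, not a prescribed implementation.

    from datetime import datetime

    def resolve_conflict(versions, priority):
        """Pick the winning version of a record edited in several systems.
        Heuristic: latest updated_at wins; ties fall back to an agreed
        system priority order (e.g., the ERP beats the CRM)."""
        def rank(item):
            system, record = item
            # Higher timestamp wins; lower priority index breaks ties.
            return (record["updated_at"], -priority.index(system))
        system, winner = max(versions.items(), key=rank)
        return {"winning_system": system, **winner}

    # Two systems updated the same customer record at the same time
    versions = {
        "crm": {"id": 7, "email": "a@x.com", "updated_at": datetime(2024, 7, 1, 10, 0)},
        "erp": {"id": 7, "email": "b@x.com", "updated_at": datetime(2024, 7, 1, 10, 0)},
    }
    print(resolve_conflict(versions, priority=["erp", "crm"]))  # tie -> erp wins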

Summary

As the need to integrate disconnected systems continues to grow, establishing clear data integration and replication guidance is becoming more important for a number of reasons, from purely tactical to more strategic in nature.

To address this need, Enzo Unified has established integration patterns that fall into three categories: Stateless, Stateful, and Composite. Although the data engineering patterns proposed in this article are technology agnostic, current technology platforms offer varying levels of support for these patterns.

As a leader in data virtualization, any-to-any data replication, and data integration, Enzo Unified is here to assist you with your data integration and engineering projects. Contact us at info@enzounified.com for more information.





