DataZen is a Change Data Capture (CDC) technology that allows organizations to copy, replicate, and exchange data from any source system into
any target platform, even if those systems are hosted by different companies. This blog provides an overview of key design
principles used by DataZen that enable a flexible and fully decoupled replication architecture.
DataZen is designed to replicate data from any source to any target system, regardless of the inherent compatibility of these systems, or whether these systems are located in the same data center. For example, DataZen can replicate data from an Oracle database to a SQL Server database, or from a SharePoint List to Parquet files, or from Parquet files to a messaging platform such as an Azure Event Hub. This capability is possible thanks to the universal change format used by DataZen when creating the CDC Log (the CDC Log is sometimes referred to as a Data Sync File).
The generic CDC Log created by DataZen enables any-to-any replication.
CDC Log files can be played back on any target system.
The CDC Log is built by extracting data from the source system and turning every record into an internal data row. Extraction is performed by DataZen's built-in drivers, Enzo Server's Data Virtualization, or ODBC drivers, which turns any source system into a virtual database with an optional list of primary keys for unique record identification. Whenever possible, DataZen also extracts the detailed schema of the source table for the data being retrieved. The changed data is then stored in a compressed internal format directly within the CDC Log, which may be broken up into chunks of individual change tables so that each chunk can fit in memory. Each CDC Log can hold both Upserted (inserted and updated) and Deleted records.
Each CDC Log is assigned a unique Execution Id, which represents the timestamp when the log was created. The Execution Id is used to determine the order of execution of the log files, and is both part of the naming convention of the CDC Log File, and stored within the log file itself for auditing purposes. The CDC Log also contains additional metadata information, such as the source Job Name, summary change log information and other properties that are used when applying the log to target systems.
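To make the description above more concrete, here is a minimal Python sketch of what a CDC Log carries. It is an illustration only and not DataZen's actual file format: the `CdcLog` class, its field names, the `.cdclog` extension, and the microsecond encoding of the Execution Id are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any

Row = dict[str, Any]
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def make_execution_id(created: datetime) -> int:
    # Assumed encoding: whole microseconds since the Unix epoch.
    # `created` must be a timezone-aware datetime.
    return (created - EPOCH) // timedelta(microseconds=1)


@dataclass
class CdcLog:
    """Hypothetical in-memory stand-in for a CDC Log."""
    job_name: str                 # source Job Name kept as metadata
    execution_id: int             # creation timestamp, used for ordering
    schema: dict[str, str]        # column name -> type name, when available
    primary_keys: list[str]       # optional key columns
    upserts: list[Row] = field(default_factory=list)
    deletes: list[Row] = field(default_factory=list)

    def file_name(self) -> str:
        # Assumed naming convention: the Execution Id is part of the name.
        return f"{self.job_name}_{self.execution_id}.cdclog"

    def summary(self) -> dict[str, int]:
        # Summary change information stored alongside the data.
        return {"upserted": len(self.upserts), "deleted": len(self.deletes)}
```

Because the Execution Id is derived from the creation time, sorting a collection of logs with `sorted(logs, key=lambda log: log.execution_id)` yields the order in which they should be played back.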
DataZen allows administrators to secure the CDC Log using PGP encryption; this ensures that the log file can safely be copied across public networks and can only be played back by parties that have the associated decryption key.
DataZen creates the CDC Log by comparing data retrieved from the source system to the last known values in each row, using an efficient hashing mechanism. The hash of the last known values is stored in an internal Hash Table along with the hash value of the primary keys. Because only hashes are kept, DataZen can detect changes in any given row from a source system without storing the actual values of each row in its internal tables, which is a deliberate security choice.
Using its internal Hash Table, DataZen's CDC Engine can quickly identify new, updated, and deleted records. These changes are stored in
the CDC Log described previously. The ability to compare the state of each record using Hash Tables allows DataZen
to generate CDC Logs against any source system. Creating CDC Logs by comparing records from their last known state is
referred to as a Synthetic Change Data Capture; the CDC Log is constructed by inspecting the state of each record at specific
intervals.
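The hash comparison described above can be sketched in a few lines of Python. This is a simplified illustration and not DataZen's implementation: the function names, the use of SHA-256, and the JSON serialization are assumptions, and deleted records are reported here only by the hash of their key.

```python
import hashlib
import json
from typing import Any

Row = dict[str, Any]


def _digest(values: list) -> str:
    # Serialize deterministically, then hash; SHA-256 is an assumed choice.
    payload = json.dumps(values, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(rows: list[Row], key_columns: list[str],
                   hash_table: dict[str, str]) -> tuple[list[Row], list[str]]:
    """Compare rows against hash_table (key hash -> row hash).

    Updates hash_table in place and returns (upserted rows, deleted key hashes).
    Only hashes are stored; the row values themselves are never kept.
    """
    upserts: list[Row] = []
    seen: set[str] = set()
    for row in rows:
        key_hash = _digest([row[c] for c in key_columns])
        row_hash = _digest([[c, row[c]] for c in sorted(row)])
        seen.add(key_hash)
        if hash_table.get(key_hash) != row_hash:   # new or changed record
            upserts.append(row)
            hash_table[key_hash] = row_hash
    deleted = [k for k in hash_table if k not in seen]
    for k in deleted:                              # forget records that are gone
        del hash_table[k]
    return upserts, deleted
```

On the first run the hash table is empty, so every source row is reported as an upsert; on later runs only rows whose hash differs from the stored one are reported, and keys that no longer appear in the source are reported as deleted.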
Because a Synthetic CDC inspects changes made to the data at a given point in time, not all changes made to the source system
may be detected. For example, if a record has been added then deleted almost immediately, the CDC Engine may not know
that the record was ever created in the first place because only the net changes will be identified.
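The following short Python sketch illustrates this limitation by comparing two snapshots of a toy source. The `snapshot_diff` helper and the sample data are hypothetical and exist only for this example.

```python
def snapshot_diff(previous: dict, current: dict):
    """Return (added, updated, deleted) between two key -> value snapshots."""
    added = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = [k for k in previous if k not in current]
    return added, updated, deleted


source = {1: "alpha"}
snapshot_1 = dict(source)      # first capture

source[2] = "beta"             # record 2 is inserted...
del source[2]                  # ...and deleted before the next capture
source[1] = "ALPHA"            # record 1 is updated

snapshot_2 = dict(source)      # second capture
result = snapshot_diff(snapshot_1, snapshot_2)
# result == ({}, {1: "ALPHA"}, [])
```

Record 2 never appears in the result because it did not exist at either capture point; only the net change to record 1 is detected.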
Certain systems, such as SharePoint Online or SQL Server, provide their own CDC tables; when available, DataZen can be configured to query the source system's own CDC table to capture all changes.
Advanced options are available to fetch only the records that were added or updated in the source system by leveraging date/time fields when available. This is particularly important for slower or remote source systems such as SharePoint Online; when doing so, the CDC Engine may need to make a second call into the source system to identify deleted records. DataZen offers advanced options to query source systems for deleted records.
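As a rough illustration of this pattern, the Python sketch below filters rows by a last-modified field and then uses a separate key comparison to find deletions. The function names and the `modified` column are assumptions for this example; in practice the date filter would be pushed down to the source system's own query rather than applied in memory.

```python
from datetime import datetime
from typing import Any, Optional

Row = dict[str, Any]


def fetch_since(rows: list[Row], modified_column: str,
                watermark: Optional[datetime]) -> tuple[list[Row], Optional[datetime]]:
    """Return rows modified after the watermark, plus the new watermark."""
    changed = [r for r in rows
               if watermark is None or r[modified_column] > watermark]
    # Keep the previous watermark when nothing changed.
    new_watermark = max((r[modified_column] for r in changed), default=watermark)
    return changed, new_watermark


def find_deleted_keys(known_keys: set, current_keys: set) -> set:
    # The "second call": fetch only the keys currently in the source and
    # compare them with the keys seen so far.
    return known_keys - current_keys
```

The date filter alone cannot reveal deletions, because a deleted record simply stops appearing in the results; this is why a second, keys-only pass is needed.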
DataZen uses the CDC Log files described previously as the basis for replicating data; since these files encapsulate data, schema, and general configuration settings, they are fully self-describing and can be copied anywhere and played back in another DataZen environment, against any target system, at any time. DataZen uses Reader Agents to create the CDC Logs, and Writer Agents to play them back.
DataZen's file-based replication model enables the following capabilities:
B2B Data Replication: The ability to copy CDC Logs anywhere (including cloud folders) allows two or more companies to exchange/replicate data regardless of the source and target system. The CDC Logs can be stored in Azure Blobs, AWS S3 Buckets, or an FTP site, for example. The CDC Logs can also be PGP encrypted for additional security.
Log Replay & Multicasting: DataZen offers the option to replay a single CDC Log file, or replay all available CDC Log files in sequence. Since CDC Log files are available for replay, they can be processed multiple times against multiple target systems independently.
Full Initialization: By its very nature, the CDC Engine creates an initial log file that contains all the identified source records the first time it runs. This enables DataZen to create an initialization log that can be played against any target system, just like any other CDC Log.
Shared-Nothing Architecture: The replication model used by DataZen leverages the benefits of a shared-nothing architecture, a distributed computing model in which the target systems are unaware of the existence of the source systems, and vice-versa. Source and Target systems can operate independently and any part of the replication topology can be upgraded without a system-wide shutdown.
Schema Independence: Since each CDC Log contains schema information, each target can select which data elements to replicate. This allows each target to have a different schema if desired. One target could be an HTTP Endpoint, a second target could be a relational database, and a third one could be Parquet files, for example.
Micro Batch Processing: Because CDC Logs are created on a schedule, capturing changes in bulk at specific intervals, DataZen implements a micro batch processing pattern that reduces network chattiness and, as a result, improves replication performance.
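The replay, multicasting, and schema independence capabilities listed above can be sketched together in Python. This is a toy model and not DataZen's Writer Agent: the `Log` and `DictTarget` classes, the `replay` function, and the idea of tracking the last applied Execution Id per target are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Log:
    """Hypothetical, simplified CDC Log."""
    execution_id: int
    upserts: list[dict[str, Any]]
    deletes: list[Any]            # key values of deleted records


@dataclass
class DictTarget:
    """Toy target system that stores rows in a dictionary."""
    key: str
    columns: Optional[list[str]] = None   # subset of columns to replicate
    rows: dict = field(default_factory=dict)
    last_applied: int = -1

    def apply(self, log: Log) -> None:
        for row in log.upserts:
            kept = row if self.columns is None else {c: row[c] for c in self.columns}
            self.rows[row[self.key]] = kept
        for k in log.deletes:
            self.rows.pop(k, None)
        self.last_applied = log.execution_id


def replay(logs: list[Log], target: DictTarget) -> int:
    """Apply logs in Execution Id order, skipping those already applied."""
    applied = 0
    for log in sorted(logs, key=lambda item: item.execution_id):
        if log.execution_id > target.last_applied:
            target.apply(log)
            applied += 1
    return applied
```

Each target keeps its own position and its own column selection, so the same set of logs can be replayed against several targets independently, and the first (initialization) log is handled like any other.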
This article introduces you to DataZen, a flexible any-to-any replication technology that uses universal Synthetic CDC Log files and a shared-nothing architecture for maximum flexibility. This in turn allows companies to leverage many capabilities, such as secured Business-to-Business replication, replay capabilities, full initialization, and multicasting.
To learn more about configuration options, and to learn about the capabilities of DataZen,
download the User Guide now.
© 2023 - Enzo Unified