Replicating SQL Server to Parquet files is quickly becoming a valuable function for
modern data management.
Here, we will discuss some of the key benefits of implementing a replication solution for SQL Server
to Parquet files
in Azure ADLS using Enzo's DataZen ingestion pipeline. DataZen can quickly be configured to replicate a SQL Server table and
continuously forward changes to the table as individual Parquet files, directly into
an Azure Storage Account with ADLS Gen2 enabled. This allows organizations to build a Data Lake very quickly using the
preferred storage format for Azure Synapse: Parquet files.
The demo video in this article demonstrates how to configure DataZen to achieve this objective and shows you
how to query the data using Azure Synapse after the Parquet files are created by DataZen in ADLS.
It's all about increased speed and reduced costs.
Modern analytical platforms, such as Azure Synapse, use Parquet files as their underlying
Data Lake storage mechanism for performance reasons, including:

- Columnar storage, so queries read only the columns they need
- Efficient compression and encoding, which reduces both storage footprint and I/O
- Embedded metadata and column statistics, which enable predicate pushdown and file skipping
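As a quick illustration of the columnar layout (a generic pyarrow sketch, unrelated to DataZen itself), reading a subset of columns from a Parquet file does not scan the rest of the file:

```python
# Generic pyarrow illustration of Parquet's columnar layout -- not DataZen code.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "OrderId": [1, 2, 3],
    "Amount":  [25.5, 14.0, 99.9],
    "Country": ["US", "FR", "DE"],
})
pq.write_table(table, "orders.parquet")

# Column pruning: only the requested columns are read from disk.
subset = pq.read_table("orders.parquet", columns=["OrderId", "Amount"])
print(subset)
```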
These characteristics are the primary reasons Parquet files are so popular with many analytics platforms, including
Azure Synapse. In addition to enabling efficient data reads within a single file, Parquet files
are also used by Synapse as the underlying mechanism to provide a Delta Lake.
A Delta Lake stores "delta" files that hold changes to
the original data, so updates to the primary/initial Parquet files are not necessary. Changes
are synthesized quickly at query time, so that when a query is executed against the Data Lake, the latest information
is presented to the user. Since the delta files are themselves Parquet files, the entire Delta Lake architecture builds on the same format.
Not only can DataZen create Parquet files to be stored in ADLS for use by Azure Synapse, it
can also generate these "delta" files and load them directly into a Delta Lake used by Azure Synapse to
ensure the latest data is always available in the lake.
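For illustration, here is a minimal sketch using the open-source deltalake Python package (not part of DataZen; the table path is hypothetical) showing that a Delta Lake table is ultimately a set of Parquet files plus a transaction log, and that readers always see the latest synthesized version:

```python
# Minimal sketch using the open-source `deltalake` package -- not DataZen code.
# The table path is hypothetical.
from deltalake import DeltaTable

table = DeltaTable("./lake/sales")

# The table is backed by plain Parquet data files tracked by a transaction log.
print(table.files())

# Readers always get the synthesized, latest version of the data.
df = table.to_pandas()
print(df.head())
```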
To replicate SQL Server to Azure ADLS storage, we configure DataZen to read data from a source table in a
SQL Server database and choose a change-detection strategy (such as a high-watermark field) on the source
table, so that DataZen can forward both
the initial set of records and any ongoing changes as individual Parquet files.
While this blog uses SQL Server as the data source, using Oracle, Teradata, MySQL, or an ODBC driver as the source system is just as simple.
In the demo provided in the video, the High Watermark strategy is selected for speed and simplicity; however, DataZen
can use multiple change-detection strategies depending on the use case, such as:

- A High Watermark field on the source table, such as a timestamp or an ever-increasing key
- Synthetic CDC, which automatically detects changes on any source system

Generally speaking, it is best to use a High Watermark field for performance reasons and overall simplicity. In addition, using Synthetic CDC is not recommended when the target system is a file, because updating files usually introduces significant performance degradation.
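To make the High Watermark approach concrete, here is a minimal hand-rolled sketch of the pattern that DataZen automates (the connection string, table, and column names are hypothetical):

```python
# Hand-rolled sketch of the high-watermark pattern DataZen automates.
# Assumes pyodbc and pyarrow; the table and column names are hypothetical.
import pyodbc
import pyarrow as pa
import pyarrow.parquet as pq

last_watermark = "2023-01-01 00:00:00"  # persisted from the previous run

conn = pyodbc.connect("DSN=SourceSqlServer")
rows = conn.cursor().execute(
    "SELECT OrderId, Amount, LastUpdated FROM dbo.Orders "
    "WHERE LastUpdated > ? ORDER BY LastUpdated",
    last_watermark,
).fetchall()

if rows:
    # Forward the changed records as a new, individual Parquet file.
    batch = pa.table({
        "OrderId":     [r.OrderId for r in rows],
        "Amount":      [float(r.Amount) for r in rows],
        "LastUpdated": [r.LastUpdated for r in rows],
    })
    pq.write_table(batch, "orders_changes_0001.parquet")

    # Advance the watermark to the highest value just captured.
    last_watermark = rows[-1].LastUpdated
```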
Using DataZen to implement a replication topology from a SQL Server database to Parquet files in an Azure ADLS environment offers multiple benefits, including:
First and foremost, DataZen is a point-and-click configuration system. No custom development is needed and,
just as importantly, no complex ETL diagramming is involved. In other words, DataZen allows system engineers
to build their data replication topology without designing ETL jobs or compiling code, saving a significant
amount of time.
With DataZen, data engineers can build a data replication topology in ADLS using Parquet files in just a
few minutes.
DataZen creates intermediate Change Logs that store the modified or new records captured since the last run. These Change Logs hold data in a source- and target-agnostic format; this decoupling of source and target means the same Change Log can be applied to any supported target system and replayed at any time.
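Purely for illustration (DataZen's actual Change Log format is internal to the product), a source- and target-agnostic change record can be pictured as a small self-describing structure:

```python
# Illustrative only -- DataZen's actual Change Log format is internal to the
# product. The point: each captured batch carries its own schema and metadata,
# so any target writer (Parquet file, database, API) can consume it.
change_log_batch = {
    "capture_id": 42,                        # sequence number for ordered replay
    "captured_at": "2023-06-01T12:00:00Z",
    "schema": {"OrderId": "int64", "Amount": "float64", "LastUpdated": "timestamp"},
    "records": [
        {"OrderId": 1001, "Amount": 25.50, "LastUpdated": "2023-06-01T11:58:03Z"},
        {"OrderId": 1002, "Amount": 14.00, "LastUpdated": "2023-06-01T11:59:41Z"},
    ],
}
```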
DataZen implements a push replication model, not a pull model. This offers multiple advantages, including
the decentralization of the replication topology and a more secure environment. A push model eliminates
the need to build a master-slave architecture that controls the replication topology.
DataZen leverages a distributed, Shared-Nothing architecture that enables a
loosely-coupled replication model, which is far simpler to administer.
From a security standpoint, a push model eliminates the need to open inbound ports into your
internal network. In this scenario, for example, DataZen pushes its change logs directly to Azure
over an HTTP/S connection.
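As a generic illustration of such a push (this is standard Azure SDK code, not DataZen's internal implementation; the account, container, and file names are hypothetical):

```python
# Generic illustration of an outbound HTTP/S push to ADLS Gen2 -- not
# DataZen's internal code. Assumes the azure-storage-file-datalake package;
# account, container, and file names are hypothetical.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<account-key-or-sas-token>",
)
filesystem = service.get_file_system_client("datalake")

# Only an outbound HTTPS connection is needed; nothing listens on-premises.
with open("orders_changes_0001.parquet", "rb") as data:
    file_client = filesystem.get_file_client("sales/orders_changes_0001.parquet")
    file_client.upload_data(data, overwrite=True)
```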
A key feature of the DataZen Parquet target is the ability to partition records across multiple files based on
source data and other parameters. To achieve this, data engineers use tokens in the naming
convention of the container, the ADLS path, and the file name.
This feature can be used to distribute data across folders using date tokens, such as year, month, and day.
In addition, the source data itself can be used in the naming convention, such as a Country code, a LocationId value,
or any other data element.
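For example (the token syntax shown here is illustrative, not DataZen's exact convention; see the User Guide for the actual tokens), a naming convention along these lines could route each record set to a date- and country-partitioned path:

```python
# Illustrative token expansion -- the {token} syntax is hypothetical, not
# DataZen's exact convention. Date tokens and source-data tokens route
# records into partitioned ADLS paths.
from datetime import datetime, timezone

PATH_TEMPLATE = "sales/{YYYY}/{MM}/{DD}/orders_{Country}.parquet"

def expand_path(template: str, record: dict) -> str:
    now = datetime.now(timezone.utc)
    return (template
            .replace("{YYYY}", f"{now.year:04d}")
            .replace("{MM}", f"{now.month:02d}")
            .replace("{DD}", f"{now.day:02d}")
            .replace("{Country}", str(record["Country"])))

print(expand_path(PATH_TEMPLATE, {"Country": "US"}))
# e.g. sales/2023/06/01/orders_US.parquet
```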
DataZen sends change logs to the target systems in order, as soon as possible. If a change log cannot be sent due to network connectivity issues, DataZen continues to capture changes from the source system and creates additional change logs as usual. As soon as the network link is restored, DataZen replays all the change logs that were not successfully sent, in sequence. Because the push model lets DataZen create change logs continuously and transmit them once connectivity to the cloud returns, it works well in environments with sporadic network connectivity.
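The replay behavior can be pictured as a simple ordered queue (a conceptual sketch, not DataZen's implementation):

```python
# Conceptual sketch of ordered replay -- not DataZen's implementation.
from collections import deque

# Change logs keep accumulating locally, even while the link is down.
pending = deque(["changes_0001.parquet", "changes_0002.parquet",
                 "changes_0003.parquet"])

def try_send(name: str) -> bool:
    # Placeholder for a real HTTP/S upload; returns False on a network error.
    print(f"sent {name}")
    return True

# Replay strictly in capture order; stop at the first failure and retry on
# the next connectivity check, so ordering is always preserved.
while pending and try_send(pending[0]):
    pending.popleft()
```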
DataZen also provides a data pipeline engine that allows data engineers to modify the data before it is saved in its change log and/or before it is sent to the target system. This is typically used to format specific fields, add one or more descriptive columns, translate date/time fields, and, when needed, mask or remove sensitive data. Unlike a traditional ETL (Extract, Transform, Load) job, the data pipeline engine is a lightweight ELT (Extract, Load, Transform) tool that functions as an inline set of small rules applied to each record.
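Conceptually (a sketch of the idea, not DataZen's rule syntax), such inline rules behave like small functions applied to each record in turn:

```python
# Conceptual sketch of inline per-record rules -- not DataZen's rule syntax.
from datetime import datetime

def mask_ssn(record):
    record["SSN"] = "***-**-" + record["SSN"][-4:]       # mask sensitive data
    return record

def add_region(record):                                  # add a descriptive column
    record["Region"] = "EMEA" if record["Country"] in ("FR", "DE") else "AMER"
    return record

def format_date(record):                                 # translate a date/time field
    record["LastUpdated"] = datetime.fromisoformat(
        record["LastUpdated"]).strftime("%Y-%m-%d %H:%M:%S")
    return record

RULES = [mask_ssn, add_region, format_date]

record = {"SSN": "123-45-6789", "Country": "FR",
          "LastUpdated": "2023-06-01T11:58:03"}
for rule in RULES:   # rules are applied inline, record by record
    record = rule(record)
print(record)
```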
Replicating SQL Server to Parquet files in ADLS is the beginning of how DataZen can improve the data sourcing needs in your environment. With a no-code data ingestion pipeline like DataZen, you can also address other critical data management functions, such as HTTP/S and hybrid data ingestion and messaging integration.
- Integrate data from virtually any source system to any platform.
- Support for High Watermarks and fully automated Change Data Capture on any source system.
- Extract, enrich, and ingest data without writing a single line of code.
To learn more about configuration options and the capabilities of DataZen, download the User Guide now.