Enzo Unified | Implement Data Masking for Dev/Test Environments with DataZen

This blog provides an overview of Data Pipelines in Enzo DataZen, with a focus on data masking and hashing. Data masking and hashing allows organizations to build or refresh lower environments by copying production data, while protecting sensitive information for compliance or security reasons.

Scenarios
With DataZen and data pipelines you can implement the following scenarios:

Create lower environments with masked production data
Refresh lower environments with masked new/updated production data
Share data with B2B partners while masking key customer data
Hash sensitive columns keeping uniqueness of records
Filter production data differently based on the target environment

For example, you can easily implement the following data refresh lifecycle with DataZen using data pipelines: production data stored in Oracle is refreshed on a periodic basis to the Staging environment with an initial data masking of highly sensitive data and filtering executive pay records. The Quality Assurance environment obtains a further masked set of data, with even fewer records. Finally, the R&D environment uses a minimal set of masked records from production data, but requires the information on other platforms for integration purposes, such as Teradata and MySQL.

DataZen allows cross database replication, from any source to any target. Data masking and hashing takes place during the replication process, either while reading from the source system, and/or before applying changes to the target system.

Overview of Data Pipelines in DataZen

DataZen implements the concept of a data pipeline for its replication jobs. The data pipeline feature of DataZen is a data processing engine through which data flows and may be modified as desired. Four components are available in a data pipeline: Data Filter, Data Hashing, Data Masking and Data Quality. You can combine any number of components in each replication job, and choose the order of execution of each component.

Data Filtering

Applies a secondary filter to the data by adding a SQL Where clause as a Data Filter. This can be useful when the data source doesn’t support a full SQL syntax, or when filtering needs to occur after hashing or masking has occurred.

Data Hashing

Applies a hash algorithm to a selected data column (must be a string data type); supported hashing algorithms are MD5, SHA1, SHA256, SHA384 and SHA512.

Data Masking

Applies masking logic to a selected data column, such as credit card number or a phone number. Supports generating random numbers, free-form masking, and generic / full masking.

Data Quality

Validates simple rules against a data column and throws a warning or an error if the data doesn’t match the specified rules.

Hashing phone numbers, and masking addresses with a data pipeline

Data Masking

To ensure sensitive data is protected data pipelines offer two components for modifying data before replication takes place: data masking, and data hashing.

Data Masking

Masking data takes place by transforming the original value to a non-identifiable output. Generally speaking, data masking does not attempt to generate an output that is unique in nature; if uniqueness needs to be kept, see Data Hashing. Five data masking options are available:

Default – replaces a string field with the character X; replaces a numeric field with 0; replaces a date field with 1900-01-01 and NULL for other data types

Credit Card – Exposes the last 4 digits of the credit card and replaces all other characters with X

Email – Exposes the first character, and replaces the remainder with XXX@XXX.com

Random Number – Replaces a numeric field with a random value based on the data type

Custom String – Exposes the first few and last characters, as desired, and places a constant string in the middle as provided

Masking the address field with custom transformation

Data Hashing

Hashing data is also a data transformation mechanism, in which the source column is considered sensitive, but should be modified in a way that each hash is unique and consistent for a given input. For example, a Social Security number may be a good candidate for hashing: every hash value will be unique given the original SSN, and will not change with future replications taking place as long as the source SSN hasn't changed.

Hashing allows organizations to implement testing routines on known records without knowing the actual value of the production record. For example, the phone number '28303384290' will be hashed as '9A091F5CC202E565F72C17F2D84C6042' using an MD5 algorithm. As long as the phone number doesn't change, the hash value will always be the same.

Hashing two sensitive fields: phone numbers and addresses

Conclusion

In this blog, we reviewed how DataZen implements data masking and data hashing as part of its replication topology, regardless of the source or target system. This capability allows organizations to implement data de-identification for lower environments, such as dev/test databases, and sharing partial information with business partners.

NEW !!! INTRODUCING DATAZEN

THE ANY-TO-ANY DATA INGESTION ENGINE

LEARN MORE CONTACT US

Hybrid Data Integration

Integrate data from virtually any source system to any platform.

Automated Change Capture

Support for High Watermarks and fully automated Change Data Capture on any source system.

Code-Free Ingestion and Enrichment Pipeline

Extract, enrich, and ingest data without writing a single line of code.

To learn more about configuration options, and to learn about the capabilities of DataZen, download the User Guide now.

USER GUIDE

Implement Data Masking for Dev/Test Environments

Overview of Data Pipelines in DataZen