/howto/sharding/Overview
Sharding Overview
The Shard adapter aggregates data from multiple sources in real-time, regardless of their storage format, communication protocol, or authentication mechanism, and exposes a metadata layer that is accessible using SQL or REST commands. The conceptual diagram below shows a Shard View that aggregates data from from SalesForce, a flat file, and SharePoint Online. The aggregation mechanism is performed in parallel by Enzo, and the WHERE clause of the original request (if any) is forwarded to the underlying system whenever possible for maximum performance.
The Shard adapter provides advanced capabilities, such as:
- Multithreaded Data Aggregation: aggregates data from multiple sources using parallelism
- Heterogeneous Aggregation: merges data from disparate data sources into a single, uniform output
- Filter Logic Propagation: pushes WHERE clause logic to downstream systems for higher performance when possible
- Metadata Layer: creates a View definition that offers strong schema binding
- SQL and HTTP/S: exposes a shard through SQL and HTTP/S calls
- Hierarchical views: supports shard layering to query subsets of a shard
This article provides a simple use case of the Shard adapter: SQL Server table sharding.
The example provided here is the most basic form of sharding, and does not require advanced configuration settings. The Shard adapter is the only one required to follow the sample commands.
Database Setup
Let's first setup a few things before we can start working with the Shard adapter.
SQL Server Database
The example provided in this article uses two separate SQL Server databases, and a single table in each database. For simplicity each database will be stored on the same SQL Server instance, however this is not a requirement for the adapter to work; the Shard adapter can merge data from any number of tables stored on any number of database servers (including from different RDBMS engines).
Create Two Databases and Tables
Let's create two databases in the same SQL Server instance; we will also create a
table in each, and add a few records for testing purposes. Note that the schemas of each
table are slightly different to show how you can merge data sources with different column names.
Open SSMS, connect to your SQL Server instance
and run the following T-SQL batch:
The SQL commands in this section must be executed against a SQL Server database directly.
CREATE DATABASE [CRM-FL] GO CREATE DATABASE [CRM-CA] GO CREATE TABLE [CRM-FL].dbo.[Customers] ( ID int identity(1,1) NOT NULL, [Name] nvarchar(50) NOT NULL, [UserId] nvarchar(100) NOT NULL, Active bit NULL ) CREATE TABLE [CRM-CA].dbo.[PastCustomers] ( RecordId int identity(1,1) NOT NULL, [FullName] nvarchar(50) NOT NULL, [UserId] nvarchar(100) NOT NULL, CustomerSince int NULL ) INSERT INTO [CRM-FL].dbo.Customers VALUES ('user1', 'user1', 1), ('user2', 'user2', 1), ('user3', 'user3', 1), ('user4', 'user4', 1), ('user5', 'user5', 1) INSERT INTO [CRM-CA].dbo.PastCustomers VALUES ('pastuser1', 'pastuser1', 2019), ('pastuser2', 'pastuser2', 2019), ('pastuser3', 'pastuser3', 2018), ('pastuser4', 'pastuser4', 2017), ('pastuser5', 'pastuser5', 2015)
The above script will create a few records in two different tables. We can now view the records with the following command:
SELECT * FROM [CRM-FL].dbo.Customers SELECT * FROM [CRM-CA].dbo.PastCustomers
Enzo Configuration
Create Connection Strings
Now that the two databases have been created,
let's create a Connection Strings in Enzo that points to the SQL Server instance
that holds the two databases. The simplest way is to use Enzo Manager
to create the Connection String
as explained in this documentation, and
name the connection string sqlserverdata.
Alternatively, you can open SSMS, connect to Enzo directly and execute
this command (be sure to set your database connection string correctly):
The SQL commands in this section must be executed against Enzo directly. See the Quick Start section for details on how to connect to Enzo.
exec Sys.Vault.ConnectionStringCreate 'sqlserverdata', 'server=localhost;uid=sa;pwd=YOUR_PWD'
Configure the Shard Adapter
Finally let's configure the Shard adapter. This can be done from the Enzo Manager or using SQL commands. Let's use SQL commands to configure our shard. Open another connection to Enzo using SSMS (make sure you connect to Enzo directly), and run the following script:
EXEC shard._configCreate 'crmfl', -- the name of this data pointer 0, -- not a default configuration setting 'data', -- a data pointer 'sqlserverdata', -- name of the connection string 'SELECT ID, Name, UserId FROM [CRM-FL].dbo.Customers', -- the actual SQL command to execute 'breadcrumb' -- request a breadcrumb to be added automatically (column: __source__)
The above script creates a pointer to the first table, and is called crmfl;
this is not yet the shard per se.
We have defined a data entry point (a.k.a. data pointer) that fetches data
from the [CRM-FL] database. The breadcrumb option adds an extra column
called __source__; it will contain the name of the shard pointer so you can
identify which data source the data is coming from.
Let's do the same using the second database, except that we will select columns
with an alias so that the second output matches column names from the first table; we will name
this data pointer crmca.
EXEC shard._configCreate 'crmca', -- the name of this data pointer 0, -- not a default configuration setting 'data', -- a data pointer 'sqlserverdata', -- name of the connection string 'SELECT RecordId as ID, FullName as [Name], UserId FROM [CRM-CA].dbo.PastCustomers', -- the actual SQL command to execute 'breadcrumb' -- request a breadcrumb to be added automatically (column: __source__)
We now have two data pointers; we can query each pointer individually using EXEC commands at this time to make sure we obtain the right data:
EXEC shard.ExecShard 'crmfl' EXEC shard.ExecShard 'crmca'
Create the Shard
The next step is to create the actual shard, which will use the previously created data pointers. The main difference is the connection string: instead of specifying the name of a database Connection String, we are listing the data pointers separated by a comma, and specify this entry as a shard.
EXEC shard._configCreate 'crm', -- the name of this shard pointer 0, -- not a default configuration setting 'shard', -- shard entry point 'crmfl,crmca', null, 'breadcrumb' -- keep the breadcrumb
We can now fetch data from both data pointers, through the crm shard created:
EXEC shard.execshard 'crm'
Create and Use an Enzo View
Last but not least, to make this shard easier to query we will create an Enzo View that will point to the crm shard we created previously. The main objective of the Enzo View is to define a firm metadata schema on the shard, and make it available to SELECT commands (including Linked Server connections). The name of the columns is specified in a C# notation below; the string data type is assumed when the data type is missing. Last but not least, creating an Enzo View requires passing the actual command we want to execute (in this case, we want to execute the ExecShard command we just ran above).
EXEC shard._createVirtualTable@ShardTable 'vCRM', 'crm', 'int ID,Name,UserId,__source__', 'EXEC shard.execshard ''crm'''
Run the Shard
Now that an Enzo View has been created, you can run SELECT commands against the Shard directly. The Enzo View called vCRM runs the ExecShard command behind the scenes.
SELECT * FROM shard.vCRM
You can also specify a WHERE clause to filter your data. Because each data pointer runs a SQL command against a relational database, the WHERE clause if forwarded to each data pointed to filter the data at the source for higher performance.
SELECT * FROM shard.vCRM WHERE userId like '%1'
Ad-Hoc Shard Commands
We saw how to build a shard structure with data pointers and a shard pointer and how to execute the default SQL command against the shard. In addition to executing the shard using its configured SQL commands, it is possible to execute an ad-hoc SQL command as long as each data source supports the command format. In this example, the data pointers are all SQL Server databases, so we can send an ad-hoc T-SQL command to the shard. Let's execute a generic command against the shard:
EXEC shard.execshardsql 'crm', 'SELECT * FROM sys.tables'
The above SQL command will return all available tables from each database.