How to Read and Explore CSV Files from SQL Server

July 2020
Enzo Server



In this blog we will review in detail how you can easily read and explore flat files (CSV Files) using SQL commands, from SQL Server Management Studio (SSMS), regardless of where they are stored, without the need to import the data first.



Exploring/Reading CSV Data

Perhaps the most fundamental first step before importing CSV data is the ability to explore flat file content easily and quickly, without having to import the file first into database tables. You can explore most flat files using Excel directly; however, power users need the flexibility of the SQL language to query data, and explore content in a dynamic way, without having to know much about the content of the file beforehand.

With Enzo, you can explore flat file data using SQL commands without having to load the data in a database table first. This gives you the ability to inspect flat file data directly using SELECT statements, even if the file is stored on a remote system.

Enzo allows you to query files located on Azure Blobs, S3 Buckets, Dropbox, GoogleDrive, FTP sites, OneDrive, OneDrive for Business, SharePoint Document Libraries, local files, Egnyte (coming soon) and HTTP servers. You can also read the content of Excel documents and GoogleSheets as rows and columns using the Xls and GoogleSheet adapters. However these are not technically CSV files, so they are not covered in this blog.

The following screenshot shows you two SQL commands that read flight data from a flat file that was previously exported as a flat file to an FTP server, and another one located in a SharePoint Document Library.

NOTE: We explore saving CSV file data and importing CSV data in SQL Server in other blog posts.

Let’s explore the SQL syntax:

SELECT * FROM CSV.Data@generic WHERE path = 'ftp://mia-arrived.txt'







One way to read flat files with Enzo is to connect to Enzo directly from SQL Server Management Studio (SSMS) and use the SELECT command on the CSV.data@generic table. CSV is the name of the adapter running in Enzo, and data is the table name of the method that can read flat files. Notice the special @ notation that specifies the CSV configuration named ‘generic’ will be used when reading the file.

Generally speaking, a CSV configuration setting specifies the layout of the file, and optionally the list of expected headers with data types. When headers are not specified, they are inferred automatically with a “string” data type.



NOTE: While not specifying headers (Column Names) provides maximum flexibility for reading and discovering flat files, it may be necessary to build the headers (Columns tab) when querying a file through Linked Server calls or to improve performance. See the Linked Server section for more details.







The 'generic' configuration setting in this example expects tab delimiters, a header row, and no quoted fields. These are the most important settings of a configuration setting.

Optionally, you can specify a sample file (Default File Path) to make it easier for Enzo to identify the expected headers. If the path argument in the previous SQL command is not specified, Enzo will read the data from the file specified in Default File Path.

The path parameter is used to indicate the location of the file to read from when the Default File Path is not provided, or when you would like to override the Default File Path. More details on how to construct this parameter later.

As you can see from the SQL commands provided, Enzo allows you to specify both the CSV file format and the file location dynamically. In other words, you can start exploring file content quickly regardless where they are stored. Behind the scenes, Enzo first downloads the file to a local temporary file, extracts its content dynamically, and deletes the local temporary file when the operation is complete. Because the content of the file could change at any time, Enzo will download the file at every request.







While this is may work well for smaller files, it could be slow for larger ones; we will review large file access shortly.

Note that if you do not have SQL Server Management Studio installed locally, you can use Enzo Manager’s Data Explorer to query Enzo. Data Explorer is found under the Tools menu; multiple windows can be opened.



Exploring without a CSV Configuration

In some cases you may want to explore a raw file without knowing ahead of time the kind of data you may find. To account for this scenario, you can use a dynamic definition specified as a JSON document, or use Session Variables.

NOTE: Dynamic definitions without a CSV configuration does not work through Linked Server connections.

To use a dynamic definition as a JSON document, simply provide a JSON string as the first parameter. Enzo will understand this isn’t a normal CSV definition, and instead will use the settings provided in the JSON file. In the example below, the file will be read assuming a header row exists, with a tab delimiter, and no quoted identifiers.

SELECT * FROM CSV.Data WHERE 
  definitionName = '{"FirstRowNames":true, "Delimiter":"\t", "QuotedText":false}'
  AND path = 'ftp://mia-arrived.txt'
        

The following parameters can be specified as part of the JSON document (JSON documents are case-sensitive):

FirstRowNames True when the first row contains headers boolean
QuotedText True when values are surrounded by double-quotes boolean
Delimiter The character used as the field delimiter string
MaxRows Maximum number of rows to read int
MaxErrors Maximum number of errors before stopping a read operation int
ForceStringDataTypes True when all data types should be forced as a string boolean
Sample Number of rows to read to validate inferred data types int
secondaryDefinition The secondary configuration setting to use when accessing other adapters string






Another option, sometimes easier to use, is to set Session Variables ahead of time. Session Variables only work during the session established, which is not available through Linked Server. For example, the following sets the same settings as those used in the JSON document; the difference is that since neither a configuration setting is provided, nor a dynamic JSON document, the session variables are used instead. The Session Variables only need to be set once for a login session.

EXEC CSV.SetVar 'DefaultFirstRowNames', 1
EXEC CSV.SetVar 'DefaultColumnDelimiters', '\t'
EXEC CSV.SetVar 'DefaultQuotedText', 0

SELECT * FROM CSV.Data WHERE path = 'ftp://mia-arrived.txt'



The following Session Variables are available:

DefaultFirstRowNames True when the first row contains headers boolean
DefaultQuotedText True when values are surrounded by double-quotes boolean
DefaultColumnDelimiters The character used as the field delimiter string
DefaultMaxRows Maximum number of rows to read int
DefaultMaxErrors Maximum number of errors before stopping a read operation int
DefaultStartingRow Number of rows to skip when reading the file int
ForceStringDataTypes True when all data types should be forced as a string boolean
Sample Number of rows to read to validate inferred data types int




Local Cache

When working with larger files, or when access speed matters most, it may be best to download the remote CSV file locally first using Enzo, and then query the file locally. There are two primary ways to do this:

  • Download the File: use Enzo to download a temporary file and query the local file
  • Create an Enzo snapshot: the snapshot can be refreshed periodically

NOTE: Creating a snapshot is a more advanced topic, and is not covered in this blog.

To download a temporary copy of the file, use the adapter’s download handler directly. For example, call the DownloadFile handler on the FTP adapter; if you do not specify a file name, a temporary file will be created. Some adapters may have a different handler to perform the download operation.



The operation returns the name of the file that was downloaded, along with its full path. Once the remote file has been downloaded locally, you can run the SELECT command on the temporary file directly. You do not need to provide the full path to this command; if not provided, it will assume that the file is located in the internal Enzo TEMP directory.



To delete a temporary file, use the DELETE SQL command using the Files adapter (the full path to the file is required for this call):

DELETE FROM Files.files WHERE
            fspec = 'C:\Users\hroggero\AppData\Local\Temp\ceea8f50-215a-4069-9afb-7207ee772e7c.tmp'



Using the Path Argument

One of the main benefits of using the SQL command described previously is that you can specify the location of the flat file dynamically using the path argument, regardless of where the file is stored.

The path argument accepts a URI notation in this format:

system[@container]://path/.../filename.extension 
Or
\\server\path\...\filename.extension
Or
C:\tmp\...\filename.extension

In the above, system can be any of the following: sharepoint, ftp, dropbox, azurestorage, s3, onedrive365, onedrive, googledrive, egnyte (coming soon) or HTTP. Except for HTTP, system is the name of the adapter that will be used to read the file; as a result it is necessary to have the adapter installed through the Market Place. Most adapters must first be configured so that Enzo can call the remote system with the appropriate credentials.

The container argument is needed for sharepoint, azurestorage and s3; it is not accepted by other systems. For SharePoint, this represents the SharePoint Document Library to use; for Azure Storage this is the name of the Blob Container (the $root container will be assumed if not provided), and for S3 the Bucket name. The remainder of the path argument is an absolute path from the root folder.







As an example, the FTP adapter is configured with access to an FTP server, and specifies the security credentials. These credentials are securely stored in Enzo in an encrypted format.

The name of the FTP configuration setting is called defaultPROD in this example; note that the (*) next to it means it is the default setting for the ‘sa’ login.

If you have multiple FTP configuration settings, and would like to use a setting called ftpconfigTest with the SQL command, and it is not marked as the default setting, you can do so by adding it after the name of the CSV setting, separated by a comma, as such (note that you need to use the square brackets since this represents the full object name):

SELECT * FROM CSV.[Data@generic,ftpconfigTest] WHERE path = 'ftp://mia-arrived.txt'

As we can see from this SQL command, it becomes simple to read from any flat file, and specifying both the flat file specification and the storage system configuration setting dynamically.



Zip File Access

In many cases, flat files are stored compressed in Zip documents. Zipped documents can store different files, contain a directory structure, and require a password for access. Enzo allows you to read from compressed documents and folders directly; the name of the file must end with a ZIP extension.







One of the key features of reading from a Zip document is the ability to search content in multiple files, using a search file pattern. For example, the following query will return data from three CSV files in a single call, from a Zip document stored in Dropbox, by specifying the filePattern argument. If a password is necessary to open the file, it should be stored in the CSV Configuration Setting directly.

SELECT * FROM CSV.Data@firestations WHERE 
	path='dropbox://Personal/usfa-files.zip'
	AND filePattern = '*.txt'



Linked Server Access

The above examples show you how to read flat files when connecting directly to Enzo Server. However, you may need to read flat files using Linked Server when accessing files from a trigger, or a database view. You can access CSV files through a Linked Server connection to Enzo using two mechanisms:

  • EXEC command (ex: EXEC [localhost,9550].bsc.CSV.ReadCSV ‘generic’, …)
  • SELECT command (ex: SELECT * FROM [localhost,9550].bsc.csv.read@generic … )







Due to some of the limitations imposed by Linked Server, either a Default File Path value is provided so that column names can be extracted, or the list of Columns must be specified as part of the configuration setting. Whenever possible, specifying Columns is preferred for performance reasons and to enforce data types.

You can quickly add Columns to a configuration setting by letting Enzo parse the file; data types are automatically inferred whenever possible; they can also be modified at any time. The sampling option tells Enzo to try a number of records to test the data types identified in the first record; if another data type is found, the column will be set to varchar(max) automatically.







In this example, the arrivedMIAFTP configuration setting has been configured with columns. When columns are added to a configuration setting, or when a Default File Path is provided, you can access the CSV file through Linked Server calls.

In this example, Enzo is running locally and has been registered as a Linked Server called localhost,9550. For more information about how to setup Linked Server with Enzo, see this help documentation.



Conclusion

This blog demonstrates how to read and explore CSV flat files stored in remote systems, using SQL commands in SQL Server Management Studio. Options include the ability to query remote files dynamically, compressed documents, download files locally for performance, and query through Linked Server connections to Enzo.

Enzo offers many options for quickly and easily exploring flat files using familiar tools such as SQL Server Management Studio or Enzo Data Explorer, part of Enzo Manager.

To try Enzo at no charge, download the latest edition of Enzo Server on our website.