Snowflake Plugin

Snowflake is a data warehouse built for the Cloud, offering the following characteristics:

  • Performance: Snowflake easily scales to multiple petabytes and performs up to 200x faster than other systems.
  • Concurrency: multiple groups can access the same data at the same time without impacting performance.
  • Simplicity: a fully managed, pay-as-you-go solution that stores, integrates and analyzes all your data.

Snowflake is built on top of the Amazon Web Services (AWS) cloud.

This DSS Plugin offers the ability to quickly load data stored in S3 into Snowflake.

[Screenshot: a Dataiku DSS Flow using Snowflake. Process your DSS Datasets with Snowflake.]

Plugin information

Version: 0.1.1
Author: Dataiku
Released: 2018-05-22
Last updated: 2018-05-22
License: Apache Software License
Source code: Github repository

Dataiku and Snowflake

Dataiku and Snowflake are two complementary solutions. Snowflake is a high-performance, scalable data warehouse optimized for analytics workloads, and it can be used by DSS as a "backend" computation engine for machine learning workflows.

A typical ML project lifecycle could be:

  • users have raw data stored in Amazon S3 (the "data lake")
  • they process this data and push it to Snowflake using Dataiku DSS
  • they perform initial data exploration (visualization, descriptive statistics...) using in-database processing (Dataiku DSS pushes the calculations down to Snowflake) over entire Datasets, with no sampling required (see the sketch after this list)
  • they perform feature engineering (i.e. aggregating and reshaping the raw data to make it usable by ML algorithms) using either DSS "Visual" or "Coding" recipes, which again are fully pushed down to Snowflake as SQL statements
  • they train and deploy an ML model, and the results (for example, the scores of a predictive model) are stored in Snowflake (or back in S3) in an automated way
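
To give a concrete flavor of this pushdown, the sketch below shows how an aggregation can be delegated to Snowflake from a DSS Python notebook or recipe through the DSS SQL executor. The connection name, table name and columns are placeholders, and this snippet is an illustration of the idea rather than part of the Plugin.

from dataiku import SQLExecutor2

# "snowflake" is a hypothetical DSS connection name pointing to a Snowflake database
executor = SQLExecutor2(connection="snowflake")

# The aggregation runs entirely inside Snowflake; only the (small) result
# comes back to DSS as a pandas DataFrame
df = executor.query_to_df("""
    SELECT customer_id, COUNT(*) AS nb_orders, SUM(amount) AS total_amount
    FROM "ORDERS"
    GROUP BY customer_id
""")
print(df.head())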

Dataiku already provides a Snowflake connector. This connector uses JDBC and lets the user easily read data stored in Snowflake and run Visual or Coding Recipes. One caveat is that, in its current form, the connector may suffer from performance issues when writing data back into Snowflake, as it uses regular "INSERT INTO" statements.

This Plugin emulates a DSS "Sync" Recipe but leverages Snowflake's built-in mechanism for fast bulk loading ("COPY INTO") of data stored in Amazon S3 (using an S3 DSS Dataset).
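
To give an idea of what happens under the hood, the sketch below (not the Plugin's actual code) shows how such a COPY INTO statement can be issued from Python with the Snowflake connector; all connection parameters, table and bucket names are placeholders.

import snowflake.connector

# Placeholder credentials and object names, for illustration only
conn = snowflake.connector.connect(
    user="dss_user",
    password="********",
    account="your_account",
    warehouse="your_warehouse",
    database="your_database",
    schema="PUBLIC",
)

# Bulk-load CSV files from an S3 prefix into a Snowflake table
copy_statement = """
    COPY INTO "MY_OUTPUT_TABLE"
    FROM 's3://my-bucket/path/to/dataset/'
    CREDENTIALS = (AWS_KEY_ID='your-aws-access-key' AWS_SECRET_KEY='your-aws-secret-key')
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
"""

try:
    conn.cursor().execute(copy_statement)
finally:
    conn.close()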

Prerequisites

To use this Plugin, you will need:

  • a Snowflake connection configured in DSS
  • an Amazon S3 connection (and an S3 Dataset) holding the data to load
  • the AWS Access Key and Secret Key of the corresponding S3 bucket

This Plugin comes with a dedicated Python code environment that manages the dependency on the Snowflake Python connector. Note that this connector may require the following Linux libraries to be present on the host machine:

  • libssl-dev
  • libffi-dev

Finally, the Plugin has been tested with Python 3.6 and requires a valid Python 3.6 installation on the machine (the Plugin code environment is restricted to Python 3.6).

How to use

The Plugin contains a custom Python Recipe emulating a DSS Sync Recipe: it takes an S3 Dataset as input and bulk-loads it into a Snowflake connection using a COPY INTO statement.

  • Install the Plugin
  • Create an "S3" Dataset (i.e. a Dataset that has been written into an S3 connection)
  • Add a custom Snowflake recipe taking the S3 Dataset as input, and outputting a DSS Dataset in your Snowflake connection
    [Screenshot: Snowflake Plugin input/output settings in the UI]

  • Enter your AWS credentials (in the Plugin UI or in DSS Variables)
    [Screenshot: Snowflake Plugin direct settings in the UI]

  • Run the recipe

The Plugin requires two input parameters: the AWS Access Key and the AWS Secret Key of the S3 bucket used with the Plugin. These parameters must be either:

  • Set in the Plugin UI (see screenshot above)
  • Defined as DSS Project or Global Variables

When the credentials are stored in DSS Variables, the Plugin expects them to be in the following format:

{
  "snowflake": {
    "aws_access_key": "your-aws-access-key",
    "aws_secret_key": "your-aws-secret-key"
  }
}

If the credentials are left blank in the Plugin UI, DSS will look first in the Project Variables, then in the Global Variables.
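
For illustration, here is a minimal, hypothetical sketch of that lookup order in Python; the real Plugin reads its configuration and the DSS Variables through the DSS APIs, and the function and argument names below are made up for the example.

def resolve_aws_credentials(plugin_config, project_variables, global_variables):
    """Return (access_key, secret_key), trying the Plugin UI first, then Project, then Global Variables."""
    candidates = (
        plugin_config,                           # values typed in the Plugin UI
        project_variables.get("snowflake", {}),  # "snowflake" block in the Project Variables
        global_variables.get("snowflake", {}),   # "snowflake" block in the Global Variables
    )
    for source in candidates:
        access_key = source.get("aws_access_key")
        secret_key = source.get("aws_secret_key")
        if access_key and secret_key:
            return access_key, secret_key
    raise ValueError("AWS credentials not found in the Plugin UI, Project Variables or Global Variables")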

Once the Recipe has run, a new Dataset should have been created with its data stored in Snowflake (this can be checked by looking at the Dataset settings, using a SQL Notebook, or directly in the Snowflake UI).

[Screenshot: Snowflake Dataset settings]
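
In addition to the checks above, a quick sanity check can also be run from a DSS Python notebook; the Dataset name below is a placeholder, and for very large tables an in-database count in a SQL Notebook is preferable to loading everything into pandas.

import dataiku

# Hypothetical name of the output Dataset created by the Plugin
output = dataiku.Dataset("orders_snowflake")

# Read the Dataset back into a pandas DataFrame (DSS fetches it from Snowflake)
df = output.get_dataframe()
print(df.shape)
print(df.head())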

Additional instructions are available in our Github repository.