Azure Data Lake Store

Azure Data Lake Store (ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.

This DSS Plugin provides a custom DSS file system provider to read data from Azure Data Lake Store.

Important remark:
This Plugin provides a convenient way to read small- to medium-scale datasets from ADLS. To benefit from the full features of DSS, you may want to access ADLS as an "HDFS" dataset instead, as described in this article.

Screenshot: Build Flows with ADLS data.

Plugin information

Version: 1.0.0
Author: Dataiku
Released: 2018-05-04
Last updated: 2018-05-04
License: Apache Software License
Source code: Github repository

Obtaining credentials to interact with the ADLS API

To interact with the ADLS APIs, this Plugin uses "service-to-service" authentication. If you have questions about this method, please refer to the official Azure documentation. To authenticate against the ADLS APIs, the following credentials are required:

  • a Client ID (i.e. the registered App ID)
  • a Client Secret (i.e. the registered App secret key created in the Azure portal)
  • a Tenant ID (i.e. the Active Directory ID)

The App will need at least read access to the ADLS directories you want to reach.
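
For reference, here is a minimal sketch of how these three credentials are exchanged for an OAuth token with the azure-datalake-store Python library (the same library the Plugin relies on, see below); all values are placeholders, not real credentials:

    from azure.datalake.store import lib

    # Placeholder values -- replace with your own registered App details
    TENANT_ID = "my-active-directory-id"        # Tenant ID
    CLIENT_ID = "my-registered-app-id"          # Client ID
    CLIENT_SECRET = "my-registered-app-secret"  # Client Secret

    # "Service-to-service" authentication: exchange the App credentials
    # for an OAuth token that can then be used against the ADLS APIs
    token = lib.auth(
        tenant_id=TENANT_ID,
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
    )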

Dataiku DSS and Azure Data Lake Store

ADLS is a fully compatible HDFS-like file system for DSS. As such, it can be used directly with systems such as Azure HDInsight (which can be configured to automatically use ADLS as primary or secondary storage), or even with on-premises or non-Azure managed clusters (see for instance this blogpost).

The Plugin here does not require Hadoop or Spark integration to interact with ADLS. It addresses the simple case where a DSS user simply wants to connect to ADLS, browse its directories, and read data into a regular DSS Dataset for further processing. It is a "lightweight" integration for simple use cases.

The Plugin relies on the Azure azure-datalake-store Python library.
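
As an illustration of what this library does, here is a minimal sketch of connecting to a store, browsing a directory and reading a file with it; the store name, paths and credentials below are assumptions for the example, not values used by the Plugin:

    from azure.datalake.store import core, lib

    # Assumed store name and paths, for illustration only
    STORE_NAME = "mydatalakestore"

    token = lib.auth(
        tenant_id="my-active-directory-id",
        client_id="my-registered-app-id",
        client_secret="my-registered-app-secret",
    )
    adls = core.AzureDLFileSystem(token, store_name=STORE_NAME)

    # Browse a directory, then read one of its files
    print(adls.ls("/data/input"))
    with adls.open("/data/input/sales.csv", "rb") as f:
        content = f.read()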

How to use

The Plugin contains a custom "file system provider" for DSS that will let you:

  • Connect to an ADLS account
  • Browse its files
  • Create a DSS Dataset from these files

To use this Plugin, start by installing it, then:

  • Create a new Dataset and look for Azure Data Lake Store
    Screenshot: Create a new ADLS Dataset

  • Enter your credentials and browse the files on ADLS
    Screenshot: Configure your ADLS Dataset

  • Create your ADLS Dataset
    Screenshot: Start using your ADLS Dataset

You now have a regular DSS Dataset pointing at files stored on Azure Data Lake Store, and it can be used like any other DSS Dataset (in Recipes, Analyses or Flows), as sketched below.
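
For example, a Python Recipe can read this Dataset directly; the Dataset name below is hypothetical:

    import dataiku

    # "adls_sales" is a hypothetical name for the ADLS Dataset created above
    ds = dataiku.Dataset("adls_sales")
    df = ds.get_dataframe()  # loads the ADLS-backed files into a pandas DataFrame

    print(df.shape)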

Additional comments

When using this Plugin, please keep in mind that:

  • it can only read regular DSS file formats, as described in the documentation
  • it is a "read-only" Plugin: it does not support writing to ADLS
  • it is a "lightweight" connector meant to quickly access data in ADLS and process it using the DSS Streaming Engine or Python/R Recipes. If you need to process large datasets stored in ADLS, you may want to use a Hadoop or Spark cluster (such as Azure HDInsight) instead.

Additional instructions and source code are available in our Github repository.