Azure Data Lake Store

Reading data from Azure Data Lake Store (ADLS)

Azure Data Lake Store (ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.

This DSS Plugin provides a custom DSS file system provider to read data from Azure Data Lake Store.

Important remark:
This Plugin provides a convenient way to read small to medium scale datasets from ADLS. To benefit from the full features of DSS, you may want to access ADLS as an “HDFS” dataset instead, as described in this article.

Build Flows with ADLS data.

Plugin Information

Version: 1.0.1
Author: Dataiku
Released: 2018-05-04
Last updated: 2018-05-04
License: Apache Software License
Source code: GitHub repository

Obtaining Credentials To Interact With The ADLS API

To interact with the ADLS APIs, this Plugin uses “service-to-service” authentication. In case of questions, please refer to the official Azure documentation. To authenticate against the ADLS APIs, the following credentials are required:

  • a Client ID (i.e., the registered App ID)
  • a Client Secret (i.e., the registered App secret key created in the Azure portal)
  • a Tenant ID (i.e., the Active Directory ID)

The App will need at least read access to the ADLS directories you want to work with.
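For reference, here is a minimal sketch of how such credentials can be used to obtain an OAuth token with the azure-datalake-store Python library that the Plugin relies on (the tenant, client, and secret values below are placeholders to replace with your own):

    # Minimal sketch: obtain an ADLS token via "service-to-service" authentication
    # using the azure-datalake-store Python library. Credential values are placeholders.
    from azure.datalake.store import lib

    token = lib.auth(
        tenant_id="<your-tenant-id>",         # Active Directory ID
        client_id="<your-client-id>",         # registered App ID
        client_secret="<your-client-secret>"  # App secret key from the Azure portal
    )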

Dataiku DSS And Azure Data Lake Store

ADLS is a fully-compatible, HDFS-like file system for DSS. As such, it can be used directly with systems such as Azure HDInsight (which can be configured to automatically use ADLS as primary or secondary storage) or even with on-premises or non-Azure managed clusters (see for instance this blogpost).

The Plugin here does not require Hadoop or Spark integration to interact with ADLS. It addresses the simple case where a DSS user simply wants to connect to ADLS, browse its directories, and read data into a regular DSS Dataset for further processing. It’s a “lightweight” integration for simple use cases.

The Plugin relies on the Azure azure-datalake-store Python library.
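As an illustration, here is a short, self-contained sketch of the kind of calls this library exposes for browsing and reading files; the store name and file path below are hypothetical examples:

    # Minimal sketch: browse and read ADLS files with azure-datalake-store.
    # "mystore" and the file path are hypothetical examples.
    from azure.datalake.store import core, lib

    token = lib.auth(tenant_id="<tenant-id>", client_id="<client-id>",
                     client_secret="<client-secret>")
    adls = core.AzureDLFileSystem(token, store_name="mystore")

    print(adls.ls("/"))                             # list the root directory
    with adls.open("/data/clicks.csv", "rb") as f:  # read a file's raw bytes
        head = f.read(1024)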

How To Use

The Plugin contains a custom “file system provider” for DSS that will let you:

  • Connect to an ADLS account
  • Browse its files
  • Create a DSS Dataset from these files

To use this Plugin, start by installing it, then:

  • Create a new Dataset and look for Azure Data Lake Store
Create a new ADLS Dataset
  • Enter your credentials and browse the files on ADLS
Configure your ADLS Dataset
  • Create your ADLS Dataset
Start using your ADLS Dataset

You now have a DSS Dataset pointing at files stored on Azure Data Lake Store, which can be used like any regular DSS Dataset (in Recipes, Analyses or Flows).
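For example, in a Python Recipe the Dataset can be read like any other DSS Dataset (the dataset name below is hypothetical):

    # Minimal sketch of a Python Recipe reading the ADLS-backed Dataset.
    # "adls_clicks" is a hypothetical dataset name.
    import dataiku

    ds = dataiku.Dataset("adls_clicks")
    df = ds.get_dataframe()   # load the dataset into a pandas DataFrame
    print(df.head())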

Additional comments

When using this Plugin, please keep in mind that:

  • it can read regular DSS file formats only, as described in the documentation
  • it is a “read-only” Plugin; it does not support writing to ADLS
  • it’s a “lightweight” connector to quickly access data in ADLS and process it using the DSS Streaming Engine or Python/R Recipes. If you need to process large datasets stored in ADLS, you may need to use a Hadoop or Spark cluster (such as Azure HDInsight) instead.

Additional instructions and source code are available in our GitHub repository.
