
Azure Data Lake Store

Applies to DSS 4.2.x and above | May 04, 2018

This article presents different options to connect to Azure Data Lake Store, and use data stored there directly from Dataiku DSS.

Overview

Microsoft Azure Data Lake Store (ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.

ADLS offers fully HDFS-compatible APIs, and can therefore be used, for example, as an “HDFS” Dataset in Dataiku DSS (other options are also available). See below for more details.

Prerequisites

You will need an Azure account and an Azure Data Lake Store account. Please refer to Azure documentation to get started.

Using ADLS from Azure HDInsight

Azure HDInsight is a fully managed Hadoop/Spark cluster running on Azure, and configured to connect and interact natively with ADLS.

Dataiku DSS can be installed on an HDInsight edge node, and thus can benefit from the native ADLS integration. It is possible to use:

  • an HDInsight cluster provisioned with ADLS as the primary storage
  • an HDInsight cluster provisioned with ADLS as an additional storage

From the Azure portal, create a new HDInsight cluster, then select your preferred storage option for ADLS. The following example corresponds to ADLS configured as the primary storage system:

Configuring ADLS in HDInsight

Wait for the cluster to be created and DSS to be configured. When ready, log into DSS.

Since ADLS is fully HDFS-compatible, it can be used as an “HDFS” Dataset in DSS.

If ADLS has been configured as the primary storage of the cluster, the automatically generated DSS Connection called “hdfs_root” will let you browse the cluster “internal” filesystem stored on ADLS:

Browsing cluster filesystem

In many cases though, the data and files of interest will be stored in separate ADLS directories, disjoint from the internal cluster filesystem, as illustrated in the picture below:

Browsing the ADLS account

To access data stored in ADLS outside of the HDInsight cluster's internal filesystem:

  • Create a new “HDFS” dataset (this step and the next can also be scripted, as sketched after this list):

    New HDFS Dataset

  • Enter the root path URI of the ADLS directory you want to use (it can be found in the ADLS Data Explorer in the Azure portal):

    Configuring the connection

    For ADLS, the URI resembles the following scheme:

    adl://your-adls-account.azuredatalakestore.net/directory-name
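
The same dataset can also be created programmatically through the DSS public API. The following is a minimal sketch using the dataikuapi client; the host, API key, project key, connection name, dataset name and format parameters are placeholders, not values prescribed by this article:

    # Minimal sketch: create an "HDFS" dataset pointing at an ADLS directory.
    # All names below (host, key, project, connection, dataset) are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    dataset = project.create_dataset(
        "my_adls_dataset",             # dataset name (hypothetical)
        "HDFS",                        # dataset type
        params={
            "connection": "adls",      # name of the ADLS ("HDFS") connection
            "path": "/directory-name"  # directory within the connection root
        },
        formatType="csv",
        formatParams={"separator": ",", "style": "excel", "parseHeaderRow": True}
    )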

At this point, connectivity to ADLS is configured, and you will be able to create DSS Datasets by browsing the files in the Connection.

Please note that it can be a good practice to:

  • configure a “read-only” ADLS (“HDFS”) connection pointing at the root path URI of the ADLS main data repository
  • configure a “read/write” ADLS (“HDFS”) connection pointing at a specific directory that DSS can use to write and store its Datasets (a scripted sketch of both connections follows)
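
A hedged sketch of creating both connections through the dataikuapi client is shown below. The connection names and the “root” parameter key are assumptions and may differ across DSS versions; the read-only restriction itself is enforced in the connection's settings:

    # Sketch only: the "root" parameter key and connection names are assumptions.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")

    # Read-only connection pointing at the main ADLS data repository
    client.create_connection(
        "adls_data_ro", "HDFS",
        params={"root": "adl://your-adls-account.azuredatalakestore.net/data"}
    )

    # Read/write connection pointing at a directory dedicated to DSS datasets
    client.create_connection(
        "adls_dss_rw", "HDFS",
        params={"root": "adl://your-adls-account.azuredatalakestore.net/dss"}
    )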

From there, users can build Flows mixing data stored in ADLS with jobs executed by HDInsight (Spark or MapReduce):

An ADLS and HDI Flow
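
As an illustration, a PySpark recipe in such a Flow can read an ADLS-backed dataset and write its result back through the standard DSS Spark integration. This is a minimal sketch; the dataset and column names are hypothetical:

    # PySpark recipe sketch: dataset and column names are hypothetical.
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the input dataset (stored on ADLS) as a Spark DataFrame
    df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("adls_input"))

    # Any Spark transformation, executed on the HDInsight cluster
    df_filtered = df.filter(df["amount"] > 0)

    # Write the result to a managed dataset on the read/write connection
    dkuspark.write_with_schema(dataiku.Dataset("adls_output"), df_filtered)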

Using the ADLS Plugin

It is also possible to access and read data stored in ADLS without going through HDInsight and an “HDFS” connection in DSS. This is a more “lightweight” solution for specific cases where a user wants to access small to medium datasets, with no need for scalable Spark or MapReduce jobs to process them.

To facilitate this kind of approach, we have released a DSS Plugin available in the public Plugin store: the ADLS Plugin.

This Plugin contains a custom “filesystem provider” for DSS that can be used to create a new Dataset pointing at files stored in an ADLS account. The files are read using Azure's azure-datalake-store Python library, which provides an abstraction of the WebHDFS API for ADLS and allows interacting with the filesystem.
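
To illustrate what the Plugin builds on, here is a short sketch of reading ADLS directly with the azure-datalake-store library; the tenant, client and account names are placeholders:

    # Direct ADLS access with the azure-datalake-store library.
    # Tenant, client and account names below are placeholders.
    from azure.datalake.store import core, lib

    # Authenticate with an Azure AD service principal
    token = lib.auth(
        tenant_id="your-tenant-id",
        client_id="your-client-id",
        client_secret="your-client-secret",
    )

    # The filesystem object exposes HDFS-like operations over WebHDFS
    adls = core.AzureDLFileSystem(token, store_name="your-adls-account")

    print(adls.ls("/directory-name"))            # list a directory
    with adls.open("/directory-name/data.csv", "rb") as f:
        head = f.read(1024)                      # read the first bytes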

Datasets created with this Plugin can then be processed using the DSS Streaming Engine or Python/R Recipes:

An ADLS Flow
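
For example, a Python recipe can consume such a Dataset with the standard dataiku API; the dataset names below are hypothetical:

    # Python recipe sketch: dataset names are hypothetical.
    import dataiku

    # Read the Plugin-backed dataset as a pandas DataFrame
    df = dataiku.Dataset("adls_files").get_dataframe()

    # Any pandas processing; here, a simple summary of numeric columns
    summary = df.describe().reset_index()

    # Write the result to a managed output dataset
    dataiku.Dataset("adls_summary").write_with_schema(summary)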

Detailed explanations on how to use this Plugin can be found on the Plugin documentation page, or directly in the GitHub repository storing its source code.

Final comments

  • Connecting to ADLS from HDInsight is the preferred method, as it provides full integration with DSS features and allows data to be processed at scale using Spark or MapReduce.
  • It is also possible to connect to ADLS from a non-HDInsight cluster (including on-premises clusters), by configuring your cluster properly. See for instance: