This article presents the different options for connecting to Azure Data Lake Store (ADLS) and using the data stored there directly from Dataiku DSS.
Microsoft Azure Data Lake Store (ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.
ADLS offers fully HDFS-compatible APIs, and as such can, for example, be used as an “HDFS” Dataset in Dataiku DSS (although other options are available). Please see below for more details.
You will need an Azure account and an Azure Data Lake Store account. Please refer to Azure documentation to get started.
Azure HDInsight is a fully managed Hadoop/Spark cluster running on Azure, and configured to connect and interact natively with ADLS.
Dataiku DSS can be installed on an HDInsight edge node, and thus can benefit from the native ADLS integration. It is possible to use:
From the Azure portal, create a new HDInsight cluster, then select your preferred storage option for ADLS. The following example corresponds to ADLS configured as the primary storage system:
Wait for the cluster to be created and DSS to be configured. When ready, log into DSS.
Since ADLS is fully HDFS-compatible, it can be used as an “HDFS Dataset” in DSS.
If ADLS has been configured as the primary storage of the cluster, the automatically generated DSS Connection called “hdfs_root” will let you browse the cluster “internal” filesystem stored on ADLS:
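Under the hood, HDFS-compatible access to ADLS (Gen1) goes through the adl:// URI scheme. As a minimal sketch, the helper below builds such a URI; the account name and path are placeholders, not values from your cluster:

```python
# Sketch: how an HDFS-compatible path into ADLS (Gen1) is formed.
# "mydatalakestore" is a placeholder store account name.

def adls_uri(store_account: str, path: str) -> str:
    """Build an adl:// URI that HDFS-compatible tools (for instance the
    root of a DSS HDFS connection) can use to address a file or folder
    in an ADLS account."""
    return "adl://{}.azuredatalakestore.net/{}".format(
        store_account, path.lstrip("/"))

print(adls_uri("mydatalakestore", "/clusters/data"))
# adl://mydatalakestore.azuredatalakestore.net/clusters/data
```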
In many cases though, the data and files of interest will be stored in separate ADLS directories, disjoint from the internal cluster filesystem, as illustrated in the picture below:
To access data stored in ADLS outside of the HDInsight cluster’s internal filesystem:
At this point, connectivity to ADLS is configured, and you will be able to create DSS Datasets by browsing the files in the Connection.
Please note that it can be a good practice to:
From there, the user can build Flows mixing data stored in ADLS with jobs executed on HDInsight (Spark or MapReduce):
It is also possible to access and read data stored in ADLS without going through HDInsight and an “HDFS” connection in DSS. This is a more lightweight solution for specific cases where a user wants to access small to medium datasets, with no need for scalable Spark or MapReduce jobs to process them.
To facilitate this kind of approach, we have released a DSS Plugin available in the public Plugin store: the ADLS Plugin.
This Plugin contains a custom “filesystem provider” for DSS that can be used to create a new Dataset pointing at files stored in an ADLS account. The files are read by leveraging the azure-datalake-store Python library from Azure, which provides an abstraction over the WebHDFS API for ADLS and allows interaction with the filesystem.
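To illustrate what the azure-datalake-store library does, here is a minimal sketch of authenticating with a service principal and reading a file. The tenant ID, client ID, secret, store name and file path are all placeholders; the import is deferred so the sketch can be inspected without the library installed:

```python
def read_adls_file(store_name, path, tenant_id, client_id, client_secret,
                   nbytes=1024):
    """Read the first `nbytes` of a file from an ADLS account using the
    azure-datalake-store library (the same library the ADLS Plugin
    relies on). All credential arguments are placeholders for a real
    Azure service principal with access to the store."""
    from azure.datalake.store import core, lib

    # Obtain an OAuth token for the service principal.
    token = lib.auth(tenant_id=tenant_id,
                     client_id=client_id,
                     client_secret=client_secret)

    # Open an HDFS-like client on the store and read the file.
    adls = core.AzureDLFileSystem(token, store_name=store_name)
    with adls.open(path, "rb") as f:
        return f.read(nbytes)
```

The returned client also exposes familiar filesystem operations such as `ls()` and `glob()`, which is what allows the Plugin to enumerate files when building a Dataset.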
Datasets created from this Plugin can then be processed using the DSS Streaming Engine or Python/R Recipes:
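As a sketch of the Python Recipe side, the snippet below reads such a Dataset into a pandas DataFrame and writes it back to an output Dataset. The dataset names are placeholders, and the `dataiku` package is only available inside DSS, so the import is deferred:

```python
def run_recipe():
    """Minimal DSS Python recipe consuming a Dataset created from the
    ADLS Plugin. "adls_files" and "adls_files_prepared" are placeholder
    input/output dataset names; this only runs inside DSS, where the
    `dataiku` package is available."""
    import dataiku

    src = dataiku.Dataset("adls_files")
    df = src.get_dataframe()          # load the ADLS-backed data as a pandas DataFrame

    out = dataiku.Dataset("adls_files_prepared")
    out.write_with_schema(df)         # write the result to the output dataset
```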