Azure Data Lake Store (ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.
This DSS Plugin provides a custom DSS file system provider to read data from Azure Data Lake Store. It offers a convenient way to read small to medium-scale datasets from ADLS.
To benefit from the full features of DSS, you may want to access ADLS as an “HDFS” dataset instead,
as described in this article.
- License: Apache Software License
- Source code: Github repository
Obtaining Credentials To Interact With The ADLS API
To interact with the ADLS APIs, this Plugin uses “service-to-service” authentication. In case of questions, please refer to the official Azure documentation. To authenticate against the ADLS APIs, the following credentials are required:
- a Client ID (i.e., the registered App ID)
- a Client Secret (i.e., the registered App secret key created in the Azure portal)
- a Tenant ID (i.e., the Active Directory ID)
The App will need at least read access on the ADLS directories you want to reach.
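As a hedged illustration of how these three values fit together before you hand them to the Plugin, here is a small helper that bundles and sanity-checks them. The helper and its error message are illustrative only, not part of the Plugin:

```python
def adls_credentials(tenant_id, client_id, client_secret):
    """Bundle the three service-to-service credentials the plugin asks for,
    failing early if any of them is empty."""
    creds = {
        "tenant_id": tenant_id,          # Active Directory ID
        "client_id": client_id,          # registered App ID
        "client_secret": client_secret,  # App secret key from the Azure portal
    }
    missing = [name for name, value in creds.items() if not value]
    if missing:
        raise ValueError("Missing ADLS credentials: %s" % ", ".join(missing))
    return creds
```

Failing early on an empty value makes misconfiguration visible at connection time rather than as an opaque authentication error from the ADLS API.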
Dataiku DSS And Azure Data Lake Store
ADLS is a fully-compatible HDFS-like file system for DSS. As such, it can be used directly with systems such as Azure HDInsight (which can be configured to automatically use ADLS as primary or secondary storage) or even on-premises or non-Azure managed clusters (see for instance this blogpost).
The Plugin here does not require Hadoop or Spark integration to interact with ADLS. It addresses the simple case where a DSS user simply wants to connect to ADLS, browse its directories, and read data into a regular DSS Dataset for further processing. It’s a “lightweight” integration for simple use cases.
The Plugin relies on the Azure azure-datalake-store Python library.
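As a rough sketch of what the library does under the hood: the store name “mystore”, the path “/data”, and the credential variables below are hypothetical, the network calls are shown as comments, and only the path filter is runnable on its own:

```python
def csv_paths(listing):
    """From a directory listing, keep only CSV files, sorted for stable output."""
    return sorted(path for path in listing if path.lower().endswith(".csv"))

# With the service-to-service credentials described above, the
# azure-datalake-store library would be used roughly like this
# (store name and paths are hypothetical):
#
#   from azure.datalake.store import core, lib
#   token = lib.auth(tenant_id=TENANT_ID, client_id=CLIENT_ID,
#                    client_secret=CLIENT_SECRET)
#   adls = core.AzureDLFileSystem(token, store_name="mystore")
#   for path in csv_paths(adls.ls("/data")):
#       with adls.open(path, "rb") as f:
#           ...  # stream the file contents into DSS
```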
How To Use
The Plugin contains a custom “file system provider” for DSS that will let you:
- Connect to an ADLS account
- Browse its files
- Create a DSS Dataset from these files
To use this Plugin, start by installing it, then:
- Create a new Dataset and look for Azure Data Lake Store
- Enter your credentials and browse the files on ADLS
- Create your ADLS Dataset
You now have a regular DSS Dataset pointing at files stored on Azure Data Lake Store; it can be used like any other DSS Dataset (in Recipes, Analyses or Flows).
When using this Plugin, please keep in mind that:
- it can only read the regular DSS file formats, as described in the documentation
- it is a “read-only” Plugin: it does not support writing to ADLS
- it is a “lightweight” connector meant to quickly access data in ADLS and process it using the DSS Streaming Engine or Python/R Recipes. If you need to process large datasets stored in ADLS, you may need to use a Hadoop or Spark cluster (such as Azure HDInsight) instead.
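In a Python Recipe, one way to keep processing lightweight is to stream the dataset in chunks instead of loading it whole. A sketch, assuming a Dataset named "adls_dataset" (hypothetical); the aggregation helper is runnable on its own, while the dataiku calls are shown as comments:

```python
def total_rows(chunks):
    """Sum row counts over an iterable of dataframe-like chunks."""
    return sum(len(chunk) for chunk in chunks)

# Inside a DSS Python Recipe, the chunks would come from the dataiku API:
#
#   import dataiku
#   ds = dataiku.Dataset("adls_dataset")  # hypothetical dataset name
#   print(total_rows(ds.iter_dataframes(chunksize=100000)))
```

Iterating by chunks keeps memory usage bounded by the chunk size, which matters when the files on ADLS are larger than the memory available to the Recipe.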
Additional instructions and source code are available in our Github repository.