
# Dataiku and Microsoft HDInsight Integration

March 13, 2017

Microsoft HDInsight is a fully managed Hadoop and Spark cloud service running on Microsoft Azure. HDInsight offers the ability to run third-party applications using a quick, integrated deployment process, right from the Azure portal.

In this tutorial, we’ll help you get started with running Dataiku on top of HDInsight.

## Prerequisites ##

You need a Microsoft Azure account, with sufficient permissions to access the Azure portal and manage HDInsight clusters. If you do not have an account yet, please follow these instructions to sign up.

## Installing Dataiku on HDInsight ##

Dataiku is available as a third-party application directly from the HDInsight Azure portal, and can be installed either at cluster creation time or after the cluster has been created.

### When creating a new HDInsight cluster ###

The process of installing Dataiku on a new HDInsight cluster is straightforward:

  • From the Azure portal, click the large green “plus” sign and search for HDInsight on the Azure Marketplace. Click “Create”.
  • Select the “Custom” view of the HDInsight settings; an “Applications” tab will appear.
  • Fill in your cluster credentials, and select the appropriate cluster type. Dataiku is available for “Hadoop” and “Spark” clusters, with HDI 3.4 or 3.5, and Spark 1.6.
  • Under the Storage section, select “Azure Storage” as the primary storage type, then select your storage account and default container. Azure Data Lake is not supported.
  • Under “Applications”, select “DSS on HDInsight”, accept the legal terms, and click “Next”.
  • Fill in the remaining cluster settings (cluster size and advanced settings), then create your cluster.

[Figure: Selections when creating a new HDInsight cluster]

Spinning up the cluster and configuring Dataiku may take approximately 30 to 50 minutes, depending on your settings. Once the process is complete, you will be taken to the HDInsight main page.

### For an existing HDInsight cluster ###

The installation process for Dataiku is also available for an existing cluster. You’ll just need to go to the “Applications” tab on your HDInsight cluster main page and add Dataiku from there.

### Using the base ARM template ###

If you need more control over the Dataiku installation and configuration on an existing HDInsight cluster (for example, to install a specific version or to select a different instance size), another supported method is to use the underlying Azure Resource Manager (ARM) template directly, which can be found in this GitHub repository (see azuredeploy.json). In this case, use the “Template” feature of the Azure portal, copy/paste the content of the JSON file, and deploy it to get access to the full list of parameters.

[Figure: Selections for the base Azure Resource Manager template]
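
If you prefer a fully scripted deployment, the same template can also be submitted programmatically. The following is a minimal sketch using the Azure SDK for Python (azure-mgmt-resource and azure-identity); the subscription ID, resource group, deployment name, and parameter values are placeholders, and the actual parameter names should be read from azuredeploy.json itself:

```python
# A sketch only: deploying the azuredeploy.json template with a recent
# Azure SDK for Python. The subscription ID, resource group, and the
# "clusterName" parameter are placeholders; take the real parameter
# names from azuredeploy.json.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

with open("azuredeploy.json") as f:
    template = json.load(f)

poller = client.deployments.begin_create_or_update(
    "<resource-group>",
    "dss-on-hdinsight",
    {
        "properties": {
            "template": template,
            "parameters": {"clusterName": {"value": "<existing-cluster>"}},
            "mode": "Incremental",
        }
    },
)
poller.result()  # block until the deployment completes
```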

## Working with Dataiku on HDInsight ##

After the installation process has completed, Dataiku runs on a new edge node managed by the HDInsight cluster, and can then be used as with any regular Hadoop cluster.

[Figure: Architecture diagram for the Dataiku and HDInsight integration]

### Accessing Dataiku ###

The URL to connect to Dataiku can be found directly on the HDInsight cluster main page in the Azure portal.

Under “Applications”, click on “DSS on HDInsight”, then on the “Portal” URL. An initial authentication window will pop up, requiring you to enter your cluster credentials (as defined during the setup phase, similar to the Ambari credentials). You will finally be taken to the Dataiku login page, where you’ll be able to choose between a 2-week Enterprise Edition trial or entering your own license.

### Storage layer ###

This is one of the key differences compared to a non-HDInsight cluster.

Microsoft HDInsight relies on Azure Blob Storage as its primary storage, and Azure Blob Storage is therefore used as the default distributed storage by Dataiku (via the “wasb” protocol). For the end user, the behavior is the same as with a regular HDFS dataset in Dataiku; see the main documentation for more information.

Dataiku creates two initial connections on HDInsight:

  • a connection called “hdfs_root”, which provides read-only access to the root level of the data “Container” of the Azure Storage Account, as defined when the cluster is created (this is equivalent to browsing the top-level directory using “hdfs dfs -ls /” on the HDInsight cluster)

  • a connection called “hdfs_managed”, which points to a subdirectory owned by the dataiku user and called “dss_managed_datasets”, with read/write access for datasets fully managed by Dataiku.

Even though “hdfs” is part of the connection names, to remain consistent with other on-premises or non-managed Hadoop deployments, both connections rely on the wasb protocol and point to actual Azure Blob Storage locations.
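
In code, this is equally transparent. The following is a minimal sketch of a Python Coding Recipe reading and writing over these connections; the dataset names “raw_events” and “clean_events” are hypothetical, assumed to live on the “hdfs_root” and “hdfs_managed” connections respectively:

```python
# A minimal Dataiku Python recipe sketch; "raw_events" and "clean_events"
# are hypothetical dataset names. Although the connections are named
# "hdfs_*", the data actually lives at wasb locations such as
# wasb://<container>@<account>.blob.core.windows.net/dss_managed_datasets/...
import dataiku

# Read a dataset backed by the wasb-based connection as a pandas DataFrame
df = dataiku.Dataset("raw_events").get_dataframe()

# ... transform df with regular pandas code ...

# Write the result to a dataset on the hdfs_managed connection
dataiku.Dataset("clean_events").write_with_schema(df)
```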

### Processing layer ###

For Hadoop-only HDInsight clusters, Dataiku will offer the following capabilities:

  • ability to run Visual Recipes using MapReduce / Tez (Hive) as the execution engine
  • ability to write and run Coding Recipes in Hive (SQL) or Pig
  • ability to perform interactive Hive queries using SQL Notebooks

In addition, for Spark-enabled HDInsight clusters, Dataiku will also offer the following capabilities:

  • ability to run Visual Recipes using Spark as the execution engine
  • ability to write and run Coding Recipes in Scala, PySpark, SparkR and SparkSQL (see the PySpark sketch after this list)
  • ability to perform interactive analysis of large datasets using Jupyter Notebooks with Spark kernels (Scala, Python, R)
  • ability to train machine learning models in a distributed fashion using Spark MLlib
  • ability to train machine learning models in a distributed fashion using H2O, through Sparkling Water
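
As an illustration, a PySpark Coding Recipe reads and writes Dataiku datasets through the dataiku.spark wrapper, while the actual processing runs on the cluster. A minimal sketch, assuming hypothetical dataset names (“raw_logs”, “logs_by_status”) and a “status” column:

```python
# A minimal sketch of a Dataiku PySpark recipe (Spark 1.6-era API).
# Dataset names "raw_logs" and "logs_by_status" and the "status" column
# are hypothetical.
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input Dataiku dataset as a Spark DataFrame
df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("raw_logs"))

# Example transformation, executed in a distributed fashion on the cluster
counts = df.groupBy("status").count()

# Write the result back to a Dataiku-managed (wasb-backed) dataset
dkuspark.write_with_schema(dataiku.Dataset("logs_by_status"), counts)
```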

Please note that Hive makes use of HiveServer2, whose configuration is automatically set up at deployment time.

## Extending HDInsight using Dataiku ##

In addition to leveraging the key Hadoop and Spark capabilities listed above, it is easy to integrate HDInsight into a larger data platform.

### Getting data in and out of HDInsight ###

With Dataiku’s built-in or custom connectors, getting data into HDInsight is easy. The following sample scenarios are possible:

  • getting data from legacy and/or operational SQL databases (Microsoft SQL Server, for instance) into HDInsight
  • getting log files stored in Azure Blob Storage (or other cloud providers’ systems) into HDInsight
  • getting semi-structured data from NoSQL stores into HDInsight
  • collecting data from HTTP APIs and storing the outputs in HDInsight

Once these sources are ingested into HDInsight, Dataiku makes it easy to blend them all together.

In addition to getting data into HDInsight, it is equally easy to store the outputs of HDInsight jobs in external systems (SQL, NoSQL, files…) for further processing.
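
As an illustration, the following minimal sketch of a Python Coding Recipe streams an external SQL dataset into a wasb-backed HDInsight dataset chunk by chunk; the dataset names “sqlserver_orders” and “orders_on_hdinsight” are hypothetical:

```python
# A sketch of moving data into HDInsight chunk by chunk to keep memory
# bounded; "sqlserver_orders" (a SQL Server dataset) and
# "orders_on_hdinsight" (a wasb-backed HDFS dataset) are hypothetical.
import dataiku

src = dataiku.Dataset("sqlserver_orders")
dst = dataiku.Dataset("orders_on_hdinsight")

# iter_dataframes yields pandas DataFrames of at most `chunksize` rows
chunks = src.iter_dataframes(chunksize=100000)
first = next(chunks)

# Set the output schema from the first chunk, then stream all chunks
dst.write_schema_from_dataframe(first)
with dst.get_writer() as writer:
    writer.write_dataframe(first)
    for chunk in chunks:
        writer.write_dataframe(chunk)
```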

### Advanced analytics using Python or R ###

Dataiku can also be used to perform advanced analytics, including machine learning, by leveraging R and Python running in the main memory of the HDInsight edge node (which hosts Dataiku). Users can leverage these programming languages either through Coding Recipes or through Jupyter Notebooks for interactive analytics. In addition, machine learning models can be trained and analyzed using a Python “in-memory” backend.
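
For instance, a Python Coding Recipe can train a model entirely in the memory of the edge node with a general-purpose library such as scikit-learn. A minimal sketch, assuming a hypothetical “churn_features” dataset with a “churned” target column and scikit-learn installed in the Python environment:

```python
# In-memory model training on the edge node; the "churn_features" dataset
# and its column names are hypothetical.
import dataiku
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the full dataset into a pandas DataFrame (in edge-node memory)
df = dataiku.Dataset("churn_features").get_dataframe()

X = df.drop("churned", axis=1)
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out split
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```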

Dataiku then makes it easy to create complete data science pipelines, blending MapReduce- or Spark-based jobs running on HDInsight to process large datasets with Python or R jobs that perform predictive modeling and machine learning.

## Additional information ##

  • Microsoft HDInsight is built upon the Hortonworks Data Platform, so the Dataiku end-user experience will be very similar on these two platforms
  • The R integration offered through Dataiku is based on CRAN R, not Microsoft R
  • When using Dataiku over HDInsight, the platform administrators are responsible for performing regular backups of the Dataiku internal data directory to prevent data loss if the cluster shuts down.
  • As of March 2017, the only supported way to connect Dataiku to HDInsight is through an edge node fully managed by HDInsight itself (i.e., the procedures described above). There is no supported or official way to configure an existing “external” Azure VM as an edge node of HDInsight.