Microsoft HDInsight is a fully managed Hadoop and Spark cloud service running on Microsoft Azure. HDInsight lets you run third-party applications through a quick, integrated deployment process, right from the Azure portal.
In this tutorial, we’ll help you get started with running Dataiku on top of HDInsight.
You need a Microsoft Azure account, with sufficient credentials to access the Azure portal and manage HDInsight clusters. You may need to sign up first; in this case, please follow these instructions.
Dataiku is available as a third-party application directly from the HDInsight Azure portal, and can be installed either at cluster creation time or after the cluster has been created.
The process of installing Dataiku on a new HDInsight cluster is straightforward:
Spinning up the cluster and configuring Dataiku may take approximately 30 to 50 minutes, depending on your settings. Once the process is complete, you will be taken to the HDInsight main page.
The installation process for Dataiku is also available for an existing cluster. You’ll just need to go to the “Applications” tab on your HDInsight cluster main page, and add Dataiku from there.
If you need more control over the Dataiku installation and configuration on an existing HDInsight cluster (for example, to install a specific version or to select a different instance size), another supported method is to use the underlying Azure Resource Manager template directly, which can be found in this GitHub repository (see azuredeploy.json). In this case, use the “Template” feature of the Azure portal, copy and paste the content of the JSON file, and deploy it to get access to the full list of parameters.
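As a minimal sketch of working with the template’s parameters before deploying, the snippet below builds an ARM deployment parameters file. The parameter names (clusterName, edgeNodeSize, dssVersion) are illustrative assumptions; check azuredeploy.json in the repository for the actual names it expects.

```python
import json

# Hypothetical parameter overrides -- the names below are assumptions,
# not the template's real parameter list. Inspect azuredeploy.json to
# see which parameters the template actually exposes.
parameters = {
    "clusterName": {"value": "my-hdinsight-cluster"},
    "edgeNodeSize": {"value": "Standard_D13_v2"},
    "dssVersion": {"value": "latest"},
}

# ARM deployments expect parameters wrapped in this standard envelope.
payload = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": parameters,
}

print(json.dumps(payload, indent=2))
```

The resulting JSON can be supplied alongside the template when deploying, instead of filling in each parameter by hand in the portal.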
After the installation process has completed, Dataiku is running on a new edge node managed by the HDInsight cluster, and can then be used as with any regular Hadoop cluster.
The URL to connect to Dataiku can be found directly from the HDI main page on the Azure portal.
Under Applications, click on “DSS on HDInsight”, then on the “portal” URL. An initial authentication window will pop up, requiring you to enter your cluster credentials (as defined during the setup phase, similar to the Ambari credentials). You will then be taken to the Dataiku login page, where you can choose between a two-week Enterprise Edition trial or entering your own license.
This is one of the key differences compared to a non-HDInsight cluster.
Microsoft HDInsight relies on Azure Blob Storage as its primary storage, so Azure Blob Storage is used as the default distributed storage by Dataiku (the “wasb” protocol). For the end user, the behavior is the same as a regular HDFS dataset in Dataiku; see the main documentation for more information.
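To make the wasb mapping concrete, here is a small illustrative sketch of how a path on the cluster resolves to an Azure Blob Storage location. The account, container, and directory names are placeholders, not values taken from a real deployment.

```python
# Illustrative sketch: how HDInsight addresses Azure Blob Storage
# through the "wasb" protocol. Account/container names are placeholders.
def wasb_uri(container: str, account: str, path: str) -> str:
    """Build a wasb:// URI for a blob path, as HDInsight addresses it."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

# The cluster's default filesystem root ("hdfs dfs -ls /") resolves to the
# container root; any subpath resolves to a blob prefix inside it.
# The managed-datasets path below is a hypothetical example.
root = wasb_uri("mycontainer", "mystorageaccount", "/")
managed = wasb_uri("mycontainer", "mystorageaccount", "/user/dataiku/dss_managed_datasets")
print(root)
print(managed)
```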
Dataiku creates two initial connections on HDInsight:
a connection called “hdfs_root”, which provides read-only access to the root level of the data “Container” of the Azure Storage Account, as defined when the cluster is created (this is equivalent to browsing the top-level directory with “hdfs dfs -ls /” on the HDInsight cluster)
a connection called “hdfs_managed” which points to a subdirectory owned by the dataiku user and called “dss_managed_datasets”, with read/write access for datasets fully managed by Dataiku.
Even though “hdfs” is part of the connection names, to remain consistent with other on-premises or non-managed Hadoop deployments, both connections rely on the wasb protocol and point to actual Azure Blob Storage locations.
For Hadoop-only HDInsight clusters, Dataiku will offer the following capabilities:
In addition, for Spark-enabled HDInsight clusters, Dataiku will also offer the following capabilities:
Please note that Hive makes use of HiveServer2, whose configuration is automatically set up at deployment time.
In addition to leveraging the key Hadoop and Spark capabilities listed above, it is easy to integrate HDInsight into a larger data platform.
Once these data sources are ingested into HDInsight, Dataiku makes it easy to blend them together.
In addition to getting data into HDInsight, it is equally easy to store the outputs of HDInsight jobs in external systems (SQL, NoSQL, files…) for further processing.
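As a sketch of this export step, the snippet below pushes the CSV output of a job into a SQL table for downstream processing. In Dataiku this would typically be a sync to a SQL connection; here an in-memory sqlite3 database and a made-up two-row result stand in for the external system.

```python
import csv
import io
import sqlite3

# Made-up job output: a small CSV result, as an HDInsight job might produce.
job_output = io.StringIO("customer_id,score\n42,0.91\n43,0.17\n")

# sqlite3 stands in for the external SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (customer_id INTEGER, score REAL)")
rows = [(int(r["customer_id"]), float(r["score"])) for r in csv.DictReader(job_output)]
conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0])  # 2 rows loaded
```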
Dataiku can also be used to perform advanced analytics, including machine learning, by leveraging R and Python running in the main memory of the HDInsight edge node (hosting Dataiku). Users can leverage these programming languages either through Coding Recipes, or Jupyter Notebooks for interactive analytics. In addition, machine learning models can be trained and analyzed using a Python “in-memory” backend.
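The following is a minimal sketch of this kind of in-memory training, as a Python coding recipe or notebook might run it on the edge node. A tiny hand-rolled nearest-centroid classifier and made-up data stand in for a real machine learning library; it is not how Dataiku’s visual ML backend is implemented.

```python
from collections import defaultdict

def fit(X, y):
    """Compute one centroid per class, entirely in memory."""
    sums, counts = defaultdict(lambda: [0.0, 0.0]), defaultdict(int)
    for (a, b), label in zip(X, y):
        sums[label][0] += a
        sums[label][1] += b
        counts[label] += 1
    return {lbl: (s[0] / counts[lbl], s[1] / counts[lbl]) for lbl, s in sums.items()}

def predict(centroids, point):
    """Assign the class of the nearest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda lbl: (point[0] - centroids[lbl][0]) ** 2
                                        + (point[1] - centroids[lbl][1]) ** 2)

# Made-up training data: two features, two classes.
X = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 1.0)]
y = ["low", "low", "high", "high"]
model = fit(X, y)
print(predict(model, (0.1, 0.0)))  # -> "low"
```

A real recipe would read its inputs from a Dataiku dataset and write predictions back to one, but the train-then-score flow is the same.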
Dataiku then makes it easy to create complete data science pipelines, blending MapReduce or Spark-based jobs running on HDInsight to process large datasets with Python or R jobs that perform predictive modeling and machine learning.