
Using Microsoft HDInsight and Dataiku to predict credit default

March 27, 2017

This article will show how to build a predictive model for credit scoring using Microsoft HDInsight and Dataiku. We’ll build a very simple workflow that uses only visual recipes for both data preparation and machine learning (no coding required) and runs entirely on Spark.

Prerequisites

You’ll need access to a Microsoft HDInsight cluster, configured as HDI 3.5 with Spark 1.6, with Dataiku installed as a third-party application. More details can be found in the reference documentation. You’ll also need Data Scientist access in Dataiku to be able to create machine learning models.

Source data

The source dataset used in this article comes from the “Give Me Some Credit” Kaggle challenge. If you do not have a Kaggle account, please sign up first.

You’ll need to download the “cs-training.csv” and “cs-test.csv” files.
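If you would like a quick look at the training file before uploading it, the optional pandas sketch below (run locally, assuming the file sits in your working directory) prints its shape and the distribution of the target column:

    # Optional local peek at the Kaggle training file before uploading it to Dataiku.
    # Assumes "cs-training.csv" is in the current working directory.
    import pandas as pd

    df = pd.read_csv("cs-training.csv")
    print(df.shape)                                   # roughly 150,000 rows
    print(df.columns.tolist())                        # unnamed index column, target and credit variables
    print(df["SeriousDlqin2yrs"].value_counts())      # the target we will learn to predict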

Predicting credit default with HDInsight and Dataiku

Preparing your environment

Once HDInsight is up and running, start by creating a new project in Dataiku, which you can name “Credit Scoring Demo” for instance. As a good practice, add a picture and a description to your project so that other members of your team can easily see what it is about.
Credit scoring project home

Creating the first Dataiku Dataset

From the project homepage, hit the “Import your first dataset” button to create your first Dataiku Dataset, then use the “Upload your files” functionality under “Files”. Start by uploading the “cs-training.csv” file, and when done, click “Preview”. You may notice that the column headers are not used properly (because there is an unnamed index column at the beginning), so tick the “Parse next line as column headers” box to fix the issue.
Preview of cs-training dataset
Click “Create”. Your Dataset is now ready to be used in Dataiku.

Pushing your data to HDInsight

Making your data available to HDInsight is fairly straightforward. Create a new “Sync” recipe and select “hdfs_managed” as the target destination. Create the recipe, then click the green “Run” button at the bottom left of the next screen.
Using a Sync recipe to push local data to HDInsight
Go to the “Flow” menu. You can now see that your data has been transferred from the local Dataiku filesystem to HDInsight in a couple of clicks. In our example, the dataset residing in HDInsight is called “cs_training_wasb”.
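If you want to double-check the synced dataset from within Dataiku, here is a minimal, purely optional sketch using a Python notebook inside the project and the built-in dataiku package (the visual flow in this article does not require it):

    # Optional check from a Python notebook inside the Dataiku project.
    import dataiku

    ds = dataiku.Dataset("cs_training_wasb")   # name created by the Sync recipe above
    df = ds.get_dataframe()                    # loads the HDInsight-backed dataset as a pandas DataFrame
    print(df.head())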

IMPORTANT NOTE:
You just used the “hdfs_managed” connection to write to HDInsight. This is a default name provided by Dataiku, but in the case of HDInsight, HDFS actually refers to the “wasb” primary storage, backed by Azure Blob Storage. You can verify that your data has been written to Azure Blob Storage by browsing the container storing your HDInsight data via the Azure portal:
Verifying the data was written to Azure Blob Storage
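For reference, the same Blob Storage data is also addressable from Spark on the cluster through a wasb:// URI. Here is a small, purely illustrative sketch from a pyspark shell on HDInsight, where the container, storage account and path are hypothetical placeholders:

    # Illustration only: reading the synced data directly from Blob Storage with Spark.
    # Container, storage account and path below are hypothetical placeholders.
    lines = sc.textFile(
        "wasb://mycontainer@mystorageaccount.blob.core.windows.net/dataiku/cs_training_wasb/")
    print(lines.take(2))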

Preparing the data

A few steps are required to make your dataset ready for Machine Learning. Open the newly created dataset in HDInsight (“cs_training_wasb”), and click the “Lab” button at the top right. Create a new “Visual Analysis”. The data preparation is very simple:

  • Remove the unnecessary “col_0” column
  • Clear all cells where value is “NA”
  • Set the meaning of the “SeriousDlqin2yrs” column to Text

Data preparation script
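For readers curious about what these three steps amount to, here is a rough PySpark (Spark 1.6) sketch of the same operations. It is only an illustration, not what this walkthrough actually uses (the visual Prepare recipe is); the input path is a hypothetical placeholder, and “col_0” is the name Dataiku gave the unnamed index column.

    # Rough PySpark (Spark 1.6) sketch of the three visual preparation steps; illustrative only.
    # Assumes a pyspark shell on the cluster (sqlContext available) with the spark-csv package.
    from pyspark.sql import functions as F

    raw = (sqlContext.read.format("com.databricks.spark.csv")
           .options(header="true")
           .load("wasb:///dataiku/cs_training_wasb/"))        # hypothetical input path

    prepared = raw.drop("col_0")                              # 1. remove the unnecessary index column
    for c in prepared.columns:                                # 2. clear cells whose value is "NA"
        prepared = prepared.withColumn(
            c, F.when(F.col(c) == "NA", F.lit(None)).otherwise(F.col(c)))
    prepared = prepared.withColumn(                           # 3. keep the target as text
        "SeriousDlqin2yrs", F.col("SeriousDlqin2yrs").cast("string"))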

When ready, deploy the preparation script, store the output again in “hdfs_managed” (i.e. Azure wasb), and select “Parquet” as the storage format. Click “Deploy” (without building the Dataset yet).

The Visual Data Preparation recipe can now leverage HDInsight and be run on Spark. Click the small cog under the “Run” button at the bottom left, and select “Spark” as the execution engine.
Running the Prepare script using Spark as the engine

Click “Run” and observe the progress of your Spark job in the “Jobs” menu of Dataiku. The complete Yarn and Spark logs are surfaced there, allowing you to track progress and debug your jobs in case of issues. Wait for the job to complete.

Creating the predictive model

The data is now ready for machine learning. Open the last dataset created (“cs_training_wasb_prepared” if you kept the default name), click “Lab” again, but this time select “Quick model”, then “Prediction”. Select “SeriousDlqin2yrs” as the target variable, then “MLLib” as the backend for training the model:
Choosing the MLLib backend for training the model

Create the analysis, and before actually training the model, click “Settings” at the top right of your screen. Review the various underlying settings of your model, then in the Algorithms section, add a “Random Forest” with 80 trees to the list of algorithms to test.
Choosing the algorithms settings for training the model

Click “Save”, then “Train”. The entire process of training the machine learning model to predict credit default will be delegated to HDInsight, and more specifically to MLLib, the Spark set of functionalities dedicated to distributed machine learning. Dataiku will also leverage Spark for any preprocessing required before the model can be trained.
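To give an idea of what this delegation looks like under the hood, here is a hedged Spark 1.6 ML sketch continuing from the “prepared” DataFrame of the earlier preparation sketch. Dataiku generates its own preprocessing and training code, so this is not the exact code it runs:

    # Hedged sketch of training a random forest with 80 trees on Spark 1.6; not the code
    # Dataiku generates. "prepared" is the DataFrame from the earlier preparation sketch.
    from pyspark.sql import functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    feature_cols = [c for c in prepared.columns if c != "SeriousDlqin2yrs"]

    # Cast features to numeric and fill remaining missing values; in Dataiku this is
    # handled by the visual ML preprocessing settings.
    train = prepared
    for c in feature_cols:
        train = train.withColumn(c, F.col(c).cast("double"))
    train = train.na.fill(0.0)

    # Index the target, then train a random forest with 80 trees, as configured above.
    train = StringIndexer(inputCol="SeriousDlqin2yrs", outputCol="label").fit(train).transform(train)
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=feature_cols, outputCol="features"),
        RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=80),
    ])
    model = pipeline.fit(train)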

Wait for your models to train. Once ready, you may observe quite different performance between the logistic regression and the random forest: the random forest has a much higher AUC, indicating better performance.
Model summary output

Open the random forest results. You can now analyze insights related to the model, for instance which variables contribute the most to predicting credit default, or assess the performance of your classifier through various screens:
ROC curve for random forest model

Deploying the predictive model

The model is now ready to be operationalized. Start by deploying your model to the Flow.
Deploying a model to the Flow

To simulate what using the model in production would look like, you can:

  • upload the “cs-test.csv” file to Dataiku
  • reuse the initial Sync recipe of the training dataset using the “Copy” function
  • reuse the initial Visual Data Preparation recipe of the training dataset using the “Copy” function

Please do not forget to change the inputs and outputs of each of your recipes. When done, you can use the model through the “Score” recipe to get predictions for the test set. Your final workflow may look like the following:
Flow with the Score recipe included to score the test dataset
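Conceptually, the Score recipe applies the trained model to the new data, much like the following continuation of the earlier sketches, where “test_prepared” stands for the cs-test data after the same preparation and casting steps; in practice the visual Score recipe does this for you on Spark:

    # Conceptual equivalent of the Score recipe, continuing the earlier sketches.
    # "test_prepared" is assumed to be the cs-test data after the same preparation
    # and numeric casting as the training set.
    scored = model.transform(test_prepared)
    scored.select("prediction", "probability").show(5)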

From there, three options are available:

  • publish some Insights related to the model on the Dataiku Dashboard
  • produce predictions in batch mode on a scheduled basis using Dataiku Scenarios
  • expose the predictions as an API service using the Dataiku Scoring node

Please consult the corresponding documentation for more information on each of these options.
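As a rough illustration of the third option: once the model is deployed as a service on a Dataiku API (scoring) node, any application can request predictions over HTTP. The URL, service name and payload below are placeholders to adapt from the Dataiku documentation, not the exact format of your deployment:

    # Hypothetical call to a model exposed on a Dataiku scoring node; the URL, service
    # name and payload format are placeholders to adapt from the Dataiku documentation.
    import requests

    record = {
        "RevolvingUtilizationOfUnsecuredLines": 0.5,
        "age": 42,
        "DebtRatio": 0.35,
        "MonthlyIncome": 5000,
        # ... remaining features from the cs-training.csv schema
    }
    response = requests.post(
        "https://api-node.example.com/public/api/v1/credit-scoring/default/predict",  # placeholder
        json={"features": record},
    )
    print(response.json())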

Your HDInsight-powered machine learning workflow is now ready. Dataiku acts as the top, visual layer to build the pipeline, while Spark is used as the execution engine of this pipeline. This can easily be verified from the Flow, where the “Engines” view highlights the backend used at each step of the pipeline:
Flow with the Engines view turned on to see which backend engine does the processing at each step

Wrap up

This article showed how an MLLib predictive model of credit default could be trained and deployed using only a few visual components in Dataiku: Sync, Visual Data Preparation and Visual Machine Learning. More advanced options remain available, of course, to develop more sophisticated workflows, for instance Code Recipes, either MapReduce-based (Hive, Pig) or Spark-based (Scala, PySpark, SparkR or SparkSQL).

Leveraging the integration between Dataiku and Microsoft HDInsight opens up the ability to quickly develop and run a complete data science workflow at scale in the cloud, in minutes.