This article will show how to build a predictive model for credit scoring using Microsoft HDInsight and Dataiku. We’ll build a simple workflow that uses only visual recipes for both data preparation and machine learning (no coding required) and runs entirely on Spark.
You’ll need access to a Microsoft HDInsight cluster, configured as HDI 3.5 with Spark 1.6, with Dataiku installed as a third-party application. More details can be found in the reference documentation. You’ll also need Data Scientist access in Dataiku to be able to create machine learning models.
The source dataset used in this article comes from the “Give Me Some Credit” Kaggle challenge. If you do not have a Kaggle account, please sign up first.
You’ll need to download the “cs-training.csv” and “cs-test.csv” files.
Once HDInsight is up and running, start by creating a new project in Dataiku, which you can name “Credit Scoring Demo” for instance. As a good practice, add a picture and a description to your project, so that the other members of your team can easily see what it is about.
From the project homepage, hit the “Import your first dataset” button to create your first Dataiku dataset, then use the “Upload your files” functionality under “Files”. Start by uploading the “cs-training.csv” file and, when done, click “Preview”. You may notice that the column headers are not handled properly (because the file starts with an unnamed index column), so tick the “Parse next line as column headers” box to fix the issue. Click “Create”. Your dataset is now ready to be used in Dataiku.
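As an aside, you can reproduce this header check outside DSS. Here is a quick pandas sketch (illustrative only, assuming the downloaded file sits in your working directory):

```python
import pandas as pd

# The Kaggle file starts with an unnamed index column; header=0 takes the
# first line as column names and index_col=0 absorbs that index column,
# mirroring what the "Parse next line as column headers" option does in DSS.
df = pd.read_csv("cs-training.csv", header=0, index_col=0)
print(df.columns.tolist())  # includes 'SeriousDlqin2yrs', our future target
print(df.shape)
```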
Making your data available to HDInsight is fairly straightforward. Create a new “Sync” recipe and select “hdfs_managed” as the target destination. Create the recipe, then click the green “Run” button at the bottom left of the next screen. Go to the “Flow” menu: your data has now been transferred from the local Dataiku filesystem to HDInsight in a couple of clicks. In our example, the dataset residing in HDInsight is called “cs_training_wasb”.
You just used the “hdfs_managed” connection to write to HDInsight. This is a default name provided by Dataiku, but in the case of HDInsight, “HDFS” actually refers to WASB, the primary storage system for HDInsight, backed by Azure Blob Storage. You can verify that your data has been written to Azure Blob Storage by browsing the container storing your HDInsight data in the Azure portal:
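You can also check this from code. As a minimal sketch, assuming a PySpark shell or notebook on the cluster (where the SparkContext `sc` is pre-created), and with a purely illustrative wasb:// path, you could peek at the synced files:

```python
# The wasb:// URI below is illustrative: substitute your own container,
# storage account, and the path DSS wrote the "cs_training_wasb" dataset to.
lines = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/dataiku/cs_training_wasb/")
print(lines.take(3))  # first few CSV rows, confirming the data landed in Blob Storage
```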
A few steps are required to make your dataset ready for Machine Learning. Open the newly created dataset in HDInsight (“cs_training_wasb”), and click the “Lab” button at the top right. Create a new “Visual Analysis”. The data preparation is very simple:
When ready, deploy the preparation script, store the output again in “hdfs_managed” (i.e. Azure WASB), and select “Parquet” as the storage format. Click “Deploy” (without building the dataset yet).
The Visual Data Preparation recipe can now leverage HDInsight and be run on Spark. Click the small cog under the “Run” button at the bottom left, and select “Spark” as the execution engine.
Click “Run” and observe the progress of your Spark job in the “Jobs” menu of Dataiku. The complete YARN and Spark logs are surfaced there, allowing you to track progress and debug your jobs in case of issues. Wait for the job to complete.
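For reference, the same read-prepare-write pattern could also be expressed as a PySpark code recipe using Dataiku’s Spark integration. This is a simplified sketch; the dropna step merely stands in for whatever cleaning your visual script performs:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input dataset as a Spark DataFrame
input_ds = dataiku.Dataset("cs_training_wasb")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Placeholder preparation step: drop rows with a missing target
df_prepared = df.dropna(subset=["SeriousDlqin2yrs"])

# Write to the managed output dataset (Parquet on WASB)
output_ds = dataiku.Dataset("cs_training_wasb_prepared")
dkuspark.write_with_schema(output_ds, df_prepared)
```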
The data is now ready for machine learning. Open the last dataset created (“cs_training_wasb_prepared” if you kept the default name), click “Lab” again, but this time choose “Quick model”, then “Prediction”. Select “SeriousDlqin2yrs” as the target variable, and “MLLib” as the backend for training the model:
Create the analysis, and before actually training the model, click “Settings” at the top right of your screen. Review the various underlying settings of your model, then in the Algorithms section, add a “Random Forest” with 80 trees to the list of algorithms to test.
Click “Save”, then “Train”. The entire process of training the machine learning model to predict credit default is delegated to HDInsight, and more specifically to MLLib, the set of Spark functionalities dedicated to distributed machine learning. Dataiku will also leverage Spark for any preprocessing required before the model can be trained.
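Under the hood, what happens is roughly equivalent to the following PySpark sketch (simplified, not the exact code DSS generates; `df_prepared` is the prepared data loaded as a Spark DataFrame as in the earlier sketch, and the feature list is abridged):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Hold out part of the prepared data for evaluation
train_df, test_df = df_prepared.randomSplit([0.8, 0.2], seed=42)

# Spark ML tree classifiers expect an indexed label column
label_indexer = StringIndexer(inputCol="SeriousDlqin2yrs", outputCol="label")

# Assemble numeric features into a single vector (feature list abridged;
# DSS also imputes missing values, e.g. MonthlyIncome, before this stage)
assembler = VectorAssembler(
    inputCols=["RevolvingUtilizationOfUnsecuredLines", "age",
               "DebtRatio", "NumberOfOpenCreditLinesAndLoans"],
    outputCol="features")

# Random forest with 80 trees, matching the setting chosen above
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=80)

model = Pipeline(stages=[label_indexer, assembler, rf]).fit(train_df)
```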
Wait for your models to train. Once they are ready, you may observe very different results for the logistic regression and the random forest: the random forest reaches a much higher AUC, indicating better predictive performance.
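In code, this AUC comparison would be done with Spark’s binary classification evaluator, continuing the sketch above (`areaUnderROC` is the evaluator’s default metric):

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the fitted pipeline on the held-out split
evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          rawPredictionCol="rawPrediction")
auc = evaluator.evaluate(model.transform(test_df))
print("Random forest AUC: %.3f" % auc)
```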
Open the random forest results. You can now analyze insights related to the model, for instance which variables contribute the most to predicting credit default, or assess the performance of your classifier through various screens:
The model is now ready to be operationalized. Start with deploying your model to the Flow.
To simulate what using the model would look like, you can upload the “cs-test.csv” file as a new dataset, then reproduce the same Sync and Preparation steps on it. Do not forget to change the inputs / outputs of each of these recipes accordingly. When done, you can use the model through the “Score” recipe to get the predictions for the test set, as sketched below.
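A rough code equivalent of what the Score recipe does, continuing the earlier sketches (`test_prepared_df` stands for the prepared test set loaded as a Spark DataFrame; since the Kaggle test file ships with an empty target column, the label indexer stage is skipped):

```python
# Apply only the feature assembler and the trained forest from the pipeline
# (stage 0 is the label indexer, which the unlabeled test set cannot go through)
assembler_model, rf_model = model.stages[1], model.stages[2]
scored = rf_model.transform(assembler_model.transform(test_prepared_df))

# `prediction` holds the predicted class, `probability` the per-class scores
scored.select("prediction", "probability").show(5)
```

Your final workflow may look like the following: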
From there, three options are available; please consult the corresponding documentation for more information on each of them.
Your HDInsight-powered machine learning workflow is now ready. Dataiku acts as the top, visual layer used to build the pipeline, while Spark serves as its execution engine. This can easily be verified from the Flow, where the “Engines” view highlights the backend used for each part of the pipeline:
This article showed how an MLLib predictive model of credit default can be trained and deployed using only a few visual components in Dataiku: Sync, Visual Data Preparation and Visual Machine Learning. More advanced options of course remain available to develop sophisticated workflows, for instance Code Recipes, either MapReduce-based (Hive, Pig) or Spark-based (Scala, PySpark, SparkR or SparkSQL).
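For instance, a Spark-based code recipe could mix the visual steps above with ad-hoc SQL. A small illustrative snippet (Spark 1.6 API, reusing `df_prepared` and `sqlContext` from the earlier sketches):

```python
# Explore default rates by age with SparkSQL, inside a PySpark recipe
df_prepared.registerTempTable("credit")
by_age = sqlContext.sql("""
    SELECT age, COUNT(*) AS n, AVG(SeriousDlqin2yrs) AS default_rate
    FROM credit
    GROUP BY age
    ORDER BY age
""")
by_age.show(10)
```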
Leveraging the integration between Dataiku and Microsoft HDInsight opens up the ability to quickly develop and run a complete data science workflow at scale in the cloud, in minutes.
Applies to: DSS 4.0 and above