A Kaggle data science competition made easy

Technology | Data Science | Machine Learning | December 08, 2015 | Alivia

At Dataiku, every new member of the team, whether marketer or superstar data scientist, has to try DSS out with the Titanic Kaggle competition. It's such a milestone in the company that our first meeting room was named after it! So I decided to write an article on how to make your first Kaggle submission in 5 minutes and become a part of the team.



Titanic Kaggle Competition


Kaggle is a great site where companies or researchers post data and make it available for data scientists, statisticians and pretty much anyone to play around and find insights. Many of our data scientists participate in Kaggle competitions, but our favorite is still the Titanic. Why? Because anyone can understand it: the goal of the challenge is to predict who on the Titanic will survive.

Let's go right ahead and save some lives! Follow these easy steps to become a kaggler.

NB: This is an easy project. Nonetheless, it'll go more smoothly if you've completed all three of our "getting started" tutorials, or at least have some experience with Data Science Studio.

Step 1: Collect your data from Kaggle

Before you really get started, you'll need to create an account on Kaggle. You can go right ahead, these guys aren't big on spamming (at all).

Now, to begin the challenge, go to this link. This is where you'll get your data sources.

As you can see, there are many different datasets. We only need two for our first submission: the train dataset and the test dataset. So click on these and download them.

Now open up your Data Science Studio (or download the community edition over here).

Create a new project. You can even set the project image right away to this one.

DSS titanic project Alivia Smith

Now we can start the project. Click to Import your First Dataset. Upload both CSV files (separately) to create a test and a train dataset.

Now you can go check out your flow; this is what it should look like:

Dataiku DSS kaggle titanic project

Test vs Train

When working on machine learning projects, you'll always be working with a test and a train dataset.

The train dataset is a set of instances that have already been scored. In other words, you have the target feature that you're going to try to predict with your algorithm. This is the dataset you'll be training your algorithm on (hence the name).

Your test dataset is the one you'll deploy your algorithm on to score new instances. In our case, this is the dataset we'll submit to Kaggle. It's called "test" here because Kaggle uses it to test the results of your algorithm and make sure you didn't overfit your model. In general, it's simply the incoming data that you want to score.

Tip: Both datasets must of course have the same features, so at Dataiku we'll often stack them at the beginning of our data wrangling process, then split them again just before training our algorithm.
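DSS handles this stacking through its UI, but the idea translates directly to code. Here is a minimal pandas sketch of the same pattern, using tiny made-up rows whose column names mirror the Titanic files (the real files have many more rows and columns):

```python
import pandas as pd

# Hypothetical miniature versions of the Kaggle train/test files.
train = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28], "Survived": [0, 1]})
test = pd.DataFrame({"PassengerId": [3, 4], "Fare": [8.05, 53.10]})  # no Survived column

# Stack both so any cleaning/enrichment is applied once, to identical features.
stacked = pd.concat(
    [train.assign(source="train"), test.assign(source="test")],
    ignore_index=True,
)

# ... shared wrangling steps would go here ...

# Split again just before training: rows with a known target vs. rows to score.
train_again = stacked[stacked["source"] == "train"].drop(columns="source")
test_again = stacked[stacked["source"] == "test"].drop(columns=["source", "Survived"])
```

The `source` flag column is an assumption of this sketch; any marker that lets you split the stack back apart works just as well.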

Step 2: Build your machine learning model

Now you can go ahead and click on your train dataset to explore the data.

Our goal today is to submit on Kaggle as fast as possible, so I won't go into analyzing the different features and cleaning or enriching the data. At a glance you can see that we have a unique passenger ID and 11 features, including the one we want to predict: Survived.

Let's go ahead and click on analyze and create a new analysis.

The next step is to go straight to the header of the Survived feature, click it, and select Build prediction model. Click Create and your model is ready to train. Now THAT was easy.

Dataiku DSS kaggle titanic project

Click on train. VOILÀ, our model is done!

Dataiku DSS kaggle titanic project

By default, DSS trains random forest and logistic regression algorithms and ranks them by performance as measured by ROC AUC. The random forest outperforms logistic regression here, so let's click on that one.
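What DSS does behind the scenes can be sketched with scikit-learn: train both algorithms, score each one by ROC AUC on held-out data, and rank them. This sketch uses a synthetic dataset as a stand-in, since the real Titanic features need cleaning first:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC AUC is computed from predicted probabilities, not hard labels.
    scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

ranking = sorted(scores, key=scores.get, reverse=True)  # best first
print(ranking)
```

The hyperparameters here are scikit-learn defaults, not necessarily the ones DSS uses.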

Dataiku DSS kaggle titanic project

This article is really good at describing all the different performance indicators of algorithms so I won't get into that.

It is interesting, though, to have a quick look at the variable importance. Unsurprisingly, gender is the most decisive feature, along with how much a passenger paid and their class. As far as our model is concerned, the Titanic wasn't so much about "women and children first" as "rich women before rich men."
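If you're curious where those importance numbers come from: a trained random forest exposes one score per feature, and they sum to 1. A minimal sketch on a tiny made-up sample (the real numbers come from the full cleaned dataset):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up sample in the spirit of the Titanic data.
df = pd.DataFrame({
    "Sex":      [0, 1, 1, 0, 1, 0, 1, 0],   # 0 = male, 1 = female (encoded)
    "Fare":     [7.3, 71.3, 26.6, 8.1, 53.1, 8.5, 51.9, 21.1],
    "Pclass":   [3, 1, 1, 3, 1, 3, 1, 2],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
})
features = ["Sex", "Fare", "Pclass"]
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(df[features], df["Survived"])

# One importance score per feature; together they sum to 1.
importances = pd.Series(rf.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

On such a small sample the exact ranking is noisy; the point is only where the numbers in the DSS screen come from.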

Dataiku DSS kaggle titanic project

Let's not dwell on the past and move on.

Step 3: Deploy your model

Now that you've built your model, it's time to deploy it. Go to the top right corner and click Deploy. Keep the default name and deploy it on the train dataset.

You have a new step in your flow! Check out that model ;)

Dataiku DSS kaggle titanic project

Step 4: Apply your model

Our next step is to apply that model to our test dataset. In your flow, click on your model and then click on the Apply recipe on the right. Select the test dataset and hit Create Recipe.

Run your model with the Default settings.

Then explore the output dataset, test_scored. You can see three new columns: a prediction column at the far right saying whether the model predicts that the passenger survives, and two columns with the probability of each outcome.
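Those three columns correspond to a classifier's `predict` and `predict_proba` outputs. A minimal sketch, with a toy fitted model standing in for the one DSS deployed (column names here are illustrative, not the exact ones DSS emits):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the deployed one.
train = pd.DataFrame({"Fare": [7.0, 70.0, 8.0, 60.0], "Survived": [0, 1, 0, 1]})
model = LogisticRegression(max_iter=1000).fit(train[["Fare"]], train["Survived"])

test = pd.DataFrame({"PassengerId": [10, 11], "Fare": [9.0, 65.0]})
proba = model.predict_proba(test[["Fare"]])
scored = test.assign(
    proba_0=proba[:, 0],                      # probability of not surviving
    proba_1=proba[:, 1],                      # probability of surviving
    prediction=model.predict(test[["Fare"]]), # hard 0/1 prediction
)
print(scored)
```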

Dataiku DSS kaggle titanic project

Step 5: Format your output dataset

The final step is to prepare your submission. Kaggle requires a specific format: a CSV file with two columns, the passenger ID and the predicted output, with specific column names.

So go ahead and create an analysis of your scored dataset.

Add a step in the shaker to keep only the PassengerId and the prediction columns. Then rename the prediction column "Survived."
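The same keep-and-rename step can be sketched in pandas, here on a hypothetical scored dataframe (the column names `proba_1` and `prediction` are illustrative stand-ins for whatever your scored dataset contains):

```python
import pandas as pd

# Hypothetical scored output with extra columns from the model.
scored = pd.DataFrame({
    "PassengerId": [892, 893],
    "proba_1":     [0.12, 0.87],
    "prediction":  [0, 1],
})

# Keep only the two columns Kaggle expects, with the exact names it requires.
submission = scored[["PassengerId", "prediction"]].rename(
    columns={"prediction": "Survived"})
submission.to_csv("submission.csv", index=False)
```

`index=False` matters: Kaggle rejects files with an extra unnamed index column.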

Dataiku DSS kaggle titanic project

Now that you've got the right format, deploy the script, call the dataset what you'd like and select to Build the new dataset now.

When you're back to the flow, click on your final dataset. On the right, click on more and Download it (in CSV).

Dataiku DSS kaggle titanic project

Step 6: It's time to submit!

Go to this page to make your submission! You can just drag and drop that CSV file and submit.

kaggle titanic alivia smith submit

I got a score of 0.71292 (so you'll probably get the same), which isn't bad: Kaggle scores Titanic submissions on accuracy, the share of passengers classified correctly. And I ranked 3720th. Out of 3922. OK, that's not great, but it leaves a lot of room to improve!

I guess I'll need to write another blogpost on how to work on making that ranking higher...

Now go ahead and write to me to let me know how you do and how you improved that score!

And if you want to see what other people from the Dataiku team did, you can check out these articles:
