Running predictive models is pretty easy in the Data Science Studio. In a few clicks, you can predict a variable from the data you have. Once you have tried it, you will never see your data the same way, and you will want to try it on everything. Find out how I used the studio to predict who survived the sinking of the Titanic.
This blog post is the second in a series. In the previous post, I imported the training dataset containing information about 891 passengers of the Titanic. The idea is to find out what sorts of people were more likely to survive the shipwreck.
The exploration of the dataset with the Studio already gave us a good overview:
Find below a screenshot of the dataset we are going to use for the predictions. Remember that we have generated a new column with the title of each passenger.
The next step is definitely the coolest part of our work. The goal is to predict the Survived variable (1=survived, 0=deceased) of a passenger. We want to find a model that gives the best predictions.
A variety of models can be used. They rely on techniques from statistics and machine learning. Basically, a model uses all the information available about the passengers to learn the characteristics of a survivor. Then, it outputs a decision (survived or not) for other passengers, whom we set aside for the purpose of testing the model.
To select the best model, we score each one. Two famous metrics are:
Before going further, let's illustrate with a basic case what a model can be. (If you already know what a predictive algorithm is, just skip this section.)
If we keep only the Survived, Age and Sex variables of our passengers, then based on what we saw in the previous post (i.e. using our own common sense), we could create a simple decision tree to predict whether a passenger survives:
If we apply this algorithm to a sample of our dataset, we get something like this:
| Age | Sex | Survived | Predicted | Result |
|---|---|---|---|---|
| 40 | male | 0 | 0 | good prediction |
| 18 | female | 0 | 1 | bad prediction |
| 24 | female | 1 | 1 | good prediction |
| 7 | female | 1 | 1 | good prediction |
| 47 | male | 1 | 0 | bad prediction |
Accuracy on this sample = 3/5 = 0.6
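To make the accuracy calculation concrete, here is a minimal Python sketch that replays the five passengers above. The rule inside `predict_survival` is an assumption: it is one simple tree that reproduces the predictions in the sample (women and young children survive), and the age threshold of 10 is purely illustrative. The function names are hypothetical too.

```python
# Five sample passengers from the table above: (age, sex, survived).
SAMPLE = [
    (40, "male", 0),
    (18, "female", 0),
    (24, "female", 1),
    (7, "female", 1),
    (47, "male", 1),
]


def predict_survival(age, sex):
    """Hypothetical decision tree: predict 1 for women and young children."""
    # Assumption: the threshold of 10 years is illustrative only.
    if sex == "female" or age < 10:
        return 1
    return 0


def accuracy(rows):
    """Share of rows where the prediction matches the actual outcome."""
    hits = sum(
        1 for age, sex, survived in rows
        if predict_survival(age, sex) == survived
    )
    return hits / len(rows)


print(accuracy(SAMPLE))  # 3 correct predictions out of 5
```

Note that this sketch would fail on a passenger whose age is missing, which is exactly the limitation of such a naive model.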
It is a really basic and naive predictive model. We can expect a poor accuracy score if we run this model on the full dataset. Plus, remember that about 20% of the values are missing for the Age variable. Because our algorithm doesn't handle this case, we already know that the accuracy cannot exceed 80%.
The Kaggle website actually published a tutorial to build a similar algorithm with Excel. Even though it is not a great way to run predictions in real life, it is a good exercise.
The Data Science Studio provides four predictive models for running a classification of this kind within our intuitive graphical interface: Logistic Regression, Random Forest, Support Vector Machine and Stochastic Gradient Descent. The studio will automatically choose the best parameters for you, which is great when you are a beginner like me.
Let's run our first predictions with the default settings and then explore what we get.
By default, three models were run with adapted settings: two with the Random Forest algorithm (with different parameters) and one with Logistic Regression. Let's explore the results of the second Random Forest model.
We get an accuracy score of 0.8324. It means that our model would correctly predict survival for 83% of our passengers (it was calculated on a different sample; there is more about that later in this post). It is quite a good result, isn't it?!
The studio automatically pre-processed the data before modelling:
The variable importance chart highlights the variables that the algorithm relied on most to decide whether each passenger survives or not. In our case, the most important ones are: male, Mr., passenger class, fare and Miss. This seems to reinforce our first analysis from the previous article.
The confusion matrix takes the understanding of the predictions one step further. It compares the distribution of the actual survival values with that of the predicted values.
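As a rough illustration of what a confusion matrix counts, here is a small Python sketch. It does not use the studio's figures: it reuses the five-passenger sample from the naive model earlier in this post, and the `confusion_matrix` helper is a hypothetical name written for this example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination for a 0/1 target."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],   # survived, predicted survived
        "false_negative": counts[(1, 0)],  # survived, predicted deceased
        "false_positive": counts[(0, 1)],  # deceased, predicted survived
        "true_negative": counts[(0, 0)],   # deceased, predicted deceased
    }


# Actual and predicted values of the five-passenger sample.
actual = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
for name, count in matrix.items():
    print(name, count)

# Accuracy is the share of predictions on the diagonal of the matrix.
correct = matrix["true_positive"] + matrix["true_negative"]
print(correct / len(actual))
```

The two "false" cells show the kind of mistakes the model makes, which a single accuracy score hides.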
To get a better model, I decided to change some of the default parameters chosen by the studio. I dropped the PassengerId column (not meaningful at all for the prediction) and dummified the passenger class variable to measure the separate impact of each value.
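For readers wondering what dropping a column and "dummifying" a variable mean in practice, here is a minimal Python sketch. It is not how the studio does it internally: the `dummify` helper is hypothetical, the three rows are made up for illustration, and `Pclass` is assumed to be the name of the passenger class column.

```python
def dummify(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: val for key, val in row.items() if key != column}
        for value in values:
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded


# Made-up rows, for illustration only.
passengers = [
    {"PassengerId": 1, "Pclass": 3, "Sex": "male"},
    {"PassengerId": 2, "Pclass": 1, "Sex": "female"},
    {"PassengerId": 3, "Pclass": 2, "Sex": "female"},
]

# Drop PassengerId: it is only an identifier and carries no signal.
without_id = [
    {key: val for key, val in row.items() if key != "PassengerId"}
    for row in passengers
]

prepared = dummify(without_id, "Pclass")
print(prepared[0])
```

Each class gets its own column, so the model can weigh first, second and third class independently instead of treating the class as a single number.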
One last thing you need to know: only 80% of the passengers in our dataset were used to train the model, and the remaining 20% were kept to test it. This technique is known as hold-out validation, the simplest form of cross-validation: we score the model on a different sample than the one used for training. It is good practice for detecting overfitting (one of the worst nightmares of a data scientist, I was told).
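The 80/20 split itself is simple to sketch in Python. The helper below is a hypothetical illustration, not the studio's implementation; the seed and the use of passenger numbers 1 to 891 as stand-ins for the rows are assumptions made for the example.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows, then hold out `test_ratio` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-ins for the 891 passengers of the training dataset.
passenger_ids = list(range(1, 892))

train, test = train_test_split(passenger_ids)
print(len(train), len(test))
```

Shuffling before splitting matters: if the file happened to be sorted in some way, taking the last 20% as-is would give a test sample that does not look like the training one.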
Once you are happy with how your model performs, you should retrain it on the full dataset to obtain the best possible model. Let's do that.
As mentioned in the previous post, the Titanic problem is part of a competition on Kaggle. Until now, we have used a dataset of 891 passengers for whom we know whether they survived or not. Kaggle provides another dataset of 418 other passengers without revealing whether they survived. The challenge is to run our model on this dataset, export the result to a csv file and then upload it to the website to get a score.
Let's download the test.csv file and upload it to the studio. In a few clicks, we apply the same modifications as before. Once done, we apply our model to the new dataset. A new column appears in the dataset: the predicted values.
For these 418 passengers, we now have predictions about the survival. This is the moment you really feel the power of Data Science :-)
The last step is to export the results as a csv file and to upload it on Kaggle to get a score.
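The studio handles the export for you, but for the curious, here is a minimal Python sketch of what such a file looks like. It assumes the two-column layout Kaggle expects for this competition (`PassengerId`, `Survived`); the `write_submission` helper is a hypothetical name and the three predictions are made up for illustration.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (passenger_id, survived) pairs as a two-column csv file."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        writer.writerow([passenger_id, survived])


# Made-up predictions, for illustration only.
predictions = [(892, 0), (893, 1), (894, 0)]

# An in-memory buffer keeps the example self-contained; to produce a real
# file, pass open("submission.csv", "w", newline="") instead.
buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```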
And the result is... 78% of our predictions were correct on this dataset. That is quite good!
Making our first predictions did not look too complicated, right? Leave a comment if something is unclear.
We could now work on getting a better predictive model, but that goes beyond the scope of this introduction.
Jeremy, a marketing guy learning Data Science.