Here’s a brief description of a Dataiku marketer's first Kaggle competition - and remember, this Dataiku marketer is me, and I'm no techy.
For those of you who already read my latest blog post (“My First Three Weeks as a Dataiku Marketer”), you already know that my very first interaction with the data world was the day I joined Dataiku and started the DSS tutorials. As I don’t speak Python or R (yet), I'm still only using DSS’s visual interface for my personal projects.
Three weeks ago, Dataiku announced the release of DSS V2. As you can imagine, the whole Dataiku team was super excited not only about the announcement but especially about using DSS V2. That's why on May 20th, Kenji, Dataiku’s Product Manager, gave the sales and marketing team a demo and presentation of DSS’s new look, new feel, and new functionalities. Within minutes of Kenji finishing his presentation, we decided to start testing DSS V2 for ourselves. And what better way to test a data science tool than by competing in a Kaggle competition? That's when we started the West Nile Kaggle challenge, for which the goal is to predict the presence of West Nile virus in mosquitoes across the city of Chicago.
Gathered in a conference room, in a pleasantly competitive atmosphere, we began playing with DSS. I insist on the word playing because it really did feel like a game.
At first, I was a little scared. I had datasets, I had DSS. Ok... but I had to figure out what to do next. I already knew I wasn't competing for the top rank, but I didn’t want to be the dumb kid in class either. I carefully read the Kaggle instructions, studied the datasets, and decided to go about it one step at a time.
With DSS it was really easy to import the datasets and to immediately start cleaning the data without a single line of code. The challenge offered multiple datasets but, for my first submission, I decided to use two datasets:
Thanks to the geopoint column I’d created in both datasets, I used the DSS Join Recipe to join the spray and train datasets.
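For readers curious what the Join Recipe does behind the visual interface, here is a rough sketch in pandas. The miniature dataframes and their column names are made up for illustration; the real Kaggle files are much larger:

```python
import pandas as pd

# Hypothetical miniature versions of the train and spray datasets.
train = pd.DataFrame({
    "geopoint": ["41.95,-87.80", "41.99,-87.77"],
    "NumMosquitos": [12, 3],
    "WnvPresent": [1, 0],
})
spray = pd.DataFrame({
    "geopoint": ["41.95,-87.80"],
    "SprayDate": ["2013-08-15"],
})

# A left join keeps every training row and attaches spray information
# where the geopoints match -- roughly what a visual join does.
joined = train.merge(spray, on="geopoint", how="left")
print(joined)
```

Rows with no matching spray location simply end up with missing values in the spray columns.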
Kurt, businessman by day and geek by night, proceeded to give me another helpful tip: identify and remove unused columns from the test dataset. I therefore decided to remove the address column, because it wasn't in the dataset I used to train the model, and a model cannot be applied to information it has never been trained on.
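In code, Kurt's tip amounts to keeping only the columns the model saw at training time. This is a sketch with hypothetical column names, not the actual challenge schema:

```python
import pandas as pd

# Columns the model was trained on (illustrative names).
train_features = ["Latitude", "Longitude", "NumMosquitos"]

test = pd.DataFrame({
    "Address": ["4100 N Oak Park Ave"],  # never seen at training time
    "Latitude": [41.95],
    "Longitude": [-87.80],
    "NumMosquitos": [7],
})

# Keep only the training-time columns so the schemas line up;
# anything else (like Address) is dropped before scoring.
test_aligned = test[train_features]
```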
DSS offers multiple options to build models and includes algorithms from the open source library scikit-learn. After testing a few algorithms, including logistic regression, I noticed that the AUC was higher with Random Forest. Therefore, I chose a Random Forest model to predict the appearance of the West Nile virus. I trained the model on the new dataset (the result of joining the train and spray datasets I had previously cleaned).
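Since DSS builds on scikit-learn, the comparison it runs can be sketched directly in Python. This uses a synthetic dataset as a stand-in for the cleaned, joined data, so the AUC numbers are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined train + spray dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit each candidate model and score it by AUC on held-out data.
scores = {}
for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(random_state=0)),
]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    scores[name] = roc_auc_score(y_val, proba)

print(scores)
```

Whichever model posts the higher validation AUC is the one to keep, which is exactly the comparison DSS surfaces in its interface.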
Then, I proceeded to parse the dates and to create a geopoint column on the test dataset. Finally, I applied the model to the new test dataset and submitted my prediction.
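Those two preparation steps on the test set can be sketched as follows. Again, the column names and values are hypothetical examples, not the real challenge data:

```python
import pandas as pd

test = pd.DataFrame({
    "Date": ["2014-06-11", "2014-07-02"],
    "Latitude": [41.95, 41.99],
    "Longitude": [-87.80, -87.77],
})

# Parse the date strings and pull out useful components.
test["Date"] = pd.to_datetime(test["Date"])
test["month"] = test["Date"].dt.month

# Concatenate latitude and longitude into a geopoint column,
# matching the one created on the training side.
test["geopoint"] = (
    test["Latitude"].astype(str) + "," + test["Longitude"].astype(str)
)
```

With the test schema matching the training schema, the trained model can be applied and its predictions exported for submission.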
The final workflow looked like this:
And here is my rank. For a first submission, I’m not top ranked, as expected, but I’m not the lowest ranked either. Not going to lie: I’m proud of myself!
If you want to go further, I definitely suggest you read this blog post by Henri, a Dataiku Data Scientist, who was ranked 60th (out of 411 teams) for his first submission.