Do you know Kaggle and its slogan “making data science a sport”?
Kaggle is a nice platform for predictive modeling competitions where the best data scientists face off against each other, trying to improve their models by 0.01 point of performance.
At Dataiku we love challenges, so we jumped into one of these contests: the Blue Book for Bulldozers.
So here is what we did.
The goal of this contest was to predict the sale price of a bulldozer based on its model, age, description and a bunch of options.
Two datasets were released:
After a short round of data exploration, we quickly understood that there were different categories of bulldozers with significant variations in price. So our main idea was to use a random forest on each category of bulldozer.
A random what? Here is a great explanation of the random forest concept.
The first difficulty was the data itself: 53 features, with a lot of missing or erroneous values. Real-life data science... For example, some machines were sold before their year of manufacture.
Others were sold more than once, so a machineID could appear many times in the train set, and the description could change between these different rows.
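Both problems are easy to spot with pandas. Here is a minimal sketch on a toy table; the column names (`MachineID`, `YearMade`, `saledate`, `SalePrice`) and the values are assumptions for illustration, not the contest data.

```python
import pandas as pd

# Toy stand-in for the training set; column names and values are assumptions.
sales = pd.DataFrame({
    "MachineID": [1, 1, 2, 3],
    "YearMade": [2004, 2004, 2010, 1998],
    "saledate": pd.to_datetime(
        ["2006-03-01", "2009-07-15", "2008-05-20", "2005-11-02"]
    ),
    "SalePrice": [66000, 41000, 30000, 12500],
})

# Rows where the sale year comes before the year of manufacture.
sold_before_made = sales[sales["saledate"].dt.year < sales["YearMade"]]

# Machines that appear more than once in the table.
sale_counts = sales["MachineID"].value_counts()
resold_ids = sale_counts[sale_counts > 1].index.tolist()
```

In this toy table, machine 2 is flagged as sold before it was made, and machine 1 shows up as a resale.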
Kaggle provided us with a machine appendix containing the “real” value of each feature for each machine, but it turned out that replacing the declared values with the true ones was not a good idea. Indeed, we think that each seller could choose whether or not to declare some characteristics on the auction website, and that this choice had an impact on the price.
The second difficulty was the volatility of some models. We spent a lot of time trying to understand how a machine could be sold twice in the same year, sometimes only a few days apart, at two completely different prices. In the end, we found that this was not easily predictable. In financial theory, the model used to describe this kind of randomness is called a random walk.
We tried a lot of things:
A glimpse of our code:
Pandas is very useful to select some values, for example to select only the years we needed:
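As a rough illustration of that kind of selection, here is a sketch that keeps only the sales within a range of years. The column names and the year bounds are hypothetical, chosen just for the example.

```python
import pandas as pd

# Toy data; the column names are assumptions.
sales = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["1995-06-01", "2004-02-10", "2009-09-30", "2011-12-05"]
    ),
    "SalePrice": [9500, 27000, 54000, 31000],
})

# Hypothetical bounds: keep sales from 2004 to 2011, both included.
first_year, last_year = 2004, 2011
sale_year = sales["saledate"].dt.year
recent = sales[sale_year.between(first_year, last_year)]
```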
We built one model per category:
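A minimal sketch of the idea: group the training rows by category, fit one random forest per group, then send each row to predict to the model of its own category. The `ProductGroup` column, the feature names and the forest settings are assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy training data; column names and values are assumptions.
train = pd.DataFrame({
    "ProductGroup": ["TTT", "TTT", "TTT", "WL", "WL", "WL"],
    "YearMade": [1995, 2000, 2005, 1998, 2003, 2008],
    "SaleYear": [2006, 2007, 2008, 2006, 2007, 2009],
    "SalePrice": [20000, 35000, 60000, 15000, 28000, 50000],
})
features = ["YearMade", "SaleYear"]

# Fit one forest per category and keep them in a dictionary.
models = {}
for category, group in train.groupby("ProductGroup"):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(group[features], group["SalePrice"])
    models[category] = model


def predict_by_category(frame):
    """Predict each row with the model trained on its own category."""
    preds = pd.Series(index=frame.index, dtype=float)
    for category, group in frame.groupby("ProductGroup"):
        preds.loc[group.index] = models[category].predict(group[features])
    return preds


predictions = predict_by_category(train)
```

This sketch assumes every category to predict was seen during training; a category missing from `models` would raise a `KeyError`.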
Scikit-learn provides a large set of machine learning models that are very fast and simple to use. The next 3 lines show how we defined our model, trained it and got the prediction:
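Here is a minimal sketch of that define, train, predict pattern with `RandomForestRegressor`. It runs on synthetic data, and the hyperparameters shown are placeholders, not the values we used in the contest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real features and prices.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 3))
y_train = 1000 * X_train[:, 0] + 500 * X_train[:, 1] + rng.normal(0, 50, size=200)
X_test = rng.uniform(0, 10, size=(20, 3))

# Define the model, train it, get the predictions.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```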
We did a grid search to compute the best parameters of the random forest. More information about these parameters is available in the documentation of the random forest regressor.
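One way to run such a search is scikit-learn's `GridSearchCV`. The sketch below uses synthetic data, and the parameter grid, the number of folds and the scoring metric are all assumptions for illustration, not the settings from the contest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for one category of bulldozers.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, size=120)

# Hypothetical grid: every combination is evaluated by cross-validation.
param_grid = {
    "n_estimators": [20, 50],
    "max_features": [1, 3],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_    # best combination found
best_model = search.best_estimator_  # forest refit on all the data
```

The cost grows with the product of the grid sizes times the number of folds, so it pays to keep the grid small at first and refine it around the best values.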
As a last step, we ran a post-processing pass where we kept the minimum and maximum observed price for each model (there are about 4,000 different models of bulldozer in the data) and replaced our prediction with the nearest bound whenever it fell outside this range.
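A minimal sketch of this clipping step with pandas. The column names (`ModelID`, `SalePrice`, `predicted`) and the fallback for models never seen in training are assumptions for illustration.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
train = pd.DataFrame({
    "ModelID": [10, 10, 10, 20, 20],
    "SalePrice": [20000, 25000, 30000, 50000, 70000],
})
test = pd.DataFrame({
    "ModelID": [10, 10, 20, 30],
    "predicted": [15000, 27000, 90000, 40000],
})

# Minimum and maximum observed price for each bulldozer model.
bounds = train.groupby("ModelID")["SalePrice"].agg(["min", "max"])
lower = test["ModelID"].map(bounds["min"])
upper = test["ModelID"].map(bounds["max"])

# Models absent from the training data have no bounds (NaN);
# for those we keep the raw prediction unchanged.
test["clipped"] = test["predicted"].clip(
    lower=lower.fillna(test["predicted"]),
    upper=upper.fillna(test["predicted"]),
)
```

Here the first prediction is raised to 20000, the third is lowered to 70000, and the other two are left as they are.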
Finally, we were very happy to reach the 20th place on the final leaderboard (top 5%)!
So if you want to practice data science, we recommend trying Kaggle. It is fun and you’ll learn a lot. ;)