Caterpillar Challenge

This is a notebook on the Kaggle Caterpillar challenge: https://www.kaggle.com/c/caterpillar-tube-pricing

The goal is to be able to predict the price of a tube.

This notebook is designed as a template showing what is possible with pandas and scikit-learn. Do not hesitate to explore further by yourself!

Loading Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import random as rd
import matplotlib.pyplot as plt  # needed by the plotting cells below

# scikit learn 
from sklearn import cross_validation
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestRegressor
In [2]:
# if you work in Dataiku DSS
'''
import dataiku
train = dataiku.Dataset("train_set").get_dataframe()
test = dataiku.Dataset("test_set").get_dataframe()
specs =  dataiku.Dataset('specs').get_dataframe()
tube =  dataiku.Dataset("tubes").get_dataframe()
bills =  dataiku.Dataset('bill_of_materials').get_dataframe()
'''
In [3]:
# if you work without Dataiku DSS
train = pd.read_csv("/Users/pgutierrez/Downloads/competition_data/train_set.csv")
test = pd.read_csv("/Users/pgutierrez/Downloads/competition_data/test_set.csv")
specs =  pd.read_csv('/Users/pgutierrez/Downloads/competition_data/specs.csv')
tube =  pd.read_csv("/Users/pgutierrez/Downloads/competition_data/tube.csv")
bills =  pd.read_csv('/Users/pgutierrez/Downloads/competition_data/bill_of_materials.csv')

A first look at the data

In [4]:
# head() shows the first rows of the dataset
train.head(20)
Out[4]:
tube_assembly_id supplier quote_date annual_usage min_order_quantity bracket_pricing quantity cost
0 TA-00002 S-0066 2013-07-07 0 0 Yes 1 21.905933
1 TA-00002 S-0066 2013-07-07 0 0 Yes 2 12.341214
2 TA-00002 S-0066 2013-07-07 0 0 Yes 5 6.601826
3 TA-00002 S-0066 2013-07-07 0 0 Yes 10 4.687770
4 TA-00002 S-0066 2013-07-07 0 0 Yes 25 3.541561
5 TA-00002 S-0066 2013-07-07 0 0 Yes 50 3.224406
6 TA-00002 S-0066 2013-07-07 0 0 Yes 100 3.082521
7 TA-00002 S-0066 2013-07-07 0 0 Yes 250 2.999060
8 TA-00004 S-0066 2013-07-07 0 0 Yes 1 21.972702
9 TA-00004 S-0066 2013-07-07 0 0 Yes 2 12.407983
10 TA-00004 S-0066 2013-07-07 0 0 Yes 5 6.668596
11 TA-00004 S-0066 2013-07-07 0 0 Yes 10 4.754539
12 TA-00004 S-0066 2013-07-07 0 0 Yes 25 3.608331
13 TA-00004 S-0066 2013-07-07 0 0 Yes 50 3.291176
14 TA-00004 S-0066 2013-07-07 0 0 Yes 100 3.149291
15 TA-00004 S-0066 2013-07-07 0 0 Yes 250 3.065829
16 TA-00005 S-0066 2013-09-01 0 0 Yes 1 28.374220
17 TA-00005 S-0066 2013-09-01 0 0 Yes 2 16.514303
18 TA-00005 S-0066 2013-09-01 0 0 Yes 5 9.397795
19 TA-00005 S-0066 2013-09-01 0 0 Yes 10 7.027481
In [5]:
test.head(10)
Out[5]:
id tube_assembly_id supplier quote_date annual_usage min_order_quantity bracket_pricing quantity
0 1 TA-00001 S-0066 2013-06-23 0 0 Yes 1
1 2 TA-00001 S-0066 2013-06-23 0 0 Yes 2
2 3 TA-00001 S-0066 2013-06-23 0 0 Yes 5
3 4 TA-00001 S-0066 2013-06-23 0 0 Yes 10
4 5 TA-00001 S-0066 2013-06-23 0 0 Yes 25
5 6 TA-00001 S-0066 2013-06-23 0 0 Yes 50
6 7 TA-00001 S-0066 2013-06-23 0 0 Yes 100
7 8 TA-00001 S-0066 2013-06-23 0 0 Yes 250
8 9 TA-00003 S-0066 2013-07-07 0 0 Yes 1
9 10 TA-00003 S-0066 2013-07-07 0 0 Yes 2

What can we derive from this?

  • the target cost seems to be associated with a tube_assembly_id and a quantity.
    -> the higher the quantity, the lower the cost: it must be the cost per unit, which decreases with quantity (volume discounts).
    -> the train / test split seems to have been done by tube_assembly_id: very important for our cross-validation scheme

  • there also seems to be a dependence on the date, but the date was apparently not used to split train and test.

In [6]:
# let's verify the train test split on tube_assembly_id : 
trainids = set(train['tube_assembly_id'].unique())
testids = set(test['tube_assembly_id'].unique())
trainids.intersection(testids)
Out[6]:
set()

The intersection of the two sets is empty, so no tube_assembly_id is shared between train and test.

Let's have a look at the other datasets.

In [7]:
specs.head()
Out[7]:
tube_assembly_id spec1 spec2 spec3 spec4 spec5 spec6 spec7 spec8 spec9 spec10
0 TA-00001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 TA-00002 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 TA-00003 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 TA-00004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 TA-00005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [8]:
# dropna() drops rows with missing values; a subset of columns can be given.
specs.dropna(subset=['spec1']).head()
Out[8]:
tube_assembly_id spec1 spec2 spec3 spec4 spec5 spec6 spec7 spec8 spec9 spec10
12 TA-00013 SP-0004 SP-0069 SP-0080 NaN NaN NaN NaN NaN NaN NaN
14 TA-00015 SP-0063 SP-0069 SP-0080 NaN NaN NaN NaN NaN NaN NaN
17 TA-00018 SP-0007 SP-0058 SP-0070 SP-0080 NaN NaN NaN NaN NaN NaN
18 TA-00019 SP-0080 NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 TA-00020 SP-0057 SP-0067 SP-0080 NaN NaN NaN NaN NaN NaN NaN
In [9]:
tube.head()
Out[9]:
tube_assembly_id material_id diameter wall length num_bends bend_radius end_a_1x end_a_2x end_x_1x end_x_2x end_a end_x num_boss num_bracket other
0 TA-00001 SP-0035 12.70 1.65 164 5 38.10 N N N N EF-003 EF-003 0 0 0
1 TA-00002 SP-0019 6.35 0.71 137 8 19.05 N N N N EF-008 EF-008 0 0 0
2 TA-00003 SP-0019 6.35 0.71 127 7 19.05 N N N N EF-008 EF-008 0 0 0
3 TA-00004 SP-0019 6.35 0.71 137 9 19.05 N N N N EF-008 EF-008 0 0 0
4 TA-00005 SP-0029 19.05 1.24 109 4 50.80 N N N N EF-003 EF-003 0 0 0
In [10]:
bills.head()
Out[10]:
tube_assembly_id component_id_1 quantity_1 component_id_2 quantity_2 component_id_3 quantity_3 component_id_4 quantity_4 component_id_5 quantity_5 component_id_6 quantity_6 component_id_7 quantity_7 component_id_8 quantity_8
0 TA-00001 C-1622 2 C-1629 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 TA-00002 C-1312 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 TA-00003 C-1312 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 TA-00004 C-1312 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 TA-00005 C-1624 1 C-1631 1 C-1641 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

basic information & aggregations

In [11]:
# what is the size of the datasets ?
print "train : ", train.shape
print "test : ", test.shape
train :  (30213, 8)
test :  (30235, 8)

In [12]:
# checking distributions of categories : 
print train['bracket_pricing'].unique() # list of unique elements
['Yes' 'No']

In [13]:
print train.groupby('bracket_pricing').size() # number of lines for each element
bracket_pricing
No      3930
Yes    26283
dtype: int64

**For more information on how to use groupby:** http://pandas.pydata.org/pandas-docs/stable/groupby.html
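
For instance, a minimal groupby sketch (illustrative, not run in the original notebook): mean, max and count of cost per supplier on the raw train set.

agg = train.groupby('supplier')['cost'].agg(['mean', 'max', 'count'])
print(agg.head())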

In [14]:
# let's answer the question :
# do we always have 8 quantity lines for each tube ?

tmp = train.groupby("tube_assembly_id").size()  # produces a Series of sizes
tmp = tmp.reset_index()                         # from the Series, generate a dataframe with two columns
tmp.columns = ['tube_assembly_id','thesize']    # renaming of the columns

# now let's have a look at the size distribution :
tmp2 = tmp['thesize'].value_counts() # value_counts() = groupby + size + sort
tmp2 = tmp2.reset_index()
tmp2.columns = ['thesize','nb_occurences']
tmp2['percent'] = tmp2['nb_occurences']/float(tmp2['nb_occurences'].sum())
tmp2
Out[14]:
thesize nb_occurences percent
0 1 4300 0.485601
1 8 2201 0.248560
2 3 1020 0.115189
3 2 509 0.057482
4 4 304 0.034331
5 5 252 0.028458
6 6 169 0.019085
7 7 86 0.009712
8 9 11 0.001242
9 14 1 0.000113
10 12 1 0.000113
11 10 1 0.000113

In the end, tubes with 8 quote lines represent approximately 25% of the tubes in train, while almost half appear only once.

basic plots

In [15]:
# very skewed distribution of cost
train['cost'].hist()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x109226d50>
In [16]:
# using matplotlib with seaborn gives better visual results 
plt.scatter(train['quantity'],np.log(train['cost']),alpha =0.5)
Out[16]:
<matplotlib.collections.PathCollection at 0x1091daa10>
In [17]:
tubes  = list(train['tube_assembly_id'].unique())
for tub in tubes[:30] :
    tmp = train[train['tube_assembly_id']==tub].copy()  # .copy() avoids the SettingWithCopyWarning
    tmp['cost'] = tmp['cost']/float(tmp['cost'].max())
    plt.plot(tmp['quantity'], tmp['cost'])

  • A lot of the "discounting" seems to be done the same way (the curves have similar shapes), but there are many outliers.
  • as a result, a feature counting how many quantity lines a tube has in the dataset may be of interest (a GBT could use it to learn the curve shapes)
In [18]:
tmp = train.groupby('tube_assembly_id').size().reset_index()
tmp.columns = ['tube_assembly_id','thesize']
tubes  = list(tmp[tmp['thesize']==8]['tube_assembly_id'].unique())

for tub in tubes[:30] :
    tmp = train[train['tube_assembly_id']==tub].copy()  # .copy() avoids the SettingWithCopyWarning
    tmp['cost'] = tmp['cost']/float(tmp['cost'].max())
    plt.plot(tmp['quantity'], tmp['cost'])

In [19]:
tmp = train.groupby('tube_assembly_id').size().reset_index()
tmp.columns = ['tube_assembly_id','thesize']
tubes  = list(tmp[tmp['thesize']==3]['tube_assembly_id'].unique())

for tub in tubes[:30] :
    tmp = train[train['tube_assembly_id']==tub].copy()
    tmp['cost'] = tmp['cost']/float(tmp['cost'].max())
    plt.plot(tmp['quantity'], tmp['cost'])
  • it is much harder to say anything about tubes that appear only 3 times.
  • though in a two-layer approach, it would be interesting to use the predictions for the other quantities as features. Good examples of multi-layer (stacking) approaches can be found here:
  • https://medium.com/@chris_bour/6-tricks-i-learned-from-the-otto-kaggle-challenge-a9299378cd61
  • https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov

Reshaping and Feature Engineering with pandas

merging

see http://pandas.pydata.org/pandas-docs/stable/merging.html for more information about merging in pandas

In [20]:
# merge joins two dataframes on a key : here a left join on tube_assembly_id
train = pd.merge(train,tube,on = 'tube_assembly_id', how = 'left')
test = pd.merge(test,tube,on = 'tube_assembly_id', how = 'left')
In [21]:
# same join for the specs
train = pd.merge(train,specs,on = 'tube_assembly_id', how = 'left')
test = pd.merge(test,specs,on = 'tube_assembly_id', how = 'left')
In [22]:
# and for the bill of materials
train = pd.merge(train,bills,on = 'tube_assembly_id', how = 'left')
test = pd.merge(test,bills,on = 'tube_assembly_id', how = 'left')
In [23]:
# let's check the result of the merges
train.head()
Out[23]:
tube_assembly_id supplier quote_date annual_usage min_order_quantity bracket_pricing quantity cost material_id diameter ... component_id_4 quantity_4 component_id_5 quantity_5 component_id_6 quantity_6 component_id_7 quantity_7 component_id_8 quantity_8
0 TA-00002 S-0066 2013-07-07 0 0 Yes 1 21.905933 SP-0019 6.35 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 TA-00002 S-0066 2013-07-07 0 0 Yes 2 12.341214 SP-0019 6.35 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 TA-00002 S-0066 2013-07-07 0 0 Yes 5 6.601826 SP-0019 6.35 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 TA-00002 S-0066 2013-07-07 0 0 Yes 10 4.687770 SP-0019 6.35 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 TA-00002 S-0066 2013-07-07 0 0 Yes 25 3.541561 SP-0019 6.35 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 49 columns

Dealing with dates

In [24]:
# parsing of the date
train['quote_date'] = pd.to_datetime(train['quote_date'],format="%Y-%m-%d")
test['quote_date'] = pd.to_datetime( test['quote_date'],format="%Y-%m-%d")
In [25]:
# does the day of week have an influence on the price ?
train['dayofweek'] = train['quote_date'].map(lambda x : x.dayofweek)
test['dayofweek'] = test['quote_date'].map(lambda x : x.dayofweek)

# week of year ?
train['weekofyear'] = train['quote_date'].map(lambda x : x.weekofyear)
test['weekofyear'] = test['quote_date'].map(lambda x : x.weekofyear)
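
Year and month could be extracted the same way. A sketch (hypothetical extra features, left commented out; if you add such columns, remember to also list them in numerical_columns further down):

# sketch: more calendar features
# train['year'] = train['quote_date'].map(lambda x : x.year)
# test['year'] = test['quote_date'].map(lambda x : x.year)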
In [26]:
# discard date variable
train = train.drop('quote_date',axis=1)
test = test.drop('quote_date',axis=1)

Misc

In [27]:
# functions like sum() can also be applied row-wise (axis=1) :
quantvars = ['quantity_'+str(i) for i in range(1,9)]
train['total_quantity'] = train[quantvars].sum(axis = 1)
test['total_quantity'] = test[quantvars].sum(axis = 1)
In [28]:
# more use of map : dummify the yes/no columns
train['bracket_pricing'] = train['bracket_pricing'].map(lambda x : 1 if x == 'Yes' else 0)
test['bracket_pricing'] = test['bracket_pricing'].map(lambda x : 1 if x == 'Yes' else 0)

for col in ['end_a_1x', 'end_a_2x', 'end_x_1x', 'end_x_2x'] :
    train[col] = train[col].map(lambda x : 1 if x == 'Y' else 0)
    test[col] = test[col].map(lambda x : 1 if x == 'Y' else 0)
In [29]:
train['end_a_1x'].unique()
Out[29]:
array([0, 1])
In [30]:
# let's add a feature counting how many times each tube_assembly_id appears in the dataset
tmp  = train.groupby('tube_assembly_id').size().reset_index()
tmp.columns = ['tube_assembly_id','nb_appearance']
train = pd.merge(train, tmp, how='left', on = 'tube_assembly_id')

tmp2  = test.groupby('tube_assembly_id').size().reset_index()
tmp2.columns = ['tube_assembly_id','nb_appearance']
test = pd.merge(test, tmp2, how='left', on = 'tube_assembly_id')
In [31]:
tmp['nb_appearance'].hist()
plt.figure()
tmp2['nb_appearance'].hist()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1085687d0>

These two distributions seem similar, so we can use this feature. Remember that if a feature's distribution is very different between the train set and the test set, your model may not generalize well.
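
A quick numeric comparison backs this up (a small sketch reusing the tmp / tmp2 tables computed above):

print(tmp['nb_appearance'].describe())
print(tmp2['nb_appearance'].describe())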

Scikit learn

For the machine learning part we will use scikit-learn: http://scikit-learn.org/stable/documentation.html. Their User Guide is a machine learning course in itself; it is really worth reading.

Scikit-learn class scheme :

  • fit : learn from the given dataset (supervised or not). ex : KMeans, RandomForest or PCA

  • transform : if a transformation was learned, transform applies it to the given dataset. ex : apply PCA to project into the new space

  • predict : if the fit was supervised, predicts the labels of a dataset.

  • other methods include :
    • predict_proba : for classification, e.g. when the metric is AUC
    • fit_transform : concatenation of fit and transform

Beware : scikit-learn takes numpy arrays as input, not pandas dataframes (which can lead to bugs). Use mydataset.values to do the conversion.
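
To make the scheme concrete, here is a tiny toy illustration (sketch only; the data is made up for this example):

toy_X = np.array([[0.], [1.], [2.], [3.]])
toy_y = np.array([0., 1., 2., 3.])
toy_model = RandomForestRegressor(n_estimators=10, random_state=0)
toy_model.fit(toy_X, toy_y)        # fit : learn from the data
print(toy_model.predict([[1.5]]))  # predict : apply what was learned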

A first Model with scikit learn

Handling missing values

Scikit-learn does not handle missing values, so you have to take care of them before using any scikit-learn class.

In [32]:
# feature types definition

numerical_columns = ['annual_usage','min_order_quantity','quantity','wall','length','num_bends','bend_radius'
                     , 'num_boss', 'num_bracket', 'other' ,'dayofweek', 'weekofyear','total_quantity', 'nb_appearance']

# add the columns already dummified above
numerical_columns = numerical_columns + ['bracket_pricing',u'end_a_1x', u'end_a_2x', u'end_x_1x', u'end_x_2x']
# add quantities
for i in range(1,9) :
    numerical_columns.append('quantity_'+str(i))

categorical_columns = ['supplier','material_id','diameter','end_a','end_x']
# add component ids
for i in range(1,9) :
    categorical_columns.append('component_id_'+str(i))
# add specs
for i in range(1,11) :
    categorical_columns.append('spec'+str(i))
In [33]:
# handling numerical missing values
train[numerical_columns]= train[numerical_columns].fillna(-9999)
test[numerical_columns]= test[numerical_columns].fillna(-9999)
# we use -9999 for tree-based methods because they will be able to separate it out, provided the trees are deep enough.
# for linear models or neural networks, imputing the mean or median would be better.


# handling categorical missing values
train[categorical_columns]= train[categorical_columns].fillna('unknown')
test[categorical_columns]= test[categorical_columns].fillna('unknown')
# we simply treat missing values as an additional category.
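
For linear models or neural networks, a median-imputation alternative would look like this (sketch, left commented out; do not combine it with the -9999 filling above):

# medians = train[numerical_columns].median()
# train[numerical_columns] = train[numerical_columns].fillna(medians)
# test[numerical_columns] = test[numerical_columns].fillna(medians)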

A first look at the evaluation metric

In [46]:
# here is the evaluation metric for this Kaggle challenge
def RMSLE(pred, real) :
    tmp = np.log(1+pred) - np.log(1+real)
    return np.sqrt(np.mean(tmp*tmp))

# and here is the RMSE
def RMSE(pred, real) :
    tmp = pred - real
    return np.sqrt(np.mean(tmp*tmp))

Note that RMSLE is very similar to RMSE, except that you take the difference of the logs instead of the raw difference before squaring.

As a result, we will transform the target to log(cost + 1) and apply the inverse transformation x -> exp(x) - 1 to the predictions. Since most models minimize the RMSE by design, this transformation lets us optimize the RMSLE directly!
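
A quick sanity check of this equivalence (toy numbers made up for the illustration):

pred = np.array([10.0, 20.0, 30.0])
real = np.array([12.0, 18.0, 33.0])
print(RMSE(np.log(1 + pred), np.log(1 + real)))  # same value...
print(RMSLE(pred, real))                         # ...as this one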

In [35]:
train['cost'] = np.log(train['cost']+1)

Separating the dataset into train and validation

The number of submissions in Kaggle challenges is often limited to a few per day. As a result, you often have to rely on your own cross-validation scheme. The first thing to do is to separate a train and a validation set.

In [36]:
# let's separate labels from the rest
y = train["cost"].values
X = train.drop(['cost','tube_assembly_id'],axis=1)
tube_ids = train['tube_assembly_id']

X_test = test.drop(['tube_assembly_id','id'],axis=1)
tube_ids_test = test['tube_assembly_id']
ids_test = test['id'] # needed for kaggle submission.
In [37]:
# scikit-learn provides different types of cross-validation functions

# this is a simple holdout scheme : 80% for train and 20% for validation
X_train, X_valid, y_train, y_valid = cross_validation.train_test_split(X, y, test_size=0.2, random_state=0)
print X_train.shape, X_valid.shape

# and this is a 5-fold cross validation example
kf = cross_validation.KFold(X.shape[0], n_folds=5, random_state=0)
for train_index, valid_index in kf:
    print("TRAIN:", len(train_index), "TEST:", len(valid_index))
    X_train, X_valid = X.values[train_index], X.values[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    # do model and evaluation here

# do not forget to set your random state !
(24170, 50) (6043, 50)
('TRAIN:', 24170, 'TEST:', 6043)
('TRAIN:', 24170, 'TEST:', 6043)
('TRAIN:', 24170, 'TEST:', 6043)
('TRAIN:', 24171, 'TEST:', 6042)
('TRAIN:', 24171, 'TEST:', 6042)

In [38]:
# But in the end this is not the best evaluation scheme : we have several lines per tube_assembly_id,
# so a random row split leaks tubes across train and validation. Like the organizers, we split on tube_assembly_id :

unique_tubes = np.unique(tube_ids)
train_tube_ids = rd.sample(unique_tubes, int(len(unique_tubes)*0.8))

X_train = train[train['tube_assembly_id'].isin(train_tube_ids)].drop(['cost','tube_assembly_id'],axis=1)
X_valid = train[~train['tube_assembly_id'].isin(train_tube_ids)].drop(['cost','tube_assembly_id'],axis=1)

y_train = train[train['tube_assembly_id'].isin(train_tube_ids)]["cost"].values
y_valid = train[~train['tube_assembly_id'].isin(train_tube_ids)]["cost"].values

print X_train.shape
print X_valid.shape
(23927, 50)
(6286, 50)
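
Note that if your scikit-learn version provides it (it appeared in 0.17), LabelKFold (renamed GroupKFold in later versions) implements this grouped splitting for you. A sketch:

from sklearn.cross_validation import LabelKFold
lkf = LabelKFold(tube_ids.values, n_folds=5)  # each tube_assembly_id falls in exactly one fold
for train_index, valid_index in lkf:
    pass  # fit and evaluate your model here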

Rescaling

In [39]:
# useful for certain types of ML models such as logistic regression / linear regression
# in this challenge we don't really need rescaling (tree-based methods don't care)
exvars = ["annual_usage",'length']
scaler= StandardScaler() # class definition

tmp = X_train[exvars].values
X_train[exvars] = scaler.fit_transform(tmp)
X_valid[exvars] = scaler.transform(X_valid[exvars])
X_test[exvars] = scaler.transform(X_test[exvars])

When using scikit-learn for transformations :

  • use the "fit" method to learn the transformation on the train set
  • apply it with the "transform" method on the train, validation and test sets.

Dealing with categories with scikit learn

In [40]:
# this does a label encoding : 
# every category is mapped to an integer value
# it is OK for tree-based methods, provided the trees go deep enough.

for myvar in categorical_columns :
    lbl = LabelEncoder()
    lbl.fit(list(train[myvar].unique()) + list(test[myvar].unique()))
    X_train[myvar] = lbl.transform(X_train[myvar])
    X_test[myvar] = lbl.transform(X_test[myvar])
    X_valid[myvar] = lbl.transform(X_valid[myvar])
In [41]:
# to dummify, i.e. generate one 0/1 column per category,
# try using instead : 
encoder = OneHotEncoder()
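
A sketch of how it could be applied on top of the label encoding above (the output is a sparse matrix, to be combined with the numerical features):

# fit on train, validation and test together so every category gets a column
all_cats = np.vstack([X_train[categorical_columns].values,
                      X_valid[categorical_columns].values,
                      X_test[categorical_columns].values])
encoder.fit(all_cats)
onehot_train = encoder.transform(X_train[categorical_columns].values)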

Improvements for specs and bills :

  1. better encoding of the specs :
  • recreate a column with all the specs concatenated with separator ' '
  • or recreate one column per spec, set to 1 if the tube has the corresponding spec
  2. do the same for the bills, but also use the quantity information !

Use : sklearn.feature_extraction.DictVectorizer
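
For instance, a hedged sketch of the DictVectorizer idea on the specs: one 0/1 column per spec value, skipping the 'unknown' placeholders introduced by the fillna step (the helper specs_to_dict is hypothetical, written for this illustration):

spec_cols = ['spec' + str(i) for i in range(1, 11)]

def specs_to_dict(row):
    # e.g. {'SP-0069': 1, 'SP-0080': 1}
    return dict((v, 1) for v in row if v != 'unknown')

dv = DictVectorizer(sparse=False)
spec_features = dv.fit_transform(list(train[spec_cols].apply(specs_to_dict, axis=1)))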

Using a Machine Learning model with scikit learn

In [42]:
# declare the model : here a random forest regressor
clf = RandomForestRegressor(n_estimators = 100, random_state = 0)
# its hyper-parameters are what you tune !
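
For the tuning itself, a hedged grid-search sketch (beware: plain GridSearchCV splits by rows, which leaks tubes across folds here; a split grouped by tube_assembly_id as above would be more honest):

from sklearn.grid_search import GridSearchCV  # module name in this scikit-learn era

params = {'n_estimators': [50, 100, 200], 'max_features': ['sqrt', 'auto']}
search = GridSearchCV(RandomForestRegressor(random_state=0), params, cv=3)
# search.fit(X_train, y_train)
# print(search.best_params_)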
In [43]:
# fitting
clf.fit(X_train, y_train)
Out[43]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)
In [44]:
# predicting valid
preds_valid = clf.predict(X_valid)
In [47]:
# and show our score
score = RMSLE(np.exp(preds_valid) - 1 , np.exp(y_valid) - 1)
print 'cross validation score : ', score
# note the inverse transformation exp(x) - 1 ;
# taking the RMSE directly on the log-transformed values would have given the same score
cross validation score :  0.286738169705

Is that all? Declare a model, call fit and then predict? Yes, and it will almost always be this easy.

Submit to kaggle

In [53]:
preds = clf.predict(X_test)
preds = np.exp(preds) - 1  # inverse of the log(cost + 1) transformation (beware the sign!)
In [54]:
preds = pd.DataFrame({"id": ids_test, "cost": preds})
preds = preds[['id','cost']]
preds.to_csv('benchmark.csv', index=False)

Of course, retraining on all the available data (train + validation) may make your model more precise.

That's it! It's up to you to :

  • add new features from the other datasets
  • change your model and optimize its hyper-parameters
  • climb up the leaderboard!