Use Theano and Tensorflow with CUDA in Dataiku DSS

October 11, 2016

Do you want to train deep neural networks? Then you need to use a numerical computation library! If you want to find out what deep learning is, then read my previous introduction.

We focus here on the integration of two popular numerical computation libraries with Dataiku DSS:

  • Theano, originally developed by the machine learning group at the Université de Montréal.
  • Tensorflow, originally developed by the research group at Google Brain and now open sourced.

Since training neural networks requires a large number of numerical computations, it's essential to install one of these libraries before you start any serious deep learning. You will probably also want to use a GPU to speed up the calculations further.

Note that DSS now includes H2O as a machine learning engine, offering deep neural networks as one of the available algorithms.

GPU support prerequisites

Both Tensorflow and Theano support CUDA (a computing platform for NVIDIA GPUs). Your machine must have an NVIDIA GPU for this to work.

  1. Install CUDA from NVIDIA.
  2. Register as a CUDA developer and install cuDNN.

Note

Check the Tensorflow documentation for the current requirements; at the time of writing, the best-supported combination is CUDA v7.5 and cuDNN v5.
Make a note of where you installed CUDA and cuDNN, since you will need these paths in the Dataiku setup.
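
As a quick sanity check before going further, you can confirm that the files this guide relies on are where you expect them. The sketch below assumes that CUDA lives in /usr/local/cuda (the path used in the Dataiku setup section) and that the cuDNN files were copied into the same directory tree. check_cuda_install is a hypothetical helper name, not something shipped with CUDA.

```shell
#!/bin/sh
# Hypothetical helper: report whether a CUDA root contains the compiler,
# the runtime library and the cuDNN header.
# Assumption: cuDNN was copied into the CUDA directory tree.
check_cuda_install() {
    cuda_root="${1:-/usr/local/cuda}"
    status=0
    for f in bin/nvcc lib64/libcudart.so include/cudnn.h; do
        if [ -e "$cuda_root/$f" ]; then
            echo "found:   $cuda_root/$f"
        else
            echo "missing: $cuda_root/$f"
            status=1
        fi
    done
    return "$status"
}

# CUDA_ROOT is an illustrative variable; set it if you installed elsewhere.
check_cuda_install "${CUDA_ROOT:-/usr/local/cuda}" \
    || echo "CUDA or cuDNN looks incomplete under this root"
```

If any line says "missing", revisit the corresponding install step or point the helper at the directory you actually installed into.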

Theano

To install Theano in the Dataiku python virtual environment, run:

DATAIKU_DATA_DIR/bin/pip install Theano

Don't forget to set up your Theano configuration file (~/.theanorc by default) with the options you want, and make sure the dependencies required for your distribution are installed.
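
To make the configuration step concrete, here is a minimal sketch that writes a small Theano configuration file. The option values are assumptions chosen to match the rest of this guide (float32, the GPU device, CUDA in /usr/local/cuda); adjust them to your setup. write_theanorc is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal Theano config to the given path,
# refusing to overwrite a file that already exists.
write_theanorc() {
    target="$1"
    if [ -e "$target" ]; then
        echo "$target already exists, leaving it untouched" >&2
        return 1
    fi
    # Assumed values: float32 on the GPU, CUDA installed in /usr/local/cuda.
    cat > "$target" <<'EOF'
[global]
floatX = float32
device = gpu

[cuda]
root = /usr/local/cuda
EOF
}
```

For example, write_theanorc "$HOME/.theanorc" creates the file in Theano's default location. Theano reads that file for the user it runs as, so the home directory should be the one of the account that runs Dataiku.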

Tensorflow

To install Tensorflow in the Dataiku python virtual environment, run:

DATAIKU_DATA_DIR/bin/pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl

This installs the latest version as of 2016/10/24. For newer releases, or for builds targeting other architectures, check the Tensorflow documentation.
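
If you need a different build, it helps to know how to read the wheel file name, which follows the standard naming scheme for Python wheels: package, version, Python tag, ABI tag and platform tag, separated by dashes. The sketch below splits the name used above into those parts; describe_wheel is a hypothetical helper, written only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split a wheel file name (or URL) into its tags.
# Assumes the name has no optional build tag, as is the case here.
describe_wheel() {
    name=$(basename "$1" .whl)
    old_ifs=$IFS
    IFS=-
    # Word-split the name on dashes into the positional parameters.
    set -- $name
    IFS=$old_ifs
    echo "package=$1 version=$2 python=$3 abi=$4 platform=$5"
}

describe_wheel "https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl"
# prints: package=tensorflow version=0.11.0rc1 python=cp27 abi=none platform=linux_x86_64
```

Here cp27 means CPython 2.7 and linux_x86_64 means 64-bit Linux, so the wheel you pick has to match the Python version of the Dataiku virtual environment and the architecture of your machine.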

Dataiku setup

Once everything is set up, you need to tell Dataiku where to find the libraries. Add the following lines to the file DATAIKU_DATA_DIR/bin/env-site.sh:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export PATH=/usr/local/cuda/bin:$PATH

Note

If you installed any of the above libraries in a non-standard location, change the paths accordingly.
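
For a non-standard location, one way to keep the paths consistent is to define the CUDA root once and derive everything else from it. This is only a sketch: /opt/cuda-7.5 is a made-up example path and CUDA_ROOT is an illustrative variable name. CUDA_HOME is set to the toolkit root here, which is the convention used in the Tensorflow setup instructions.

```shell
#!/bin/sh
# Assumption: CUDA was installed under /opt/cuda-7.5 (example path only).
CUDA_ROOT=/opt/cuda-7.5

# Toolkit root, as expected by tools that look at CUDA_HOME.
export CUDA_HOME="$CUDA_ROOT"

# Prepend the CUDA libraries, keeping any existing search path
# and avoiding a trailing colon when the variable was empty.
export LD_LIBRARY_PATH="$CUDA_ROOT/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Make nvcc and the other CUDA binaries available.
export PATH="$CUDA_ROOT/bin:$PATH"
```

With this layout, moving to another CUDA directory only means changing the first line.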

Testing the installations

For Theano run:

source DATAIKU_DATA_DIR/bin/env-site.sh
THEANO_FLAGS=floatX=float32,device=gpu DATAIKU_DATA_DIR/bin/python -m theano.misc.check_blas
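
One detail worth knowing when passing THEANO_FLAGS: an environment variable only reaches the Python process if it is exported or set on the same command line as the command itself. The sketch below shows the difference using a throwaway variable, DEMO_FLAGS (a made-up name for illustration), with a child shell standing in for Python.

```shell
#!/bin/sh
unset DEMO_FLAGS

# 1. Plain assignment on its own line: a shell variable that is not
#    exported, so child processes do not see it.
DEMO_FLAGS=floatX=float32,device=gpu
sh -c 'echo "separate line: [${DEMO_FLAGS}]"'
# prints: separate line: []

# 2. Assignment as a prefix of the command: placed in the environment
#    of that one command only.
unset DEMO_FLAGS
DEMO_FLAGS=floatX=float32,device=gpu sh -c 'echo "same line: [${DEMO_FLAGS}]"'
# prints: same line: [floatX=float32,device=gpu]
```

If you prefer to keep the flags on their own line, use export THEANO_FLAGS=... instead of a plain assignment.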

For Tensorflow run:

source DATAIKU_DATA_DIR/bin/env-site.sh
DATAIKU_DATA_DIR/bin/python -m tensorflow.models.image.mnist.convolutional --self_test

Finally, restart Dataiku so that it picks up the new environment settings, and you're good to go!

Glossary

CPU: central processing unit

GPU: graphics processing unit

DATAIKU_DATA_DIR: the Dataiku data directory