Do you want to train deep neural networks? Then you need to use a numerical computation library! If you want to find out what deep learning is, then read my previous introduction.
We focus here on the integration of two popular numerical computation libraries with Dataiku DSS: Theano and TensorFlow.
Since training neural networks requires doing lots of numerical computations, it's essential to get one of these installed before you start doing some serious deep learning. In addition, you will probably want to use a GPU to further speed up some of the calculations.
Note that DSS now includes H2O as a machine learning engine, offering deep neural networks as one of the available algorithms.
Both TensorFlow and Theano support CUDA (a computing platform for NVIDIA GPUs). Your machine must have an NVIDIA GPU for this to work.
To install Theano in the Dataiku Python virtual environment, run:
DATAIKU_DATA_DIR/bin/pip install Theano
Don't forget to set up your Theano configuration file with the options you want. Make sure you have the required dependencies for your distribution.
To install TensorFlow in the Dataiku Python virtual environment, run:
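As a minimal sketch of what such a configuration file might contain, the snippet below renders a `[global]` section with the same two options that the Theano test command in this post passes through THEANO_FLAGS (`floatX=float32` and `device=gpu`). The helper name `render_theanorc` is hypothetical, and the right values depend on your Theano version and hardware, so treat them as assumptions rather than recommended settings.

```python
from __future__ import print_function


def render_theanorc(options):
    """Return the text of a Theano config file with a single [global] section.

    `options` is a list of (name, value) pairs, written in the given order.
    """
    lines = ["[global]"]
    for name, value in options:
        lines.append("{} = {}".format(name, value))
    return "\n".join(lines) + "\n"


# Same values as the THEANO_FLAGS used in the test command of this post.
# Assumption: adjust them to your own Theano version and hardware.
theanorc_text = render_theanorc([("floatX", "float32"), ("device", "gpu")])
print(theanorc_text)
```

Saving the printed text as `~/.theanorc`, the default location Theano reads, makes the options apply without setting THEANO_FLAGS on every command.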
DATAIKU_DATA_DIR/bin/pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl
This installs the latest version as of 2016/10/24. For newer releases, or for builds targeting other architectures, check the TensorFlow documentation.
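When picking a different wheel, it can help to read what the file name encodes. The sketch below splits the wheel name from the command above into its components, following the standard wheel naming scheme (name, version, Python tag, ABI tag, platform tag). The function name `parse_wheel_filename` is hypothetical; here `cp27` means CPython 2.7 and `linux_x86_64` means 64-bit Linux, so both should match the Python interpreter and machine that Dataiku runs on.

```python
from __future__ import print_function


def parse_wheel_filename(url_or_name):
    """Split a wheel file name (or a URL ending in one) into its components."""
    filename = url_or_name.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # Wheel naming scheme: name-version[-build]-python-abi-platform
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: " + filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


info = parse_wheel_filename(
    "https://storage.googleapis.com/tensorflow/linux/gpu/"
    "tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl"
)
print(info)
```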
Once you have set everything up, you need to tell Dataiku where to find the libraries. In the file DATAIKU_DATA_DIR/bin/env-site.sh, add the following lines:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
export CUDA_HOME=/usr/local/cuda/bin
export PATH=/usr/local/cuda/bin:$PATH
If you installed any of the above libraries in a non-standard location, change the paths accordingly.
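A mistyped path in env-site.sh tends to surface only later as a confusing import error, so a quick sanity check can save time. The sketch below is one possible check, not part of Dataiku: the hypothetical helper `missing_dirs` reports which directories named in LD_LIBRARY_PATH and CUDA_HOME do not exist on disk. Run it with DATAIKU_DATA_DIR/bin/python after sourcing env-site.sh.

```python
from __future__ import print_function
import os


def missing_dirs(env, names=("LD_LIBRARY_PATH", "CUDA_HOME")):
    """Return (variable, directory) pairs whose directory does not exist.

    A pair with directory None means the variable is unset or empty.
    """
    missing = []
    for name in names:
        value = env.get(name, "")
        if not value:
            missing.append((name, None))
            continue
        # Variables such as LD_LIBRARY_PATH may list several directories.
        for directory in value.split(os.pathsep):
            if directory and not os.path.isdir(directory):
                missing.append((name, directory))
    return missing


for name, directory in missing_dirs(os.environ):
    if directory is None:
        print("{} is not set".format(name))
    else:
        print("{} points to a missing directory: {}".format(name, directory))
```

If the script prints nothing, every listed directory exists. That does not prove the CUDA libraries inside them are the right version, only that the paths are not mistyped.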
To test the Theano installation, run:
source DATAIKU_DATA_DIR/bin/env-site.sh
THEANO_FLAGS=floatX=float32,device=gpu DATAIKU_DATA_DIR/bin/python -m theano.misc.check_blas
To test the TensorFlow installation, run:
source DATAIKU_DATA_DIR/bin/env-site.sh
DATAIKU_DATA_DIR/bin/python -m tensorflow.models.image.mnist.convolutional --self-test
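If one of these self-tests fails, a lighter first check is whether the library can be imported at all from the Dataiku Python environment. The sketch below is a generic import check, with the hypothetical helper name `installed_versions`; it only reports whether each module loads and which version it declares, and says nothing about whether the GPU is actually used.

```python
from __future__ import print_function
import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if it fails to import."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # A broken install (for example a missing shared library)
            # may raise something other than ImportError.
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


results = installed_versions(["theano", "tensorflow"])
for name in sorted(results):
    print("{}: {}".format(name, results[name] or "could not be imported"))
```

Run it with DATAIKU_DATA_DIR/bin/python after sourcing env-site.sh, so that it sees the same interpreter and environment variables as Dataiku.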
Finally, you can restart Dataiku. And you're good to go!
CPU: central processing unit
GPU: graphics processing unit
DATAIKU_DATA_DIR: the Dataiku data directory