Since DSS 3.1, H2O is natively integrated into DSS visual machine learning, using Sparkling Water (the Spark / H2O integration layer).
You'll find more information about how to use H2O in visual machine learning in the reference documentation.
The rest of this Howto covers how to use an H2O cluster in custom Python code (without Sparkling Water).
H2O is an open source distributed machine learning library that can work on top of Hadoop or Spark.
This howto will guide you through setting up a basic (non-Spark) H2O cluster, and connecting your DSS to it. You can then use the Python notebooks and recipes to actually perform training and scoring in your H2O cluster.
Using H2O over Hadoop using Python
Download the main H2O package
Go to the "Install on Hadoop" tab and select the download corresponding to your Hadoop version
Follow the instructions on that page:
- Unzip the downloaded package
- Run with
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output h2o-output-1
- H2O starts (this will take 30s - 1 minute) and prints its connection address ("Open H2O Flow in your web browser ...").
- Write down the address somewhere
- You must manually remove the output folder between runs, otherwise H2O does not start
- If the command returns, it means something went wrong. Look at the logs for any error
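The launch steps above can be sketched as a small Python helper that assembles both the cleanup command and the launch command. This is only an illustration: the function name `build_h2o_launch_commands` is hypothetical, and the cleanup step assumes the output folder lives on HDFS and can be removed with `hadoop fs -rm -r -f`.

```python
def build_h2o_launch_commands(output_dir, nodes=1, mapper_xmx="6g",
                              driver_jar="h2odriver.jar"):
    """Return (cleanup_cmd, launch_cmd) as argument lists.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    # Assumption: the output folder is an HDFS directory, so it is removed
    # with the Hadoop filesystem shell. -f avoids an error if it is absent.
    cleanup_cmd = ["hadoop", "fs", "-rm", "-r", "-f", output_dir]
    # Same flags as the command shown in this howto.
    launch_cmd = [
        "hadoop", "jar", driver_jar,
        "-nodes", str(nodes),
        "-mapperXmx", mapper_xmx,
        "-output", output_dir,
    ]
    return cleanup_cmd, launch_cmd


cleanup_cmd, launch_cmd = build_h2o_launch_commands("h2o-output-1")
print(" ".join(cleanup_cmd))
print(" ".join(launch_cmd))
```

Each list can be passed to `subprocess.run` on a machine where the `hadoop` command is available. The launch command is expected to keep running for as long as the cluster is up.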
Install the Python library
In the DSS Python virtual environment, you must install the exact same H2O version that you used to start your cluster.
- Go to the "Install in Python" tab of the same release page
Note the URL to install the H2O python package (the URL starts with
Go to the data dir of your DSS installation and run ./bin/pip install <URL>. For example, if your H2O version is 126.96.36.199, run:
./bin/pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turan/4/Python/h2o-188.8.131.52-py2.py3-none-any.whl
Important note: do not use ./bin/pip install h2o, as the client and cluster versions must exactly match.
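Since a version mismatch is easy to introduce, a small check can help. The sketch below extracts the version string from a wheel URL and compares it with the cluster version. The helper names are hypothetical, the URL in the example is a placeholder rather than a real release, and the regular expression assumes the file name follows the `h2o-<version>-....whl` pattern seen above.

```python
import re

# Assumption: the wheel file name looks like h2o-<version>-<tags>.whl
WHEEL_VERSION_RE = re.compile(r"/h2o-(?P<version>[^/-]+)-[^/]*\.whl$")


def version_from_wheel_url(url):
    """Extract the H2O version from a wheel URL (hypothetical helper)."""
    match = WHEEL_VERSION_RE.search(url)
    if match is None:
        raise ValueError("Could not find an H2O version in: %s" % url)
    return match.group("version")


def check_versions_match(client_version, cluster_version):
    """Raise if the Python client and the cluster versions differ."""
    if client_version != cluster_version:
        raise RuntimeError(
            "H2O client version %s does not match cluster version %s"
            % (client_version, cluster_version)
        )


# Placeholder URL and version, for illustration only.
example_url = "http://example.com/Python/h2o-1.2.3.4-py2.py3-none-any.whl"
wheel_version = version_from_wheel_url(example_url)
check_versions_match(wheel_version, "1.2.3.4")
print(wheel_version)
```

Once the package is installed, the client version can be read in a notebook from `h2o.__version__` and passed as `client_version`.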
Execute a sample code
Start a Jupyter Python notebook from a DSS project. You can then get started with the sample code provided by H2O.
Have a look at the Python demos from H2O
For example, this notebook provides a simple prediction example.
Important note: When calling h2o.init(), use the h2o.init(ip="", port=) variant and pass the host and port that were printed when you started the H2O cluster.
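As a minimal sketch of that last step, the address you wrote down can be split into a host and a port and passed to `h2o.init`. The helper names `parse_h2o_address` and `connect_to_cluster` are hypothetical, and the address in the usage comment is a placeholder, not a value from a real cluster.

```python
from urllib.parse import urlparse


def parse_h2o_address(address):
    """Split an address such as http://host:port into (host, port)."""
    # Accept addresses written down without the http:// prefix.
    if "://" not in address:
        address = "http://" + address
    parsed = urlparse(address)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError("Expected an address of the form host:port")
    return parsed.hostname, parsed.port


def connect_to_cluster(address):
    """Connect the h2o Python client to an already running cluster."""
    # Imported here so that the parsing helper works without h2o installed.
    import h2o

    host, port = parse_h2o_address(address)
    h2o.init(ip=host, port=port)
    return h2o


# Usage in a DSS notebook (placeholder address):
#   h2o = connect_to_cluster("http://10.0.0.5:54321")
```

Keeping the parsing separate from the connection makes it easy to check the address before attempting to reach the cluster.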