Since DSS 3.1, H2O is natively integrated into DSS visual machine learning, using Sparkling Water (the Spark / H2O integration layer).
You'll find more information about how to use H2O in visual machine learning in the reference documentation.
The rest of this howto covers how to use an H2O cluster from custom Python code (without Sparkling Water).
H2O is an open source distributed machine learning library that can work on top of Hadoop or Spark.
This howto will guide you through setting up a basic (non-Spark) H2O cluster, and connecting your DSS to it. You can then use the Python notebooks and recipes to actually perform training and scoring in your H2O cluster.
Go to the “Install on Hadoop” tab and select the download corresponding to your Hadoop version.
Follow the instructions on that page:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output h2o-output-1
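The options in the command above control the size of the cluster: `-nodes` is the number of H2O nodes, `-mapperXmx` is the memory given to each node, and `-output` is an HDFS directory that should not already exist. As a rough sketch, the same command line can be assembled from parameters in Python; the function name `h2o_driver_command` and its argument names are hypothetical and only serve the illustration.

```python
import shlex


def h2o_driver_command(nodes=1, mapper_xmx="6g",
                       output_dir="h2o-output-1",
                       driver_jar="h2odriver.jar"):
    """Build the argument list for the H2O Hadoop driver command.

    Hypothetical helper: it only assembles the flags shown in this howto.
    """
    return [
        "hadoop", "jar", driver_jar,
        "-nodes", str(nodes),        # number of H2O nodes to start
        "-mapperXmx", mapper_xmx,    # memory for each node
        "-output", output_dir,       # HDFS output directory
    ]


if __name__ == "__main__":
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in h2o_driver_command()))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; to launch the cluster, pass the list to `subprocess.run` on a machine where the `hadoop` client is configured.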
In the DSS Python virtual environment, you must install exactly the same H2O version as the one used to start your cluster.
Note the URL to install the H2O python package (the URL starts with
./bin/pip install <URL>.
For example, if your H2O version is 22.214.171.124, run:
./bin/pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turan/4/Python/h2o-126.96.36.199-py2.py3-none-any.whl
Important note: do not use ./bin/pip install h2o, as it installs the latest published release, whereas the client and cluster versions must match exactly.
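To double-check the installation, you can compare the version of the installed Python package with the version of your cluster. The following is a minimal sketch, assuming Python 3.8 or later and that it is run with the Python interpreter of the DSS virtual environment; the helper names are hypothetical, and the cluster version must be supplied by you.

```python
from importlib import metadata


def installed_h2o_version():
    """Return the version of the installed h2o package, or None if absent."""
    try:
        return metadata.version("h2o")
    except metadata.PackageNotFoundError:
        return None


def versions_match(client_version, cluster_version):
    """True only if both version strings are identical (ignoring whitespace)."""
    return client_version.strip() == cluster_version.strip()


def describe_client(cluster_version):
    """Return a short human-readable status for the installed client."""
    client_version = installed_h2o_version()
    if client_version is None:
        return "h2o is not installed in this Python environment"
    if versions_match(client_version, cluster_version):
        return "client and cluster versions match: " + client_version
    return "version mismatch: client %s, cluster %s" % (
        client_version, cluster_version.strip())
```

Calling `describe_client("<your cluster version>")` returns a message you can print in a notebook before attempting to connect.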
Start a Jupyter Python notebook from a DSS project. You can then get started with the sample code provided by H2O.
Have a look at the Python demos from H2O.
For example, this notebook provides a simple prediction example.
Important note: when calling h2o.init(), use the h2o.init(ip="", port=) variant and pass the host and port that were printed when you started the H2O cluster.
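As an illustration of that call, here is a small sketch that takes the cluster address as a single `host:port` string and passes its two parts to `h2o.init`. The helper names and the example address `10.0.0.5:54321` are hypothetical; replace the address with the one printed by your own cluster.

```python
def parse_h2o_address(address):
    """Split 'host:port' (optionally prefixed by a scheme) into (host, port)."""
    address = address.strip()
    # Drop an optional scheme such as "http://" and any trailing slash.
    if "://" in address:
        address = address.split("://", 1)[1]
    address = address.rstrip("/")
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected 'host:port', got %r" % address)
    return host, int(port)


def connect_to_cluster(address):
    """Connect the h2o client to the cluster at the given address."""
    import h2o  # imported lazily so the parser works without h2o installed

    host, port = parse_h2o_address(address)
    h2o.init(ip=host, port=port)
    return h2o


# Example with a hypothetical address; use the one printed at cluster startup:
# h2o = connect_to_cluster("10.0.0.5:54321")
```

Keeping the parsing separate from the connection makes it easy to validate the address before any network call is made.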