
Using H2O in DSS (Python)

April 03, 2016

Since DSS 3.1, H2O is natively integrated into DSS visual machine learning, using Sparkling Water (the Spark / H2O integration layer).
You'll find more information about how to use H2O in visual machine learning in the reference documentation.

The rest of this howto covers how to use an H2O cluster from custom Python code (without Sparkling Water).

H2O is an open source distributed machine learning library that can work on top of Hadoop or Spark.

This howto guides you through setting up a basic (non-Spark) H2O cluster and connecting DSS to it. You can then use Python notebooks and recipes to train and score models in your H2O cluster.

Using H2O on Hadoop from Python

Download the main H2O package

  • Go to the H2O website and locate the download page for the current release. At the time of writing, the current release is 3.8.1.4 (Turan).

  • Go to the "Install on Hadoop" tab and select the download corresponding to your Hadoop version

  • Follow the instructions on that page:

    • Unzip the downloaded package
    • Start the cluster with: hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output h2o-output-1
    • H2O starts (this takes 30 seconds to 1 minute) and prints its connection address ("Open H2O Flow in your web browser ...").
    • Note this address: you will need its host and port to connect from Python

Important notes:

  • You must manually remove the output folder between runs, otherwise H2O does not start
  • The command is expected to keep running while the cluster is up. If it returns, something went wrong: look at the logs for errors
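
The launch command and the cleanup of the output folder can be combined in a small wrapper. The following is a minimal sketch rather than anything shipped with H2O: the helper names build_launch_commands and launch_h2o are hypothetical, and the sketch assumes that the -output folder is an HDFS path, that the hadoop command is on the PATH, and that it is run from the directory containing h2odriver.jar.

```python
import subprocess


def build_launch_commands(output_dir="h2o-output-1", nodes=1, mapper_xmx="6g"):
    """Return the cleanup command and the launch command as argument lists."""
    # Assumption: the output folder lives on HDFS; -f avoids an error
    # when the folder does not exist yet (first run).
    cleanup = ["hadoop", "fs", "-rm", "-r", "-f", output_dir]
    launch = [
        "hadoop", "jar", "h2odriver.jar",
        "-nodes", str(nodes),
        "-mapperXmx", mapper_xmx,
        "-output", output_dir,
    ]
    return cleanup, launch


def launch_h2o(output_dir="h2o-output-1"):
    """Remove the previous output folder, then start H2O."""
    cleanup, launch = build_launch_commands(output_dir)
    subprocess.call(cleanup)
    # This call blocks while the cluster is running. If it returns,
    # the cluster is no longer up and the logs should be checked.
    return subprocess.call(launch)
```

Calling launch_h2o() in a dedicated terminal session keeps the cluster in the foreground, so the connection address it prints stays visible.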

Install the Python library

In the DSS Python virtual environment, you must install exactly the same H2O version as the one used to start your cluster.

  • Go to the "Install in Python" tab of the same release page
  • Note the URL to install the H2O Python package (the URL starts with http://h2o-release.s3.amazonaws.com)

  • Go to the data directory of your DSS installation

  • Run ./bin/pip install <URL>. For example, if your H2O version is 3.8.1.4, run: ./bin/pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turan/4/Python/h2o-3.8.1.4-py2.py3-none-any.whl

  • Restart DSS: ./bin/dss restart

Important note: do not use ./bin/pip install h2o, as the client and cluster versions must exactly match.
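
A quick way to catch a mismatch early is to compare the installed client version with the version of the cluster you started. This is a minimal sketch: versions_match and check_client_version are hypothetical helper names, and it assumes that the installed h2o package exposes its version as h2o.__version__.

```python
def versions_match(client_version, cluster_version):
    """Return True only when the two version strings are identical."""
    return client_version.strip() == cluster_version.strip()


def check_client_version(cluster_version):
    """Raise if the installed h2o package differs from the cluster version."""
    # Lazy import so that versions_match() can be used without h2o installed.
    import h2o
    # Assumption: the package exposes its version string as __version__.
    if not versions_match(h2o.__version__, cluster_version):
        raise RuntimeError(
            "h2o client %s does not match cluster %s"
            % (h2o.__version__, cluster_version)
        )
```

For example, check_client_version("3.8.1.4") at the top of a notebook fails fast if the wrong wheel was installed.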

Run sample code

Start a Jupyter Python notebook from a DSS project. You can then get started with the sample code provided by H2O.

Have a look at the Python demos from H2O.

For example, this notebook provides a simple prediction example.

Important note: when calling h2o.init(), use the h2o.init(ip="<host>", port=<port>) variant and pass the host and port that were printed when you started the H2O cluster.
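
The connection step can be sketched as follows. The helper names parse_flow_address and connect are hypothetical, and the sketch assumes that the address line printed at startup contains a URL of the form http://host:port. The only H2O call used is h2o.init(ip=..., port=...).

```python
import re


def parse_flow_address(line):
    """Extract (host, port) from the address line printed at cluster startup."""
    # Assumption: the line contains a URL such as http://<host>:<port>.
    match = re.search(r"https?://([^\s:/]+):(\d+)", line)
    if match is None:
        raise ValueError("no host:port address found in: %r" % line)
    return match.group(1), int(match.group(2))


def connect(line):
    """Connect the Python client to the already running cluster."""
    # Lazy import: h2o is only needed when actually connecting.
    import h2o
    host, port = parse_flow_address(line)
    h2o.init(ip=host, port=port)
    return h2o
```

In a notebook, paste the line printed by your own cluster into connect(...). You can also skip the parsing and call h2o.init directly with the host and port you noted.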