
Writing your first Dataiku DSS plugin

Applies to DSS 2.1 and above | September 28, 2015

Thanks to plugins, you can extend the features of Dataiku DSS. You can write your own plugins, use them on your Dataiku DSS instance, and share them with your team or with the world.

If you want to know about using plugins, see our reference guide on plugins.

Custom components

A plugin in DSS can contain several related components.

For example, a plugin dedicated to Salesforce could contain several datasets to access different parts of the Salesforce ecosystem, as well as data enrichment recipes.

In this tutorial, we’ll be developing a custom recipe and a custom dataset.

Custom recipe

By writing a custom recipe, you can add a new kind of recipe to DSS.

The idea is:

  • The developer of the custom recipe writes the core of the recipe in Python or R code
  • The developer of the custom recipe writes a JSON descriptor. This JSON descriptor declares
    • The kinds of inputs and outputs of the recipe
    • The available configuration parameters
  • In the Python or R code of the recipe, the developer uses a dedicated API to retrieve the inputs, outputs and parameters (i.e., the “instantiation parameters”) of the recipe

  • The user of the recipe is presented with a simple visual interface in which they can enter the declared configuration parameters, and run the recipe.

The user of the recipe never has to touch the code of the recipe. To them, a custom recipe is like a simple DSS visual recipe.

Custom recipes can be used, for example, to package common tasks on datasets (audit a dataset, perform some statistical analysis, …).

Custom datasets

Custom datasets work in a similar way. However, unlike custom recipes, custom datasets can only be written in Python.

  • The developer of the dataset writes Python code that reads rows from the data source, or writes rows to it
  • The developer of the dataset writes a JSON descriptor that declares the configuration parameters
  • The user of the dataset is presented with a visual interface in which they can enter the configuration parameters.

The dataset then behaves like all other DSS datasets. For example, you can then run a preparation recipe on this custom dataset.

Custom datasets can be used, for example, to connect to external data sources like REST APIs.

Prerequisites

The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project > DSS Tutorials > Code > Create your first DSS plugin.

This includes the example dataset wine_quality.

  • Developing plugins requires that you have a good working knowledge of Python and/or R.
  • You also need to have shell access to the server running Dataiku DSS.
  • You must belong to a group that has the Develop Plugins permission.

Create a dev plugin

There are two kinds of plugins in DSS:

  • Plugins that you download from the store or that you upload as a Zip file are installed plugins.
  • Plugins that you develop in DSS are dev plugins.

Let’s create a dev plugin:

  • From the application menu, choose Plugins development.
  • Click +New dev plugin.
  • Give an identifier to your plugin.

This identifier should be globally unique, so we recommend prefixing it with your company name or another distinctive string.

Now that we have a skeleton for a plugin, we can add some components to it.

The pairwise correlations recipe

Let’s write a custom recipe that computes pairwise correlations (i.e., correlations between the values in pairs of columns). For example, in a car sales dataset, we might discover that the price has a strong anti-correlation to the mileage. This recipe will be written in Python.
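
As a quick standalone illustration of what “pairwise correlation” means, here is a minimal pandas sketch, independent of DSS (the column names and values are made up for the example):

import pandas as pd

# A toy dataset with two numerical columns (hypothetical values)
df = pd.DataFrame({"price":   [10000, 8000, 5000, 3000],
                   "mileage": [20000, 45000, 90000, 150000]})

# .corr() returns the correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two columns
print(df[["price", "mileage"]].corr().iloc[0, 1])  # close to -1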

We will start by writing a Python recipe in the flow of the tutorial project, and then make it “reusable”.

Create the base recipe

Create a Python recipe from the “wine_quality” dataset to a new “wine_correlation” dataset.

The recipe code looks like the following:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np

# Read the input
input_dataset = dataiku.Dataset("wine_quality")
df = input_dataset.get_dataframe()
column_names = df.columns

# We'll only compute correlations on numerical columns
# So extract all pairs of names of numerical columns
pairs = []
for i in range(0, len(column_names)):
    for j in range(i + 1, len(column_names)):
        col1 = column_names[i]
        col2 = column_names[j]
        if df[col1].dtype == "float64" and \
           df[col2].dtype == "float64":
            pairs.append((col1, col2))

# Compute the correlation for each pair, and write a
# row in the output array
output = []
for pair in pairs:
    corr = df[[pair[0], pair[1]]].corr().iloc[0][1]
    output.append({"col0" : pair[0],
                   "col1" : pair[1],
                   "corr" :  corr})

# Write the output to the output dataset
output_dataset =  dataiku.Dataset("wine_correlation")
output_dataset.write_with_schema(pd.DataFrame(output))

You can run the recipe and see the output: a dataset with 3 columns (col0, col1, corr) and one row per pair of input columns.

Convert it and make it reusable

Let’s make it a custom recipe:

  • Go to the Advanced tab of the Python recipe
  • Click Convert to custom recipe
  • Select the dev plugin to add the custom recipe to
  • Choose to place it in the folder compute_correlation, since we expect to use this recipe beyond the wine dataset
  • Click Convert
  • DSS generates the custom recipe files and suggests editing them in the plugin developer. Let’s do that now.

For the rest of the tutorial, we’ll now be tweaking the generated files.

Edit definitions in recipe.json

First, let’s have a look at the recipe.json file. The most important thing to change is the “inputRoles” and “outputRoles” arrays. Roles allow you to associate one or more datasets with each “kind” of input and output of the recipe.

Our recipe is a simple one: it has one “input role” with exactly 1 dataset, and one “output role” with exactly 1 dataset. Edit your JSON to look like:

    "inputRoles" : [
        {
            "name": "input",
            "label": "Input dataset",
            "description": "The dataset containing the raw data from which we'll compute correlations.",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],

    "outputRoles" : [
        {
            "name": "main_output",
            "label": "Output dataset",
            "description": "The dataset containing the correlations.",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],

We’d like to let users of this plugin focus on “strong” correlations (i.e., the corr values that are closest to +1 or -1).

We can specify a threshold parameter that can be set in the recipe dialog by editing the params section of recipe.json:

"params": [
    {
        "name": "threshold",
        "label" : "Threshold for showing a correlation",
        "type": "DOUBLE",
        "defaultValue" : 0.5,
        "description":"Correlations below the threshold will not appear in the output dataset",
        "mandatory" : true
    }
],

Edit code in recipe.py

Now let’s edit recipe.py. The default contents include some generic starter code for referencing roles and parameters, the code from your Python recipe, and some comments that explain how to finish creating your custom recipe. In the end, your recipe.py should start with code for retrieving datasets and parameters like:

# Retrieve array of dataset names from 'input' role, then create datasets
input_names = get_input_names_for_role('input')
input_datasets = [dataiku.Dataset(name) for name in input_names]

# For outputs, the process is the same:
output_names = get_output_names_for_role('main_output')
output_datasets = [dataiku.Dataset(name) for name in output_names]

# Retrieve parameter values from the map of parameters
threshold = get_recipe_config()['threshold']
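
Optionally, if you want the code to fall back to a sensible default when the parameter is missing (the defaultValue in recipe.json already prefills the UI, so this is just a safety net, assuming the returned configuration behaves like a plain dictionary):

threshold = get_recipe_config().get('threshold', 0.5)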

The portion of your original recipe that reads inputs needs to be updated to refer to the datasets created from the input roles, like:

# Read the input
input_dataset = input_datasets[0]
df = input_dataset.get_dataframe()
column_names = df.columns

The portion of your original recipe that computes the correlations should be updated to include the threshold to filter out the weak correlations:

for pair in pairs:
    corr = df[[pair[0], pair[1]]].corr().iloc[0][1]
    if np.abs(corr) > threshold:
        output.append({"col0" : pair[0],
                       "col1" : pair[1],
                       "corr" :  corr})

The portion of your original recipe that writes the output datasets also needs to be updated to refer to the datasets created from the output roles, like:

# Write the output to the output dataset
output_dataset =  output_datasets[0]
output_dataset.write_with_schema(pd.DataFrame(output))

Verify that neither “wine_quality” nor “wine_correlation” appears anywhere in your recipe. In general, the rest of recipe.py can be left as-is.

Use your custom recipe in the flow

About reloads

After editing the recipe.json for a custom recipe, you must do the following:

  • Go to the plugin developer page
  • Open your dev plugin
  • Click Reload
  • Reload the Dataiku DSS page in your browser

When modifying the recipe.py file, you don't need to reload anything. Simply run the recipe again.

  • Go to the Flow
  • In “+ Recipe”, select your recipe type
  • You now have the normal recipe creation tab
  • Select the wine_quality input dataset
  • Create a new output dataset
  • Run the recipe, editing the default threshold value if you desire
  • Congratulations, you have created your first DSS custom recipe!

The RaaS API dataset

For our custom dataset, we’re going to read the Dataiku RaaS (Randomness as a Service) REST API.

This API returns random numbers, so we want to use it to extend Dataiku DSS’s functionality.

To use the API, we have to perform a GET query on http://raas.dataiku.com/api.php

For example, visit: http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret

This returns 5 random numbers between 0 and 200.
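
You can make the same call from Python with the requests library; the response is a JSON array of numbers, which is what the connector below relies on:

import requests

# Same query as the example URL above
resp = requests.get("http://raas.dataiku.com/api.php",
                    params={"nb": 5, "max": 200, "apiKey": "secret"})
print(resp.json())  # a list of 5 random numbers between 0 and 200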

Create the custom dataset

Custom datasets are a bit more complex to write than custom recipes, since we cannot start from an existing recipe in the Flow.

  • Go to the plugin developer page
  • Create a new dev plugin (or reuse the previous one)
  • In the dev plugin page, click on +Add Component
  • Choose Dataset
  • Select Python as the language
  • Give the new dataset type an identifier, like raas, and click Add
  • Use the editor to modify the generated files.

We’ll start with the connector.json file. Our custom dataset needs the user to input 3 parameters:

  • Number of random numbers
  • Range
  • API Key

So let’s create our params array:

    "params": [
        {
            "name": "apiKey",
            "label": "RAAS API Key",
            "type": "STRING",
            "description" : "You can enter more help here"
        },
        {
            "name": "nb",
            "label": "Number of random numbers",
            "type": "INT",
            "defaultValue" : 10 /* You can have the data prefilled */
        },
        {
            "name": "max",
            "label": "Max value",
            "type": "INT"
        }
    ]

For the Python part, we need to write a Python class.

In the constructor, we’ll retrieve the parameters:

# perform some more initialization
self.key = self.config["apiKey"]
self.nb = int(self.config["nb"])
self.max = int(self.config["max"])

We know in advance the schema of our dataset: it will only have one column named “random” containing integers. So, in get_read_schema, let’s return this schema:

def get_read_schema(self):
    return {
        "columns" : [
            { "name" : "random", "type" : "int" }
        ]
    }

Finally, the core of the connector is the generate_rows method. This method is a generator over dictionaries. Each yield in the generator becomes a row in the dataset.

If you don’t know about generators in Python, you can have a look at https://wiki.python.org/moin/Generators
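
As a minimal illustration, unrelated to DSS, a generator function produces its values one at a time with yield:

def first_squares(n):
    """Yield the squares of 0 .. n-1, one at a time."""
    for i in range(n):
        yield i * i

for value in first_squares(3):
    print(value)  # prints 0, then 1, then 4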

We’ll be using the requests library to perform the API calls.

The final code of our dataset is:

from dataiku.connector import Connector
import requests

class MyConnector(Connector):

    def __init__(self, config):
        Connector.__init__(self, config)  # pass the parameters to the base class

        self.key = self.config["apiKey"]
        self.nb = int(self.config["nb"])
        self.max = int(self.config["max"])

    def get_read_schema(self):
        return {
            "columns" : [
                { "name" : "random", "type" : "int" }
            ]
        }

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                            partition_id=None, records_limit = -1):

        req = requests.get("http://raas.dataiku.com/api.php", params = {
            "apiKey": self.key,
            "nb":self.nb,
            "max":self.max
        })

        array = req.json()
        for random_number in array:
            yield { "random"  : random_number}

(None of the other methods are required at this point, so we removed them.)
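
One optional refinement, not needed for this tutorial, would be to honor the records_limit argument that DSS passes to generate_rows (for instance when it only needs a sample); this sketch assumes that -1 means “no limit”:

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                            partition_id=None, records_limit = -1):

        req = requests.get("http://raas.dataiku.com/api.php", params = {
            "apiKey": self.key,
            "nb": self.nb,
            "max": self.max
        })

        for i, random_number in enumerate(req.json()):
            # Stop early if DSS requested a limited number of records
            if records_limit >= 0 and i >= records_limit:
                return
            yield { "random" : random_number }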

Use the plugin

About reloads

After editing the connector.json for a custom dataset, you must do the following:

  • Go to the plugin developer page
  • Open your dev plugin
  • Click 'Reload'
  • Reload the DSS page in your browser

When modifying the connector.py file, you don't need to reload anything.

In the new dataset menu, you can now see your new dataset (try reloading your browser if this is not the case). You are presented with a UI to set the 3 required parameters.

  • Set “secret” as API Key
  • Set anything as nb and max
  • Click test
  • Your random numbers appear!

You can now hit Create: you have created a new type of dataset, and you can use it like any other DSS dataset.

About caching

There is no specific caching mechanism in custom datasets. Custom datasets are often used to access external APIs, and you may not want to perform another call on the API each time DSS needs to read the input dataset.

It is therefore highly recommended that the first thing you do with a custom dataset is to use a Prepare or Sync recipe to build a cached copy of the data in one of your regular data stores.

Sharing your plugin

Plugins are distributed as Zip archives. To share a dev plugin, simply select Download this plugin from the plugin’s actions menu. Alternatively, you can compress the contents of the plugin folder into a Zip file (the contents of the plugin folder must be at the root of the Zip file, i.e. there must not be any leading folder in the Zip file). The plugin.json file should therefore be at the root of the Zip file.

A typical Zip file would therefore have this structure:

./plugin.json
./python-connectors/myapi-connector/connector.py
./python-connectors/myapi-connector/connector.json
./python-lib/myapi.py
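
If you build the archive yourself rather than using Download this plugin, here is one way to produce a Zip with the contents at the root, for example with Python (assuming your plugin lives in a local folder named my-plugin):

import shutil

# Archive the *contents* of the plugin folder, so that plugin.json
# ends up at the root of the Zip (no leading folder)
shutil.make_archive("my-plugin", "zip", root_dir="my-plugin")
# -> creates my-plugin.zip in the current directory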

You can then install this Zip file on any DSS instance by following the plugin installation instructions.

If you want to distribute your plugin to all DSS users, we’d be glad to discuss it with you. Head over to our contributions repository or contact us!

Going further

We have only scratched the surface of what custom datasets and recipes can do.

To see examples of more advanced features, have a look at the code of the publicly available DSS plugins.

In particular, have a look at the samples folder, which lists the features that each dataset and recipe in the public repository uses.