
Writing your first DSS plugin

September 28, 2015

Thanks to plugins, you can extend the features of DSS. You can write your own plugins, use them on your DSS instance, and share them with your team or with the world.

If you want to know about using plugins, see our reference guide on plugins.

Custom components

In this tutorial, we'll be writing two custom components.

A plugin in DSS can contain several related components. Custom components can be:

  • Datasets
  • Recipes

For example, a plugin dedicated to Salesforce could contain several datasets to access several parts of the ecosystem, and several data enrichment recipes.

In this tutorial, we'll be developing a custom recipe and a custom dataset.

Custom recipe

By writing a custom recipe, you can add a new kind of recipe to DSS.

The idea is:

  • The developer of the custom recipe writes the core of the recipe in Python or R code
  • The developer of the custom recipe writes a JSON descriptor. This JSON descriptor declares:
    • The kinds of inputs and outputs of the recipe
    • The available configuration parameters
  • In the Python or R code of the recipe, the developer uses a specific API to retrieve the inputs, outputs and parameters (i.e., the "instantiation parameters") of the recipe
  • The user of the recipe is presented with a simple visual interface in which he can enter the declared configuration parameters and run the recipe.

The user of the recipe never has to touch the code of the recipe. To him, a custom recipe is like a simple DSS visual recipe.

Custom recipes can be used, for example, to package common tasks on datasets (auditing a dataset, performing some statistical analysis, ...).

Custom datasets

Custom datasets work in much the same way. However, unlike custom recipes, custom datasets can only be written in Python.

  • The developer of the dataset writes Python code that reads rows from the data source, or writes rows to it
  • The developer of the dataset writes a JSON descriptor that declares the configuration parameters
  • The user of the dataset is presented with a visual interface in which he can enter the configuration parameters.

The dataset then behaves like all other DSS datasets. For example, you can then run a preparation recipe on this custom dataset.

Custom datasets can be used for example to connect to external data sources like REST APIs.

Prerequisites

We recommend that you start from the DSS home page and click "DSS tutorials" → "Code" → "Create your first DSS plugin".

This way you get the example dataset wine_quality. But you can also create a new project and use your own datasets.

  • Developing plugins requires a good working knowledge of Python and/or R.
  • You also need to have shell access to the server running DSS.
  • To develop plugins, you must be a DSS administrator.

Enabling the developer tools

DSS comes with several tools that make it easier to write plugins.

To enable the developer tools:

  • Go to Administration > Settings > Misc
  • Enable the "Plugin development" checkbox
  • Save settings
  • Reload the page in your browser

Now, when you click on the Administration icon in the nav bar of DSS, you'll see a new entry, "Plugin Developer".

Create a dev plugin

There are two kinds of plugins in DSS:

  • Plugins that you download from the store or that you upload as a Zip file are installed plugins
  • Plugins that you develop in DSS are dev plugins

Let's create a dev plugin:

  • Go to the Plugin developer page
  • Click New dev plugin
  • Give an identifier to your plugin.

This identifier should be globally unique, so we recommend that you prefix it with your company name, for example.
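For reference, the skeleton that DSS creates contains a plugin.json descriptor that holds this identifier. Here is a minimal sketch of what it can look like (the exact set of fields, notably under "meta", may vary between DSS versions, so treat this as illustrative):

{
    "id": "mycompany-first-plugin",
    "version": "0.0.1",
    "meta": {
        "label": "My first plugin",
        "description": "A pairwise correlations recipe and a RaaS dataset"
    }
}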

Now that we have a skeleton for a plugin, we can add some components to it.

The pairwise correlations recipe

Let's write a custom recipe that computes pairwise correlations (i.e., correlations between the values in pairs of columns). For example, in a car sales dataset, we might discover that the price is strongly anti-correlated with the mileage. This recipe will be written in Python.
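As a quick refresher, pandas computes pairwise (by default, Pearson) correlations with DataFrame.corr(), which returns a correlation matrix. A minimal standalone illustration with made-up car data (the numbers are purely illustrative):

import pandas as pd

df = pd.DataFrame({"price":   [10000, 8000, 5000, 2000],
                   "mileage": [20000, 60000, 120000, 180000]})

print(df.corr())             # the 2x2 correlation matrix
print(df.corr().iloc[0, 1])  # price/mileage correlation, close to -1 here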

We could start writing the Python and JSON files for the recipe manually, but DSS makes it much easier. After all, writing a custom recipe is basically writing a regular recipe, and then making it "reusable".

So let's start by writing a "regular" Python recipe. For that, we'll use the example dataset about wine quality that comes preloaded in the tutorial project.

Create the base recipe

Create a Python recipe from the "wine_quality" dataset to a new "wine_correlation" dataset.

The code looks like this:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np

# Read the input
input_dataset = dataiku.Dataset("wine_quality")
df = input_dataset.get_dataframe()
column_names = df.columns

# We'll only compute correlations on numerical columns
# So extract all pairs of names of numerical columns
pairs = []
for i in range(0, len(column_names)):
    for j in range(i + 1, len(column_names)):
        col1 = column_names[i]
        col2 = column_names[j]
        if df[col1].dtype == "float64" and \
           df[col2].dtype == "float64":
            pairs.append((col1, col2))

# Compute the correlation for each pair, and write a
# row in the output array
output = []
for pair in pairs:
    corr = df[[pair[0], pair[1]]].corr().iloc[0, 1]
    output.append({"col0": pair[0],
                   "col1": pair[1],
                   "corr": corr})

# Write the output to the output dataset
output_dataset = dataiku.Dataset("wine_correlation")
output_dataset.write_with_schema(pd.DataFrame(output))

You can run the recipe and see the output: a dataset with 3 columns (col0, col1, corr) and one row per pair of input columns.

Convert it and make it reusable

Let's make it a custom recipe:

  • Go to the Advanced tab
  • Click "Convert to custom recipe"
  • Select the dev plugin to add the custom recipe to
  • Click convert.
  • DSS generates the custom recipe files and gives us the path.

For the rest of the tutorial, we'll now be tweaking the generated files by editing them on the DSS server.

If you go to the Flow view right now and click the "+ Recipe" button, you'll see your new custom recipe available there. But first, let's edit it.

First, let's have a look at the recipe.json file. The most important thing to change is the "inputRoles" and "outputRoles" arrays. Roles allow you to associate one or more datasets with each "kind" of input and output of the recipe.

Our recipe is a simple one: it has one "input role" with exactly 1 dataset, and one "output role" with exactly 1 dataset. Let's call both roles "main". So your JSON should look like this:

    "inputRoles" : [
        {
            "name": "main",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],

    "outputRoles" : [
        {
            "name": "main",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],
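As an aside, a role is not limited to a single dataset. To accept several datasets on one role, you would declare "arity": "NARY" instead, in which case get_input_names_for_role returns one name per attached dataset. A sketch, not needed for this tutorial:

    "inputRoles" : [
        {
            "name": "main",
            "arity": "NARY",
            "required": true,
            "acceptsDataset": true
        }
    ],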

Now let's edit recipe.py. What we need to do is replace the hardcoded dataset names with references to the roles. The recipe.py file has extensive explanations of how to do that. In the end, your recipe.py should start like this:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku.customrecipe import *

# Read the input

# There is only one input, and it is mandatory, so we can safely access [0]
main_input_name = get_input_names_for_role('main')[0]
input_dataset = dataiku.Dataset(main_input_name)

df = input_dataset.get_dataframe()
column_names = df.columns

and end like this:

# Write the output to the output dataset
main_output_name = get_output_names_for_role('main')[0]
output_dataset = dataiku.Dataset(main_output_name)
output_dataset.write_with_schema(pd.DataFrame(output))

Verify that neither "wine_quality" nor "wine_correlation" appears anymore in your recipe.

Don't forget to comment out the sample parameter-access code (the lines starting with my_variable =); we'll come back to parameters later.
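For reference, the generated sample looks something like the line below (the exact template text may differ between DSS versions); commenting it out simply means prefixing it with #:

my_variable = get_recipe_config()['parameter_name']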

Use it

About reloads

After editing the recipe.json for a custom recipe, you must do the following:

  • Go to the plugin developer page
  • Click on your dev plugin
  • Click 'Reload descriptors'
  • Reload the DSS page in your browser

When modifying the recipe.py file, you don't need to reload anything. Simply run the recipe again.

  • Go to the Flow
  • In "+ Recipe", select your recipe type
  • You now have the normal recipe creation tab
  • Select the wine_quality input dataset
  • Create a new output dataset
  • Run the recipe
  • Congratulations, you have created your first DSS custom recipe!

Add a parameter

We'd like to only show the "strong" correlations (i.e., the corr values that are closest to +1 or -1).

Let's give the user the ability to select a threshold.

Edit the recipe.json and declare just one parameter:

    "params": [
        {
            "name": "threshold",
            "type": "DOUBLE",
            "defaultValue" : 0.5
        }
    ]

In recipe.py, retrieve the parameter at the beginning:

df = input_dataset.get_dataframe()
column_names = df.columns

# Warning: get_recipe_config always returns strings.
threshold = float(get_recipe_config()['threshold'])

And filter out the "bad" correlations:

for pair in pairs:
    corr = df[[pair[0], pair[1]]].corr().iloc[0, 1]
    if np.abs(corr) > threshold:
        output.append({"col0": pair[0],
                       "col1": pair[1],
                       "corr": corr})

Reload everything and go back to the recipe screen. Set the threshold to 0.65 for example. Only two correlations remain!

The RaaS API dataset

For our custom dataset, we're going to read a truly wonderful REST API: the Dataiku RaaS (Randomness as a Service) API.

Simply put, this is an API that returns random numbers. Random numbers are very useful, so we want to extend DSS with the ability to use this API.

To use the API, we perform a GET request on http://raas.dataiku.com/api.php

For example, visit: http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret

This returns 5 random numbers between 0 and 200.
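You can check the shape of the response from plain Python before writing any plugin code. A quick sketch using the requests library (the printed output assumes the API returns a bare JSON array of integers, which is also what the connector code below relies on):

import requests

r = requests.get("http://raas.dataiku.com/api.php",
                 params={"nb": 5, "max": 200, "apiKey": "secret"})
print(r.json())   # e.g. [42, 17, 158, 3, 199]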

Create the custom dataset

Custom datasets are a bit more complex to write than custom recipes, since we can't start from a regular recipe this time.

  • Go to the plugin developer page
  • Create a new dev plugin (or reuse the previous one)
  • In the dev plugin page, click on "Add Python Dataset"
  • Use the editor to modify the files. (Alternatively, on the DSS server, go to the plugin's folder, then to python-connectors/your-dataset-id)

We'll start with the connector.json file. Our custom dataset needs the user to input 3 parameters:

  • The number of random numbers
  • The maximum value
  • The API key

So let's create our params array:

    "params": [
        {
            "name": "apiKey",
            "label": "RAAS API Key",
            "type": "STRING",
            "description" : "You can enter more help here"
        },
        {
            "name": "nb",
            "label": "Number of random numbers",
            "type": "INT",
            "defaultValue" : 10 /* You can have the data prefilled */
        },
        {
            "name": "max",
            "label": "Max value",
            "type": "INT"
        }
    ],

For the Python part, we need to write a Python class.

In the constructor, we'll retrieve the parameters:

self.key = self.config["apiKey"]
self.nb = int(self.config["nb"])
self.max = int(self.config["max"])

We know in advance the schema of our dataset: it will only have one column, named "random", containing integers. So, in get_read_schema, let's return this schema:

def get_read_schema(self):
    return {
        "columns" : [
            { "name" : "random", "type" : "int" }
        ]
    }

Finally, the core of the connector is the generate_rows method. This method is a generator over dictionaries. Each yield in the generator becomes a row in the dataset.

If you don't know about generators in Python, you can have a look at https://wiki.python.org/moin/Generators
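In short, a generator is a function that produces its values lazily, one yield at a time, instead of building a full list in memory. A minimal example, unrelated to DSS:

def first_squares(n):
    # Yields one dictionary per value, just like generate_rows yields one per row
    for i in range(n):
        yield {"value": i * i}

for row in first_squares(3):
    print(row)   # {'value': 0}, then {'value': 1}, then {'value': 4}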

We'll be using the requests library to perform the API calls.

The final code of our dataset is:

from dataiku.connector import Connector
import requests

class MyConnector(Connector):

    def __init__(self, config):
        Connector.__init__(self, config)  # pass the parameters to the base class

        self.key = self.config["apiKey"]
        self.nb = int(self.config["nb"])
        self.max = int(self.config["max"])

    def get_read_schema(self):
        return {
            "columns" : [
                { "name" : "random", "type" : "int" }
            ]
        }

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                            partition_id=None, records_limit=-1):

        req = requests.get("http://raas.dataiku.com/api.php", params={
            "apiKey": self.key,
            "nb": self.nb,
            "max": self.max
        })

        array = req.json()
        for random_number in array:
            yield {"random": random_number}

(The other methods of the Connector class are not required at this point, so we removed them.)
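One possible refinement: generate_rows receives a records_limit argument (the maximum number of rows DSS actually needs, -1 meaning no limit), which our sketch ignores. A variant that caps the request accordingly, assuming as above that the nb parameter controls how many numbers the API returns:

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                            partition_id=None, records_limit=-1):
        # Never ask the API for more rows than DSS wants to read
        nb = self.nb if records_limit < 0 else min(self.nb, records_limit)
        req = requests.get("http://raas.dataiku.com/api.php", params={
            "apiKey": self.key,
            "nb": nb,
            "max": self.max
        })
        for random_number in req.json():
            yield {"random": random_number}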

Use it

About reloads

After editing the connector.json for a custom dataset, you must do the following:

  • Go to the plugin developer page
  • Click on your dev plugin
  • Click 'Reload descriptors'
  • Reload the DSS page in your browser

When modifying the connector.py file, you don't need to reload anything.

In the new dataset menu, you can now see your new dataset (try reloading your browser if this is not the case). You are presented with a UI to set the 3 required parameters.

  • Set "secret" as API Key
  • Set anything as nb and max
  • Click test
  • Your random numbers appear!

You can now hit Create, and you have created a new type of dataset. You can do anything with it, just as you would with any other DSS dataset.

About caching

There is no specific caching mechanism in custom datasets. Custom datasets are often used to access external APIs, and you may not want to perform another call on the API each time DSS needs to read the input dataset.

It is therefore highly recommended that the first thing you do with a custom dataset is to use a Sync or Prepare recipe to make a cached version in a first-party data store.

Sharing your plugin

Plugins are distributed as Zip archives. To share your dev plugins, you simply need to compress the contents of the plugin folder into a Zip file (the contents of the plugin folder must be at the root of the Zip file, i.e. there must not be any leading folder in the Zip file). The plugin.json file should be at the root of the Zip file.

A typical Zip file would therefore have this structure:

./plugin.json
./python-connectors/myapi-connector/connector.py
./python-connectors/myapi-connector/connector.json
./python-lib/myapi.py
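On Linux or macOS, you can build such an archive from inside the plugin folder, for example (assuming the standard zip command-line tool is installed):

cd /path/to/your/plugin
zip -r ../my-plugin.zip .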

You can then install this Zip file in any DSS instance by following the instructions.

If you want to distribute your plugin to all DSS users, we'd be glad to discuss it with you. Head over to our contributions repository or contact us!

Going further

We have only scratched the surface of what custom datasets and recipes can do. To see examples of more advanced features, have a look at the code of the publicly available DSS plugins.

In particular, have a look at the samples folder, which lists the features that each dataset and recipe in the public repository uses.