Thanks to plugins, you can extend the features of Dataiku DSS. You can write your own plugins, use them on your Dataiku DSS instance, and share them with your team or with the world.
If you want to know about using plugins, see our reference guide on plugins.
A plugin in DSS can contain several related components.
For example, a plugin dedicated to Salesforce could contain several datasets to access several parts of the ecosystem, and several data enrichment recipes.
In this tutorial, we’ll be developing a custom recipe and a custom dataset.
By writing a custom recipe, you can add a new kind of recipe to DSS.
The idea is:
In the Python or R code of the recipe, the developer of the custom recipe uses a specific API to retrieve the inputs, outputs, and parameters (i.e., the “instantiation parameters”) of the recipe.
The user of the recipe never has to touch the code of the recipe. To them, a custom recipe is like a simple DSS visual recipe.
Custom recipes can be used, for example, to package common tasks on datasets (auditing a dataset, performing statistical analysis, …).
Custom datasets work in a similar way. However, unlike custom recipes, custom datasets can only be written in Python.
The dataset then behaves like all other DSS datasets. For example, you can then run a preparation recipe on this custom dataset.
Custom datasets can be used, for example, to connect to external data sources like REST APIs.
The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project > DSS Tutorials > Code > Create your first DSS plugin.
This includes the example dataset wine_quality.
There are two kinds of plugins in DSS: dev plugins, which live in a folder on the instance and can be edited in place, and installed plugins, which are distributed as Zip archives.
Let’s create a dev plugin:
This identifier should be globally unique, so we recommend prefixing it with your company name or similar.
Now that we have a skeleton for a plugin, we can add some components to it.
Let’s write a custom recipe that computes pairwise correlations (i.e., correlations between the values in pairs of columns). For example, in a car sales dataset, we might discover that the price has a strong anti-correlation to the mileage. This recipe will be written in Python.
We will start by writing a Python recipe in the flow of the tutorial project, and then make it “reusable”.
Create a Python recipe from the “wine_quality” dataset to a new “wine_correlation” dataset.
The recipe code looks like the following:
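The original code listing is not reproduced here; below is a minimal sketch of what such a recipe could look like (assuming pandas and the dataiku recipe API, and enumerating column pairs with itertools.combinations; the column names col0, col1, and corr match the output described below):

```python
import itertools
import dataiku
import pandas as pd

# Read the input dataset into a pandas DataFrame
input_dataset = dataiku.Dataset("wine_quality")
df = input_dataset.get_dataframe()

# Compute the Pearson correlation for every pair of numeric columns
numeric_df = df.select_dtypes(include="number")
rows = []
for col0, col1 in itertools.combinations(numeric_df.columns, 2):
    rows.append({"col0": col0,
                 "col1": col1,
                 "corr": numeric_df[col0].corr(numeric_df[col1])})

# Write the result
output_dataset = dataiku.Dataset("wine_correlation")
output_dataset.write_with_schema(pd.DataFrame(rows, columns=["col0", "col1", "corr"]))
```

This code runs inside a DSS Python recipe, where the dataiku package is available.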
You can run the recipe and see the output: a dataset with 3 columns (col0, col1, corr) and one row per pair of input columns.
Let’s make it a custom recipe:
Give it the identifier compute_correlation, since we expect to use this recipe beyond the wine dataset.
For the rest of the tutorial, we’ll now be tweaking the generated files.
First, let’s have a look at the recipe.json file. The most important thing to change is the “inputRoles” and “outputRoles” arrays. Roles allow you to associate one or more datasets with each “kind” of input and output of the recipe.
Our recipe is a simple one: it has one “input role” with exactly 1 dataset, and one “output role” with exactly 1 dataset. Edit your JSON to look like:
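A sketch of what the roles section could look like (the role names input_dataset and output_dataset are our own illustrative choice, not mandated by DSS):

```json
"inputRoles": [
    {
        "name": "input_dataset",
        "label": "Input dataset",
        "arity": "UNARY",
        "required": true,
        "acceptsDataset": true
    }
],
"outputRoles": [
    {
        "name": "output_dataset",
        "label": "Output dataset",
        "arity": "UNARY",
        "required": true,
        "acceptsDataset": true
    }
]
```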
We’d like to allow users of this plugin to be able to focus on “strong” correlations (i.e., the corr figures that are closest to +1 or -1).
We can specify a threshold parameter that can be set in the recipe dialog by editing the params section of recipe.json:
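For example (the parameter name threshold, its label, and its default value are our own illustrative choices):

```json
"params": [
    {
        "name": "threshold",
        "label": "Correlation threshold",
        "description": "Only keep column pairs whose absolute correlation is at least this value",
        "type": "DOUBLE",
        "defaultValue": 0.5
    }
]
```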
Now let’s edit recipe.py. The default contents include some generic starter code for referencing roles and parameters, the code from your Python recipe, and some comments that explain how to finish creating your custom recipe. In the end, your recipe.py should start with code for retrieving datasets and parameters like:
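A sketch, assuming illustrative role names input_dataset/output_dataset and a threshold parameter defined in recipe.json:

```python
import dataiku
from dataiku.customrecipe import (get_input_names_for_role,
                                  get_output_names_for_role,
                                  get_recipe_config)

# Names of the datasets the user mapped to each role
input_name = get_input_names_for_role('input_dataset')[0]
output_name = get_output_names_for_role('output_dataset')[0]

# Recipe parameters, as defined in recipe.json
threshold = float(get_recipe_config().get('threshold', 0.5))
```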
The portion of your original recipe that reads inputs needs to be updated to refer to the datasets created from the input roles, like:
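For example (the role name input_dataset is illustrative):

```python
import dataiku
from dataiku.customrecipe import get_input_names_for_role

# Resolve the dataset the user mapped to the input role, then read it
input_dataset = dataiku.Dataset(get_input_names_for_role('input_dataset')[0])
df = input_dataset.get_dataframe()
```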
The portion of your original recipe that computes the correlations should be updated to include the threshold to filter out the weak correlations:
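One way to sketch this filtering step, as a standalone pandas function (the function and column names are illustrative):

```python
import itertools
import pandas as pd

def pairwise_correlations(df, threshold):
    """Return rows (col0, col1, corr) for column pairs whose absolute
    Pearson correlation is at least `threshold`."""
    rows = []
    for col0, col1 in itertools.combinations(df.columns, 2):
        corr = df[col0].corr(df[col1])
        if abs(corr) >= threshold:
            rows.append({"col0": col0, "col1": col1, "corr": corr})
    return pd.DataFrame(rows, columns=["col0", "col1", "corr"])
```

In the recipe itself, threshold would come from get_recipe_config() as shown in the DSS starter code.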
The portion of your original recipe that writes the output datasets also needs to be updated to refer to the datasets created from the output roles, like:
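For example (the role name output_dataset is illustrative, and correlations_df stands for whatever DataFrame your recipe produced):

```python
import dataiku
from dataiku.customrecipe import get_output_names_for_role

# Resolve the dataset the user mapped to the output role, then write to it
output_dataset = dataiku.Dataset(get_output_names_for_role('output_dataset')[0])
output_dataset.write_with_schema(correlations_df)
```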
Verify that “wine_quality” and “wine_correlation” no longer appear anywhere in your recipe. In general, the rest of recipe.py can be left as-is.
After editing the recipe.json for a custom recipe, you must reload the plugin for the changes to be taken into account.
When modifying the recipe.py file, you don't need to reload anything. Simply run the recipe again.
For our custom dataset, we’re going to read the Dataiku RaaS (Randomness as a Service) REST API.
This API returns random numbers; we’ll use it as a simple example of extending Dataiku DSS with a new data source.
To use the API, we have to perform a GET query on http://raas.dataiku.com/api.php
For example, visit: http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret
This returns 5 random numbers between 0 and 200.
Custom datasets are a bit more complex to write than custom recipes, since we can’t start from an existing recipe in the Flow.
Give it the identifier raas and click Add.
We’ll start with the connector.json file. Our custom dataset needs the user to input 3 parameters: nb (how many numbers to fetch), max (the maximum value), and apiKey.
So let’s create our params array:
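A sketch of the three parameters, matching the query parameters of the API URL above (the labels and default values are our own illustrative choices):

```json
"params": [
    {
        "name": "nb",
        "label": "Number of random numbers",
        "type": "INT",
        "defaultValue": 10
    },
    {
        "name": "max",
        "label": "Maximum value",
        "type": "INT",
        "defaultValue": 100
    },
    {
        "name": "apiKey",
        "label": "API key",
        "type": "PASSWORD"
    }
]
```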
For the Python part, we need to write a Python class.
In the constructor, we’ll retrieve the parameters:
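A sketch, assuming the class is named RaaSConnector (the name is our own choice) and the parameter names nb/max/apiKey from connector.json:

```python
from dataiku.connector import Connector

class RaaSConnector(Connector):
    def __init__(self, config):
        Connector.__init__(self, config)
        # Retrieve the parameters the user entered in the dataset UI
        self.nb = int(self.config["nb"])
        self.max = int(self.config["max"])
        self.api_key = self.config["apiKey"]
```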
We know in advance the schema of our dataset: it will only have one column named “random” containing integers. So, in get_read_schema, let’s return this schema:
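For example (a sketch; the schema is a dictionary with a list of column definitions):

```python
    def get_read_schema(self):
        # One integer column named "random"
        return {"columns": [{"name": "random", "type": "int"}]}
```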
Finally, the core of the connector is the generate_rows method. This method is a generator over dictionaries. Each yield in the generator becomes a row in the dataset.
If you don’t know about generators in Python, you can have a look at https://wiki.python.org/moin/Generators
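As a minimal illustration of how such a generator works (pure Python, independent of DSS):

```python
def generate_rows(numbers):
    # Each yielded dictionary becomes one row in the dataset,
    # keyed by column name.
    for n in numbers:
        yield {"random": n}

rows = list(generate_rows([7, 13, 42]))
```

Nothing is computed until the generator is iterated, which lets DSS stream rows without loading everything in memory.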
We’ll be using the requests library to perform the API calls.
The final code of our dataset is:
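Since the original listing is not reproduced here, below is a sketch of what the final connector.py could look like (assuming the API returns a JSON array of numbers; error handling is kept minimal, and this runs only inside DSS where the dataiku package is available):

```python
import requests
from dataiku.connector import Connector

class RaaSConnector(Connector):
    def __init__(self, config):
        Connector.__init__(self, config)
        self.nb = int(self.config["nb"])
        self.max = int(self.config["max"])
        self.api_key = self.config["apiKey"]

    def get_read_schema(self):
        return {"columns": [{"name": "random", "type": "int"}]}

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                      partition_id=None, records_limit=-1):
        # Query the RaaS API with the user-supplied parameters
        response = requests.get("http://raas.dataiku.com/api.php",
                                params={"nb": self.nb,
                                        "max": self.max,
                                        "apiKey": self.api_key})
        response.raise_for_status()
        # Each value becomes one row of the "random" column
        for value in response.json():
            yield {"random": int(value)}
```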
(The other methods are not required at this point, so we removed them.)
After editing the connector.json for a custom dataset, you must reload the plugin for the changes to be taken into account.
When modifying the connector.py file, you don't need to reload anything.
In the new dataset menu, you can now see your new dataset (try reloading your browser if this is not the case). You are presented with a UI to set the 3 required parameters.
You can now hit Create, and you have created a new type of dataset. You can now do anything with it, like you would do for any other DSS dataset.
There is no specific caching mechanism in custom datasets. Custom datasets are often used to access external APIs, and you may not want to perform another call on the API each time DSS needs to read the input dataset.
It is therefore highly recommended that the first thing you do with a custom dataset is to use a Prepare or Sync recipe to build a cached copy of the data in a regular data store.
Plugins are distributed as Zip archives. To share your dev plugins, simply select Download this plugin from the actions menu of the plugin. Alternatively, you can compress the contents of the plugin folder to a Zip file (the contents of the plugin folder must be at the root of the Zip file, i.e., there must not be any leading folder in the Zip file). The plugin.json file should be at the root of the Zip file.
A typical Zip file would therefore have this structure:
./plugin.json
./python-connectors/myapi-connector/connector.py
./python-connectors/myapi-connector/connector.json
./python-lib/myapi.py
You can then install this Zip file in any DSS by following the instructions.
We have only scratched the surface of what custom datasets and recipes can do:
To see examples of all of these features, have a look at the code of the publicly available DSS plugins.
In particular, have a look at the samples folder, which lists which feature each dataset and recipe in the public repository uses.