Thanks to plugins, you can extend the features of DSS. You can write your own plugins, use them on your DSS instance, share them with your team or with the world.
If you want to know about using plugins, see our reference guide on plugins.
In this tutorial, we’ll be writing two custom components.
A plugin in DSS can contain several related components. Custom components can be:
For example, a plugin dedicated to Salesforce could contain several datasets to access several parts of the ecosystem, and several data enrichment recipes.
In this tutorial, we’ll be developing a custom recipe and a custom dataset.
By writing a custom recipe, you can add a new kind of recipe to DSS.
The idea is:
In the Python or R code of the recipe, the developer of the custom recipe uses a specific API to retrieve the inputs, outputs and parameters (i.e., the “instantiation parameters”) of the recipe.
The user of the recipe never has to touch the code of the recipe. To the user, a custom recipe looks like a simple DSS visual recipe.
Custom recipes can be used, for example, to package common tasks on datasets (auditing a dataset, performing some statistical analysis, …).
Custom datasets work in much the same way. However, unlike custom recipes, custom datasets can only be written in Python.
The dataset then behaves like all other DSS datasets. For example, you can then run a preparation recipe on this custom dataset.
Custom datasets can be used for example to connect to external data sources like REST APIs.
From the DSS home page, we recommend clicking “DSS tutorials” → “Code” → “Create your first DSS plugin”.
This way you get the example dataset wine_quality. But you can also create a new project and use your own datasets.
DSS comes with several tools that make it easier to write plugins.
To begin, enable the developer tools.
Now when you click on the Administration icon in the nav bar of DSS, you’ll see a new entry, “Plugin Developer”.
There are two kinds of plugins in DSS:
Let’s create a dev plugin:
This identifier should be globally unique, so we recommend prefixing it with your company name or a similar namespace.
Now that we have a skeleton for a plugin, we can add some components to it.
Let’s write a custom recipe that computes pairwise correlations (i.e., correlations between the values in pairs of columns). For example, in a car sales dataset, we might discover that the price has a strong negative correlation with the mileage. This recipe will be written in Python.
We could start writing the Python and JSON files for the recipe manually, but DSS makes it much easier. After all, writing a custom recipe is basically writing a regular recipe, and then making it “reusable”.
So let’s start by writing a “regular” Python recipe. For that, we’ll use an example dataset. It comes preloaded in the tutorial project. This is a dataset about the quality of wine.
Create a Python recipe from the “wine_quality” to a new “wine_correlation” dataset.
The code looks like this:
You can run the recipe and see the output: a dataset with 3 columns (col0, col1, corr) and one row per pair of input columns.
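The computation behind that output can be sketched in plain Python. This is only an illustration under assumptions: inside DSS the recipe would read wine_quality and write wine_correlation through the dataiku package, which is left out here so the sketch runs anywhere. The helper names pearson and pairwise_correlations, the choice of unordered pairs, and the sample values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Return one {col0, col1, corr} row per unordered pair of columns.

    `columns` maps a column name to its list of numeric values.
    """
    rows = []
    for col0, col1 in combinations(columns, 2):
        rows.append({
            "col0": col0,
            "col1": col1,
            "corr": pearson(columns[col0], columns[col1]),
        })
    return rows


# Made-up values standing in for a few columns of the input dataset.
sample = {
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "density": [0.999, 0.997, 0.996, 0.993],
    "quality": [5, 5, 6, 7],
}
correlation_rows = pairwise_correlations(sample)
```

With three input columns, the sketch produces three rows, one per pair, which matches the shape of the output dataset described above.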
Let’s make it a custom recipe:
For the rest of the tutorial, we’ll be tweaking the generated files by editing them on the DSS server.
If you go to the Flow view now and click on the “+ Recipe” button, you’ll see your new custom recipe there. Let’s edit it first.
First, let’s have a look at the recipe.json file. The most important things to change are the “inputRoles” and “outputRoles” arrays. Roles allow you to associate one or more datasets with each “kind” of input and output of the recipe.
Our recipe is a simple one: it has one “input role” with exactly 1 dataset, and one “output role” with exactly 1 dataset. Let’s call both roles “main”. So your JSON should look like this:
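As a rough sketch, the two role arrays can be built as a Python dictionary and printed as JSON for pasting into recipe.json. The field names used here (name, arity, required, acceptsDataset) and the value "UNARY" are assumptions based on the template DSS generates; check them against your own generated file. The helper dataset_role is hypothetical.

```python
import json


def dataset_role(name):
    """Describe a role that takes exactly one dataset (assumed field names)."""
    return {
        "name": name,
        "arity": "UNARY",        # assumed to mean "exactly one"
        "required": True,
        "acceptsDataset": True,
    }


roles = {
    "inputRoles": [dataset_role("main")],
    "outputRoles": [dataset_role("main")],
}

# JSON text for the two arrays, ready to merge into recipe.json by hand.
roles_json = json.dumps(roles, indent=4)
print(roles_json)
```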
Now let’s edit recipe.py. What we need to do is replace the hardcoded dataset names with references to the roles. The recipe.py file contains extensive explanations on how to do that. In the end, your recipe.py should start like:
and end like:
Verify that neither “wine_quality” nor “wine_correlation” appears anywhere in your recipe.
Don’t forget to comment out the sample code that accesses parameters (the lines starting with my_variable =); we’ll come back to parameters later.
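Replacing hardcoded names with role lookups might look like the sketch below. It assumes the get_input_names_for_role and get_output_names_for_role functions from dataiku.customrecipe, which are only importable inside DSS; the fallback stand-ins, their placeholder dataset names, and the helper single_dataset_name are hypothetical and exist only so the sketch can run outside DSS.

```python
try:
    # Available only when the code runs inside DSS.
    from dataiku.customrecipe import (
        get_input_names_for_role,
        get_output_names_for_role,
    )
except ImportError:
    # Hypothetical stand-ins returning placeholder names outside DSS.
    def get_input_names_for_role(role):
        return ["input_dataset_placeholder"] if role == "main" else []

    def get_output_names_for_role(role):
        return ["output_dataset_placeholder"] if role == "main" else []


def single_dataset_name(names, role):
    """Return the only dataset name bound to a role, or fail loudly."""
    if len(names) != 1:
        raise ValueError(
            "role %r expects exactly 1 dataset, got %d" % (role, len(names))
        )
    return names[0]


# Both roles are called "main" in recipe.json.
input_name = single_dataset_name(get_input_names_for_role("main"), "main")
output_name = single_dataset_name(get_output_names_for_role("main"), "main")
```

The rest of the recipe would then use input_name and output_name wherever the hardcoded dataset names used to appear.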
After editing the recipe.json for a custom recipe, you must do the following:
When modifying the recipe.py file, you don't need to reload anything. Simply run the recipe again.
We’d like to show only the “strong” correlations (i.e., the corr values that are closest to +1 or -1).
Let’s give the user the ability to choose the threshold.
Edit the recipe.json and declare just one parameter:
In recipe.py, retrieve the parameter at the beginning:
And filter out the “bad” correlations:
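One way to sketch these two steps in plain Python is shown below. Inside DSS the configuration dictionary would come from get_recipe_config() in dataiku.customrecipe; here a literal dictionary stands in for it. The parameter name threshold, the fallback default of 0.5, the helper keep_strong_correlations, the use of >= rather than >, and the sample rows are all assumptions made for illustration.

```python
def keep_strong_correlations(rows, threshold):
    """Keep rows whose correlation magnitude is at least `threshold`."""
    return [row for row in rows if abs(row["corr"]) >= threshold]


# Stand-in for the dict returned by get_recipe_config() inside DSS.
recipe_config = {"threshold": 0.65}
threshold = float(recipe_config.get("threshold", 0.5))

# Made-up correlation rows in the (col0, col1, corr) shape used earlier.
rows = [
    {"col0": "a", "col1": "b", "corr": 0.9},
    {"col0": "a", "col1": "c", "corr": -0.7},
    {"col0": "b", "col1": "c", "corr": 0.1},
]
strong_rows = keep_strong_correlations(rows, threshold)
```

Comparing the absolute value matters: a correlation of -0.7 is as “strong” as one of +0.7, so both should survive the filter.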
Reload everything and go back to the recipe screen. Set the threshold to 0.65 for example. Only two correlations remain!
For our custom dataset, we’re going to read a truly wonderful REST API: the Dataiku RaaS (Randomness as a Service) API.
Simply put, this is an API that returns random numbers. Random numbers are very useful, so we want to extend DSS with the ability to use this API.
To use the API, we have to perform a GET query on http://raas.dataiku.com/api.php
For example, visit: http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret
This returns 5 random numbers between 0 and 200.
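The query string in that example can be assembled with the standard library, as in the short sketch below. The function name build_raas_url and its argument names are hypothetical; only the endpoint and the nb, max and apiKey query parameters come from the example URL above. The sketch builds the URL without calling the API.

```python
from urllib.parse import urlencode

RAAS_ENDPOINT = "http://raas.dataiku.com/api.php"


def build_raas_url(nb, max_value, api_key):
    """Build the GET URL asking for `nb` numbers between 0 and `max_value`."""
    query = urlencode({"nb": nb, "max": max_value, "apiKey": api_key})
    return RAAS_ENDPOINT + "?" + query


example_url = build_raas_url(5, 200, "secret")
print(example_url)
```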
Custom datasets are a bit more complex to write than custom recipes since we can’t start from a regular recipe.
We’ll start with the connector.json file. Our custom dataset needs the user to input 3 parameters:
So let’s create our params array:
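A possible shape for that array is sketched below as a Python list that is dumped to JSON. The three parameter names are inferred from the example URL (apiKey, nb, max); the labels, default values, and the exact field names and type strings (STRING, INT, mandatory, defaultValue) are assumptions to verify against the connector.json that DSS generates.

```python
import json

# Hypothetical parameter declarations for connector.json.
connector_params = [
    {"name": "apiKey", "label": "API key", "type": "STRING", "mandatory": True},
    {
        "name": "nb",
        "label": "Number of random numbers",
        "type": "INT",
        "defaultValue": 5,
        "mandatory": True,
    },
    {
        "name": "max",
        "label": "Maximum value",
        "type": "INT",
        "defaultValue": 200,
        "mandatory": True,
    },
]

# JSON text for the params array, ready to merge into connector.json by hand.
params_json = json.dumps({"params": connector_params}, indent=4)
print(params_json)
```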
For the Python part, we need to write a Python class.
In the constructor, we’ll retrieve the parameters:
We know in advance the schema of our dataset: it will only have one column named “random” containing integers. So, in get_read_schema, let’s return this schema.
Finally, the core of the connector is the generate_rows method. This method is a generator over dictionaries. Each yield in the generator becomes a row in the dataset.
If you don’t know about generators in Python, you can have a look at https://wiki.python.org/moin/Generators
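As a quick refresher, here is a tiny generator over dictionaries, written in the same spirit as generate_rows. The function name rows_from_numbers and the input values are made up for the example.

```python
def rows_from_numbers(numbers):
    """Yield one row dictionary per number, lazily."""
    for number in numbers:
        # Each yield hands one row to the caller, then pauses here.
        yield {"random": number}


row_iterator = rows_from_numbers([4, 8, 15])
first_row = next(row_iterator)        # runs the body up to the first yield
remaining_rows = list(row_iterator)   # consumes the rest
```

Nothing is computed until the caller asks for the next row, which is why a generator suits a dataset that may be read only partially.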
We’ll be using the requests library to perform the API calls.
The final code of our dataset is:
(All other methods are not required at this point, so we removed them).
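Putting the pieces together, a connector along these lines might look like the sketch below. It rests on several assumptions: that dataiku.connector.Connector is the base class and accepts (config, plugin_config), that the API answers with a JSON array of integers, and that the parameters are named apiKey, nb and max. The class name RaasConnector, the injectable fetch argument, and the fallback Connector stand-in are hypothetical additions that let the sketch run outside DSS and without network access.

```python
try:
    from dataiku.connector import Connector  # available only inside DSS
except ImportError:
    class Connector(object):
        """Minimal stand-in so the sketch can run outside DSS."""

        def __init__(self, config, plugin_config=None):
            self.config = config
            self.plugin_config = plugin_config


def fetch_random_numbers(url, params):
    """Default fetcher: GET the URL and decode the body as JSON (assumed list)."""
    import requests  # imported lazily so the class can be tried without it

    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()


class RaasConnector(Connector):
    ENDPOINT = "http://raas.dataiku.com/api.php"

    def __init__(self, config, plugin_config=None, fetch=fetch_random_numbers):
        Connector.__init__(self, config, plugin_config)
        # Retrieve the three user-supplied parameters.
        self.api_key = config["apiKey"]
        self.nb = int(config["nb"])
        self.max_value = int(config["max"])
        self.fetch = fetch

    def get_read_schema(self):
        # One integer column named "random".
        return {"columns": [{"name": "random", "type": "int"}]}

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                      partition_id=None, records_limit=-1):
        # records_limit is ignored in this sketch.
        params = {"nb": self.nb, "max": self.max_value, "apiKey": self.api_key}
        for number in self.fetch(self.ENDPOINT, params):
            yield {"random": number}


def fake_fetch(url, params):
    """Offline replacement for the API call, used only for this demo."""
    return list(range(params["nb"]))


connector = RaasConnector({"apiKey": "secret", "nb": 3, "max": 200}, fetch=fake_fetch)
demo_rows = list(connector.generate_rows())
```

Passing the fetcher as an argument is a design choice made for this sketch: it keeps the row-generation logic testable without touching the real API.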
After editing the connector.json for a custom dataset, you must do the following:
When modifying the connector.py file, you don't need to reload anything.
In the new dataset menu, you can now see your new dataset (try reloading your browser if this is not the case). You are presented with a UI to set the 3 required parameters.
You can now hit Create, and you have created a new type of dataset. You can now do anything with it, like you would do for any other DSS dataset.
There is no specific caching mechanism in custom datasets. Custom datasets are often used to access external APIs, and you may not want to perform another call on the API each time DSS needs to read the input dataset.
It is therefore highly recommended that the first thing you do with a custom dataset is to use either a Prepare or a Sync recipe to make a cached copy in a first-party data store.
Plugins are distributed as Zip archives. To share a dev plugin, compress the contents of the plugin folder into a Zip file. The contents must sit at the root of the archive, with no leading folder, so that the plugin.json file is at the root of the Zip file.
A typical Zip file would therefore have this structure:
./plugin.json
./python-datasets/myapi-connector/connector.py
./python-datasets/myapi-connector/connector.json
./python-lib/myapi.py
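If you prefer to script the packaging, the sketch below zips the contents of a plugin folder so that plugin.json lands at the root of the archive. The function name zip_plugin and the throwaway folder layout used in the demonstration are hypothetical; only the standard library is used.

```python
import os
import tempfile
import zipfile


def zip_plugin(plugin_dir, zip_path):
    """Zip the contents of plugin_dir so plugin.json sits at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subfolders, files in os.walk(plugin_dir):
            for filename in files:
                full_path = os.path.join(folder, filename)
                # Store paths relative to the plugin folder: no leading folder.
                archive.write(full_path, os.path.relpath(full_path, plugin_dir))


# Demonstration on a throwaway folder with two empty files.
with tempfile.TemporaryDirectory() as workdir:
    plugin_dir = os.path.join(workdir, "my-plugin")
    os.makedirs(os.path.join(plugin_dir, "python-lib"))
    for relative in ("plugin.json", os.path.join("python-lib", "myapi.py")):
        with open(os.path.join(plugin_dir, relative), "w") as handle:
            handle.write("")

    zip_path = os.path.join(workdir, "my-plugin.zip")
    zip_plugin(plugin_dir, zip_path)

    with zipfile.ZipFile(zip_path) as archive:
        archived_names = sorted(archive.namelist())
```

Writing the archive outside the plugin folder, as done here, avoids accidentally packing the Zip file into itself.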
You can then install this Zip file in any DSS by following the instructions.
We have only scratched the surface of what custom datasets and recipes can do:
To see examples of all of these features, have a look at the code of the publicly available DSS plugins.
In particular, have a look at the samples folder, which lists the features that each dataset and recipe in the public repository uses.