Visual data preparation

February 01, 2017

If you are new to the Dataiku DSS interface, follow our tutorials before reading this page.

The visual preparation recipe gives you access to over 80 built-in visual processors for code-free data wrangling. From text replacements to enrichment of complex data types and various reshaping operations, these processors help you prepare your data for the next steps.

To create a prepare recipe, select a dataset in the flow, then select “Prepare” in the actions panel.

Flow with Prepare recipe selection in the Actions sidebar

You can add steps to your preparation script by clicking on “add a new step” in the left part of the screen. When you add a processor to your script, the explore view shows a live preview of its effect on the sample of your data.

Prepare recipe script

To group steps in your script, click on “add a group”. You can then drag and drop steps into the different groups. Don’t forget to rename the groups to help your team quickly identify the different stages of your data preparation.

Grouping steps in a script

Reshaping data

Some processors of the prepare recipe can be used to reshape your data. Here is a list of these processors and their effects:

  • Transpose: Transpose flips your data so that rows become columns and columns become rows.

  • Pivot: The Pivot processor transforms multiple rows into columns. It uses one column as the index, another as labels, and a third as values. It creates one line per distinct index value and one column per distinct label, and fills them with the associated values.

  • Fold: Folding takes the values from multiple columns and turns them into one line per column. This operation is the opposite of a Pivot.

  • Unfold: Unfolding is used for categorical data and transforms cell values into binary columns. This process is also called “Dummification”.

  • Split and fold: The split and fold operation creates new lines by splitting the values within a column on a delimiter.
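
To make the reshaping operations above more concrete, here is a minimal plain-Python sketch of what Pivot, Fold, Unfold, and Split and fold do to rows. It is not how DSS implements these processors. The function names, the column names, and the choice to drop the original column in `unfold` are assumptions made for illustration only.

```python
# Plain-Python sketch of the reshaping operations described above.
# Rows are dicts; all function and column names are hypothetical, not DSS APIs.

def pivot(rows, index, labels, values):
    """One output row per distinct index value, one column per distinct label."""
    out = {}
    for row in rows:
        out.setdefault(row[index], {index: row[index]})[row[labels]] = row[values]
    return list(out.values())

def fold(rows, columns, name_col="column", value_col="value"):
    """One output row per (input row, folded column) pair."""
    result = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            result.append({**kept, name_col: col, value_col: row[col]})
    return result

def unfold(rows, column):
    """Turn each distinct value of `column` into a 0/1 indicator column."""
    categories = sorted({row[column] for row in rows})
    result = []
    for row in rows:
        # Assumption of this sketch: the original column is dropped.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        result.append(new_row)
    return result

def split_and_fold(rows, column, delimiter):
    """One output row per delimited value found in `column`."""
    return [{**row, column: part}
            for row in rows
            for part in row[column].split(delimiter)]

long_rows = [
    {"city": "Paris", "year": "2015", "sales": 10},
    {"city": "Paris", "year": "2016", "sales": 12},
    {"city": "Lyon", "year": "2015", "sales": 7},
]
wide_rows = pivot(long_rows, index="city", labels="year", values="sales")
refolded = fold([wide_rows[0]], columns=["2015", "2016"],
                name_col="year", value_col="sales")
dummies = unfold(long_rows, "city")
tags = split_and_fold([{"id": 1, "tags": "a,b"}], "tags", ",")
```

In this sketch, folding the pivoted Paris row gives back the two original Paris rows, which illustrates why Fold is described as the opposite of Pivot.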

Enriching data

Processors

Many of the processors available in prepare recipes can be used to enrich your data, especially when you handle complex data types. Here are a few examples:

  • Parsing dates: Parsing dates enables you to extract date-related information and create new columns from it.
  • Classify User-Agent: This processor parses and extracts information from a browser’s User-Agent string.
  • Resolve GeoIP: This processor resolves geographic information about an IP address.
  • Enrich from French postcode: This processor takes a column containing a French postcode and outputs several columns with demographic data about the cities using this postcode.
  • Geocode (API): This processor performs forward geocoding (address -> latitude / longitude) by using an external API.
  • Reverse geocoding (plugin): This processor performs reverse geocoding (latitude / longitude -> address).
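
As an illustration of the first item in the list above, the following plain-Python sketch parses a date column and derives new columns from it. It only mimics the idea of date parsing followed by extraction of date components. The function name, the column names, and the date format are hypothetical, and this is not the DSS implementation.

```python
from datetime import datetime

def enrich_with_date_parts(rows, column, fmt):
    """Parse `column` using `fmt` and add parsed, year, month, day and weekday columns."""
    enriched = []
    for row in rows:
        parsed = datetime.strptime(row[column], fmt)
        enriched.append({
            **row,
            f"{column}_parsed": parsed.isoformat(),
            f"{column}_year": parsed.year,
            f"{column}_month": parsed.month,
            f"{column}_day": parsed.day,
            f"{column}_weekday": parsed.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
    return enriched

# Hypothetical input: day/month/year strings, as commonly found in European data.
orders = [{"order_id": 1, "ordered_on": "01/02/2017"}]
enriched_orders = enrich_with_date_parts(orders, "ordered_on", "%d/%m/%Y")
```

The format string matters here: with `%d/%m/%Y` the value above is read as 1 February 2017, whereas `%m/%d/%Y` would read it as 2 January 2017.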

Visual recipes

Complex data enrichment operations can be carried out using the visual recipes of Dataiku DSS. Here are the most useful ones for data enrichment:

  • Group recipe: This recipe allows you to compute aggregations on any dataset, whether it’s a SQL dataset or not. You choose the keys that will correspond to unique lines in the output dataset, and select how to aggregate the values in all other columns for each key combination.

  • Window recipe: Compute sliding aggregations on a dataset where each line corresponds to, for example, a given time. Set up the Window Frame you want to compute by defining how many lines before and after each record you want to use, define which column to order your dataset by, and select aggregations for your features.

  • Join with… recipe: Enrich a dataset with the information from another by joining them together using a common key.
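
The three recipes above can be approximated in a few lines of plain Python, which may help clarify what each one computes. This is only a sketch under simplifying assumptions: a single aggregation (sum or mean), no partitioning in the window, and a join key that is unique in the second dataset. All function, dataset, and column names are hypothetical and do not correspond to DSS APIs.

```python
from collections import defaultdict

def group_sum(rows, keys, value):
    """Group analogue: one output row per key combination, summing `value`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return [{**dict(zip(keys, combo)), f"{value}_sum": total}
            for combo, total in totals.items()]

def window_mean(rows, order_by, value, before, after):
    """Window analogue: mean of `value` over a frame of rows around each row."""
    ordered = sorted(rows, key=lambda r: r[order_by])
    result = []
    for i, row in enumerate(ordered):
        # The frame covers `before` preceding rows, the current row, and `after` following rows.
        frame = ordered[max(0, i - before): i + after + 1]
        mean = sum(r[value] for r in frame) / len(frame)
        result.append({**row, f"{value}_mean": mean})
    return result

def left_join(left, right, key):
    """Join analogue: add the columns of the matching right row to each left row."""
    lookup = {row[key]: row for row in right}  # assumes `key` is unique in `right`
    return [{**row, **lookup.get(row[key], {})} for row in left]

sales = [
    {"day": 1, "store": "A", "amount": 10},
    {"day": 2, "store": "A", "amount": 20},
    {"day": 3, "store": "B", "amount": 30},
]
stores = [{"store": "A", "city": "Paris"}, {"store": "B", "city": "Lyon"}]

per_store = group_sum(sales, ["store"], "amount")
smoothed = window_mean(sales, "day", "amount", before=1, after=0)
joined = left_join(sales, stores, "store")
```

Here `before=1, after=0` gives each row a frame made of itself and the row just before it, so `smoothed` holds a two-row moving average ordered by `day`.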