Data Preparation Walkthrough

The Visual Recipes

Do you want to compute aggregations, join datasets, transfer data between sources, filter, split, or merge? This can all be achieved using the visual recipes:

Sync, Prepare, Sample/Filter, Group, Distinct, Window, Join with…, Split, Top N, Sort, Pivot, Stack, Push to editable, Export to folder.

This course covers our visual recipes in detail!
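To make the recipes concrete, here is a minimal pandas sketch (the dataset and column names are invented for illustration) of what the Group, Sample/Filter, and Top N recipes compute:

```python
import pandas as pd

# Hypothetical sales data standing in for a DSS dataset.
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "amount": [100, 250, 80, 300, 120],
})

# Group recipe analogue: aggregate amount per region.
totals = df.groupby("region", as_index=False)["amount"].sum()

# Sample/Filter recipe analogue: keep only rows above a threshold.
large = df[df["amount"] > 100]

# Top N recipe analogue: the two largest rows by amount.
top2 = df.nlargest(2, "amount")
```

The visual recipes express these same operations without code, and DSS can push the computation down to the underlying database or cluster.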

Data Wrangling

Visual data preparation

Dataiku DSS preparation scripts enable advanced data wrangling and instant visualizations.


Use Cases

Web logs

Learn how to enrich your datasets containing rich types by following our how-to guide on enriching web logs. It covers geographic enrichment of IP addresses as well as user agent and URL parsing.
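As a rough illustration of the URL-parsing step (the URL itself is invented), Python's standard library can split a logged URL into the components a Prepare recipe would extract:

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical URL from a web log line.
url = "https://example.com/products/shoes?utm_source=newsletter&page=2"

parts = urlparse(url)
query = parse_qs(parts.query)

host = parts.netloc               # host component
path = parts.path                 # path component
source = query["utm_source"][0]   # one query-string parameter
```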

Merging and joining in a prepare recipe

To understand advanced joins in the Prepare recipe, read our tutorial covering these concepts.
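The join semantics involved can be sketched in pandas (the tables here are invented examples): a left join keeps every row of the left dataset, while an inner join keeps only matching rows.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3], "total": [50, 75, 20]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["FR", "DE"]})

# Left join: every order survives, even without a matching customer.
left = orders.merge(customers, on="customer_id", how="left")

# Inner join: only orders with a known customer survive.
inner = orders.merge(customers, on="customer_id", how="inner")
```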

More Content on Data Wrangling

Datasets schema and columns meaning

Data wrangling starts with understanding your columns’ properties, such as name, comments, storage type, and business meaning. Make sure you understand the difference between the storage type and the meaning of your data.
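The distinction can be illustrated in pandas (an invented example, not DSS internals): a column may be stored as text while its meaning is numeric or a date, and recognizing the meaning is what enables the right processors.

```python
import pandas as pd

# Both columns are *stored* as strings...
df = pd.DataFrame({
    "signup": ["2023-01-05", "2023-02-11"],
    "score": ["12", "7"],
})
assert df["score"].dtype == object   # storage type: string

# ...but their *meaning* is numeric and date, respectively.
numeric = pd.to_numeric(df["score"])
dates = pd.to_datetime(df["signup"])
```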

Writing formulas

You can find descriptions of all functions of the formula language in our reference documentation. When editing a formula in DSS, the same reference is available in the editor, under the reference tab.
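The typical use of a formula is to derive a new column from existing ones. A pandas sketch of the same idea (invented column names, not the formula language itself):

```python
import pandas as pd

df = pd.DataFrame({"price": [9.99, 25.0, 60.0]})

# Formula-style computed column: categorize rows by a condition.
df["band"] = df["price"].apply(lambda p: "high" if p >= 50 else "standard")
```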

Handling dates

Parsing dates is a very common preprocessing step that you can chain with powerful processors, such as extracting date components or enriching your data with holiday information.
Read the reference documentation of the Parse date processor
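The parse-then-extract chain can be sketched in pandas (the timestamps and the day-first format are invented assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"ts": ["05/01/2023 14:30", "11/02/2023 09:05"]})

# Parse the date strings first, then extract components from the
# parsed dates, mirroring a Parse date step chained with extraction.
df["parsed"] = pd.to_datetime(df["ts"], format="%d/%m/%Y %H:%M")
df["year"] = df["parsed"].dt.year
df["weekday"] = df["parsed"].dt.day_name()
```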

Reshaping data

Tabular data is typically stored in long or wide format. Reshaping data is the act of converting from one format to the other.

The Pivot recipe reshapes a dataset from long to wide format.

Some processors of the Prepare recipe can be used to reshape your data.
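The long/wide conversion described above can be sketched in pandas (the stores and months are invented): `pivot` goes long to wide, as the Pivot recipe does, and `melt` goes back.

```python
import pandas as pd

# Long format: one row per (store, month) observation.
long_df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [10, 12, 7, 9],
})

# Long -> wide: one column per month, one row per store.
wide = long_df.pivot(index="store", columns="month", values="sales").reset_index()

# Wide -> long: back to one row per observation.
back = wide.melt(id_vars="store", var_name="month", value_name="sales")
```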

Distributed execution

Visual preparation recipes can run in a distributed fashion on Hadoop and Spark.