DSS 101: Your very first steps

As of DSS 4.0.8, this tutorial is outdated. Please head over to the new Basics tutorial.

Dear user, welcome to the first Dataiku DSS tutorial!

The first step of any typical data science project is to load and explore datasets. In this tutorial, we will look at how to do that in Dataiku DSS.

We will learn step by step how to:

  • load a new file
  • explore its content
  • clean it up
  • visualize data and build charts

On our way through this hands-on tutorial, we will go through some very important concepts in DSS:

  • the Dataset
  • the Lab module (particularly the Visual analysis)
  • the Charts builder

You can read along for a written version of the tutorial, or watch this 20-minute video to help you with your first steps in the Studio.

Let’s get started!

In this tutorial, we will analyze the order log of a fictional T-shirt making company called “Haiku T-Shirt”.

Create your project

If you have not done it yet, start the tutorial within DSS by clicking on the “Tutorials” button in the left pane. Select Tutorial 101.

Load the dataset

You are now on your project home page. Let’s start by loading our first Dataset in DSS! Click on the Import your first dataset button:

Screenshot of the project home page showing the import your first dataset button

DSS will let you connect to a wide variety of data sources. For this tutorial, we’ll upload a file from your own computer to DSS.

First of all, start by downloading the orders CSV file.

Once you have the file on your computer, click on Upload your files.

Screenshot of file upload page: drag and drop your files

You are presented with the Upload dialog. Click on Add a file, select the file you just downloaded, and validate.

The file is uploaded to DSS. Let’s now check if DSS detected our CSV format correctly by clicking on the Preview button:

Screenshot of file upload page: connection to data source succeeded

The import is almost perfect. The CSV has been detected as using a Tab separator. You can see the data is in a tabular format, with columns (called features) and lines (which represent observations). One thing is wrong with our dataset though: apparently the file contains a blank line between the header and the data. Let’s input 1 in the Skip next lines field to sort this out:

Screenshot of format parameters page
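Under the hood, this step is ordinary CSV parsing. If you wanted to reproduce the same import outside DSS, a minimal pandas sketch might look like this (the sample data is hypothetical, just to illustrate the tab separator and the blank line):

```python
import io
import pandas as pd

# A tiny stand-in for the orders file: tab-separated, with a blank
# line between the header and the data (hypothetical sample data).
raw = "order_id\tnb_tshirts\tdepartment\n\n1\t2\t75\n2\t1\t13\n"

# sep="\t" matches the detected Tab separator; skip_blank_lines
# drops the stray empty line, like "Skip next lines: 1" in DSS.
df = pd.read_csv(io.StringIO(raw), sep="\t", skip_blank_lines=True)
```

This is only an analogy for what the format parameters do, not how DSS parses files internally.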

We can now give our new dataset a name. Enter haiku_shirt_sales in the field at the top of the screen. Finally, we need to save our work by either hitting the Create button or using the shortcut Ctrl-S.

Screenshot of data prep demo: new dataset name


In Dataiku DSS, a Dataset is any piece of data that you have in tabular format. A CSV file like haiku_shirt_sales.csv is a dataset. An Excel file is also a dataset.

More generally, companies (and people) have systems to store all their data. They can store it in an Excel file, in a relational database, or in distributed storage systems if they have larger amounts of data.

Most of the time, importing a dataset means that you merely inform Dataiku DSS of how it can access the dataset.

An external dataset remembers the location of the original data. Dataiku DSS can be set to connect to this data so that whenever the original data changes, the corresponding dataset in the Studio is automatically updated.

For more information about datasets, check out the concepts

Explore your data

Your dataset has been created, and you are now taken to a tabular view of your data where you can start exploring it.

For each column, you should see that Dataiku DSS has detected a meaning, in blue (in our case Text, Number, or Date (unparsed)). A gauge indicates the ratio of the column for which the values do not seem to match the meaning (in red) or are completely missing (blank). In our dataset, for example, the department column has empty values as well as invalid data.

Screenshot viewing dataset sample: cleaning data

The Data Scientist’s Murphy’s law states that real-world data is never in the right format. Cleaning it up usually means a chain of scripts, full of business logic, that is difficult to maintain. Sadly, a large part of the Data Scientist’s job is cleaning up data. Dataiku DSS has a dedicated tool to make this task more user-friendly.

Let’s get started with data manipulation: click on Lab.

Screenshot of lab window: new visual analysis

The Lab window opens. The Lab is where you will iteratively work on your dataset to dig deeper into it. In this tutorial, we are going to use the Visual analysis part of the Lab.

Click on the New button below Visual analysis. You will be prompted to specify a name for your analysis. Let’s leave the default name for now:

Screenshot of lab window: analysis name

You are now in the Visual Analysis, one of the central places of DSS where you can:

  • Clean and enrich your data
  • Build charts on your enriched data
  • Build machine learning models on your enriched data.

Screenshot of visual analysis screen

You will usually need to uncover the content of new datasets. DSS provides a set of useful features to perform these tasks. For instance, if you click on “Analyse” from the menu available from the column headers, you will get instant access to summary statistics:

Screenshot of drop down analyze menu

Screenshot of summary statistics window

As analyzing each column one by one can get tedious with large datasets, DSS has a very handy Quick Columns view mode (last button on the right) that lets you display these statistics on several columns at a time:

Screenshot of columns quick view

Note that the statistics depend on the nature of the data:

  • average, min, max… for numerical and date columns
  • frequencies for textual columns

You can switch between columns using the chevrons at the top of the Analyze modal.
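The split between numerical and textual statistics maps directly onto familiar operations. As a rough pandas equivalent (with hypothetical sample data, not the actual tutorial dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "nb_tshirts": [2, 1, 3, 1],                           # numerical column
    "category": ["Black T-Shirt M", "Hoodie",
                 "Black T-Shirt M", "White T-Shirt F"],   # textual column
})

# Numerical columns get average, min, max, and so on...
numeric_stats = df["nb_tshirts"].describe()

# ...while textual columns get value frequencies.
category_freq = df["category"].value_counts()
```

DSS computes and displays these for you; the sketch only illustrates which statistic applies to which kind of column.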

Clean and enrich your data

While exploring your dataset, you may have noticed some potential issues. For instance, we saw earlier that the department column has some missing values.

These missing values can pollute the analysis, so you may then want to remove them. Click on the menu from the header of the column, click on “More”, then select “Remove rows with no value”:

Screenshot of the menu, clicked for more actions: remove rows with no value

Notice that the removal action has appeared in the Script tab on the left. You just added the first Step to your script.
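In spirit, this step is the same as dropping rows with a missing value in one column. A hedged pandas sketch of the equivalent operation, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "department": ["75", None, "2A"],   # one missing department
})

# Equivalent of the "Remove rows with no value" step on department.
cleaned = df.dropna(subset=["department"])
```

In DSS the step is recorded in the script rather than applied destructively, which is what makes it easy to revisit later.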

At this point, the rows are still visible but appear struck out. You are in the “Step preview” mode. In this mode, you can see what the step does.

If you want to really see your dataset as it will appear after processing, click on the Disable preview button in the top bar.

Screenshot of the preview of the dataset

Some rows also have their department marked as invalid (the red portion of the gauge).

Let’s find out which values are considered as invalid. Click on the header of the column department again and on Filter.

Screenshot of the analyze window with filtering opened

A column filter has been opened. Note that nothing has changed in the Script tab: these filters are not steps, and they do not change the data. They only filter the data displayed in the tabular view, to let you focus on interesting things.

Click on the “NOK” checkbox to filter on invalid rows.

Screenshot of the filtering window with the NOK box checked

Since we now only see invalid rows, the histogram disappears. Click anywhere to close the filter overlay. We can now see what data exactly is invalid. A red filter icon on the department column header reminds you that the displayed data is currently filtered on this column.

You can also see that the “matching rows” counter on the right has updated, indicating that out of the 10,000 rows in the table, only 14 match the filter.

Screenshot of the analyze window after filtering
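The “NOK” filter is essentially a validity check against the detected Integer meaning. A rough pandas equivalent (hypothetical data; DSS meanings are richer than this one regex):

```python
import pandas as pd

df = pd.DataFrame({"department": ["75", "2A", "13", "2B"]})

# Keep only rows whose value does NOT parse as a plain integer,
# mimicking the "NOK" checkbox in the column filter.
invalid = df[~df["department"].str.fullmatch(r"\d+")]
```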

In our case, a department is a French administrative division. Its code is typically a 2-digit number, but Corsica’s two departments have the codes 2A and 2B. It therefore makes sense for these values to look invalid for an Integer meaning, but we will change the meaning to indicate that this column is really textual. Now that we have found out the nature of the invalid values, let us remove the filter on the department column by clicking either the red filter icon in the header or the remove cross in the element in the Filter tab.

Let’s force the meaning of the column as Text. Click on the blue Integer display, and in the popup, select Text.
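Forcing the meaning to Text is the same idea as reading a column as strings rather than letting a parser guess integers. A small pandas sketch with invented data shows why it matters: text keeps “2A” valid, and it also preserves leading zeros such as “01” that an integer parse would drop.

```python
import io
import pandas as pd

raw = "order_id,department\n1,01\n2,2A\n3,75\n"

# Forcing the department column to text: "2A" stays valid and the
# leading zero in "01" is preserved.
df = pd.read_csv(io.StringIO(raw), dtype={"department": str})
```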

Screenshot of the auto-detect menu setting value to text

Let’s now check out the category column, which is the one furthest to the right. Click on its column header and select Analyse to get an overview of the distribution of the values in the column.

Screenshot of the distinct values window for the category column

A popup opens containing information about the proportion of all of the categories.

It seems that one of the categories has been misspelled several times: Bk TShirt M surely stands for Black T-Shirt M. Let’s merge these two categories into one. Select the two categories to merge by clicking the checkboxes on their left, then click Mass Actions > Merge selected.

The label offered for the newly merged category is Black T-Shirt M, which is perfect. Let’s confirm this operation by clicking the Merge button.

Screenshot of the distinct values window merging 2 values

The two categories have been merged together as expected! Let’s now close the analysis window. Our merge operation now appears as a processing step.
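Conceptually, the merge is a value-remapping step: every occurrence of the misspelled label is folded into the canonical one. A hedged pandas equivalent, on made-up rows:

```python
import pandas as pd

sales = pd.DataFrame({
    "category": ["Black T-Shirt M", "Bk TShirt M",
                 "Hoodie", "Bk TShirt M"],
})

# Equivalent of the "Merge selected" mass action: map the
# misspelled label onto the canonical category.
sales["category"] = sales["category"].replace(
    {"Bk TShirt M": "Black T-Shirt M"}
)
```

In DSS this mapping is recorded as a processing step, so it will be re-applied automatically when the script runs on the full dataset.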

Script and Processors

When using the Script tab of the Analysis, we are building a sequence of actions that are registered in the script. Each step in the sequence is called a processor and reflects a single data transformation.

The original data is never modified in the Analyse module; you are visualizing the changes on a sample of your data (10,000 rows by default).

To apply the transformations to your whole dataset and create an output dataset with the cleaned data, you'll have to deploy the script as we will see later on.

The Script tab has many benefits:

  • First, it is like an Undo menu on steroids. You can modify or delete a step that you added earlier.
  • Second, it is a history of actions that tells you how a dataset is being modified for future reference.
  • Last but not least, the script can then be used to create a new dataset with all the data transformations.

You will learn more about the power of processors in the Tutorial 102: Clean and enrich

Visualize your data with charts

Because Visualization is often the key to exploring your data and getting valuable business insights, we will now build a chart on top of the cleaned data we will have after we deploy our script.

In our example, we want to compare the success of the t-shirts sales with their prices.

The visualization screen to build charts is available by clicking on the Charts tab.

Screenshot of the cleaned script output with arrow pointing to the charts button

First, we’re going to plot a histogram of the number of items sold per category. Drag the box named category from the left pane to the By drop area, and drag the box named nb_tshirts from the left pane to the Show drop area.

Screenshot of the dragging and dropping of values to create a chart

Our first chart appears, but that’s not exactly what we aimed for. This plot shows the average number of items sold per purchase (indeed, the Y axis variable summary says Avg of nb_tshirts). We would like the total number of items sold per category. To correct this, click on the nb_tshirts (AVG) label and change the aggregation function from AVG to SUM.
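The difference between the two aggregations is easy to see in a pandas sketch (hypothetical numbers): AVG answers “how many items per purchase, on average?”, while SUM answers “how many items sold in total?”, which is what we want the chart to show.

```python
import pandas as pd

sales = pd.DataFrame({
    "category":   ["Hoodie", "Hoodie", "Black T-Shirt M"],
    "nb_tshirts": [2, 4, 5],
})

# AVG: mean number of items per purchase, per category.
avg_per_category = sales.groupby("category")["nb_tshirts"].mean()

# SUM: total number of items sold, per category.
sum_per_category = sales.groupby("category")["nb_tshirts"].sum()
```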

Screenshot of setting y axis value to SUM

By default, categories are ordered alphabetically. You can change this by clicking on category in the X axis zone and selecting, for instance, decreasing number of sold t-shirts:

Screenshot of setting x axis sorting to descending (sum of sold t-shirts)

Since we want to focus on t-shirts only, let us remove the hoodies from the chart. Drag the category variable from the left side to the Filter zone on the right side and uncheck the Hoodie checkbox.

Screenshot of the chart window applying filters

We are going to split each bar depending on the T-shirt price. Drag the tshirt_price variable from the left side to the And label just beside the category variable.

Screenshot of the chart window adding a dimension to the x axis

Our resulting plot isn’t easy to interpret. Let us change the plot type: click on the chart type selector on the top left and choose Stacked.

Screenshot of the select chart type window

We’re almost done. Instead of using ranges of prices, let us use the raw prices of the t-shirts. Click on tshirt_price in the Color zone and set Binning to None, use raw values.
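Binning versus raw values is the familiar choice between grouping a numeric column into ranges or treating each distinct value on its own. A small pandas illustration with invented prices:

```python
import pandas as pd

prices = pd.Series([9.95, 14.95, 14.95, 24.95])

# Binned: each price falls into a range, e.g. (0, 10], (10, 20]...
# (the bin edges here are arbitrary, for illustration only).
binned = pd.cut(prices, bins=[0, 10, 20, 30])

# Raw values: each distinct price is its own group/color.
raw_counts = prices.value_counts()
```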

Screenshot of setting chart binning to use raw values

We now have the graph we were aiming for!

Screenshot of the finished graph


Charts can help you further explore your data and uncover potential correlations, a first step before you build your predictive model. They give you an alternative view to the tabular output of a script.

While you work with charts in the Analyse module, keep in mind that they are built on a sample of your data. If you want to derive business insights from them to share with your team, you will first need to deploy your script, then check out the documentation on graphs.

Apply transformations to the whole original dataset

Now that we’ve finished cleaning our data and exploring it with Charts on a sample, let’s apply these changes to our whole dataset and create a cleaned output dataset.

To do this, go to the top right corner of the screen and click on Deploy Script.

Screenshot of chart window with arrow pointing to deploy script

A popup appears to Deploy the Script as a Preparation Recipe. In Dataiku DSS, the process that takes an input dataset, applies a script on it, and produces a new dataset is called a Recipe. Here you should tick the box to Build a new dataset from the transformations made on your original data. Then click on Deploy and Build to build your new dataset.

Screenshot of the deploy as preparation recipe window

Once you’ve done that, you’re taken to your project Flow while the new dataset is built. This is where you can visualize the different steps of your project. You can see the Preparation Recipe you just created, represented by a broom, and the new output dataset.

Screenshot of the project flow

Learn more

Congratulations, you have completed your first Dataiku DSS tutorial!

Proceed to Tutorial 102: Clean and enrich to work on your flow and learn more about the power of preparation scripts and processors, or to Tutorial 103: Your first machine learning model to see how Dataiku DSS can help you quickly build predictive analytics!

You can also explore the User Guide for a deeper dive.