Dear user, welcome to the first Dataiku DSS tutorial!
The first step of any typical data science project is to load and explore datasets. In this tutorial, we will look into how to do that in Dataiku DSS.
We will learn step by step how to:
- load a new file
- explore its content
- clean it up
- visualize data and build charts
On our way through this hands-on tutorial, we will go through some very important concepts in DSS:
- the Dataset
- the Lab module (particularly the Visual analysis)
- the Charts builder
You can read along for a written version of the tutorial, or watch this 20 minute video to help you with your first steps in the Studio.
Let’s get started!
In this tutorial, we will analyze the order log of a fictional T-shirt making company called “Haiku T-Shirt”.
Create your project
If you have not done so yet, start the tutorial within DSS by clicking on the “Tutorials” button in the left pane, then select Tutorial 101.
You are now on your project home page. Let’s start by loading our first Dataset in DSS! Click on the Import your first dataset button:
DSS will let you connect to a wide variety of data sources. For this tutorial, we’ll upload a file from your own computer to DSS.
First of all, start by downloading the orders CSV file.
Once you have the file on your computer, click on Upload your files.
You are presented with the Upload dialog. Click on Add a file, select the file you just downloaded, and validate.
The file is uploaded to DSS. Let’s now check if DSS detected our CSV format correctly by clicking on the Preview button:
The import is almost perfect. The CSV format has been detected, with a Tab separator. You can see the data is in a tabular format, with columns (called features) and lines (which represent observations). One thing is wrong with our dataset though: apparently the file contained a blank line between the header and the data. Enter 1 in the Skip next lines field to sort this out:
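DSS handles all of this through the import settings, but for illustration the same parsing choices can be sketched in pandas (the inline data below is a made-up miniature of the orders file, not the real one):

```python
import io
import pandas as pd

# A tiny stand-in for the orders file: tab-separated, with a blank
# line between the header and the data, as in the tutorial.
raw = "order_id\tdepartment\tnb_tshirts\n\n1\t75\t2\n2\t2A\t1\n"

# sep="\t" mirrors the Tab separator DSS detected; skiprows=[1]
# mirrors "Skip next lines: 1" by dropping the stray blank line.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=[1])
```

After parsing, `df` holds two clean observations with the three expected columns.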
We can now give our new dataset a name in the field at the top of the screen. Finally, we need to save our work by either hitting the Create button or using the Ctrl-S shortcut.
In Dataiku DSS, a Dataset is any piece of data that you have which is in tabular format. A CSV file like haiku_shirt_sales.csv is a dataset. An Excel file is also a dataset.
More generally, companies (and people) have systems to store all their data. They can store it in an Excel file, in a relational database, or in distributed storage systems if they have larger amounts of data.
Most of the time, importing a dataset means that you merely inform Dataiku DSS of how it can access the dataset.
An external dataset remembers the location of the original data. Dataiku DSS can be set to connect to this data so that whenever the original data changes, the corresponding dataset in the Studio is automatically updated.
For more information about datasets, check out the concepts documentation.
Explore your data
Your dataset has been created, and you are now taken to a tabular view of your data where you can start exploring it.
For each column, you should see that Dataiku DSS has detected a meaning, shown in blue (in our case Text, Number, or Date (unparsed)). A gauge indicates the fraction of the column's values that do not seem to match the meaning (in red) or are completely missing (blank). In our dataset, for example, the department column has empty values as well as invalid data.
The Data Scientists’ Murphy’s law states that real-world data is never in the right format. Cleaning it up usually involves a chain of scripts full of business logic, which are difficult to maintain. Sadly, a large part of a Data Scientist's job is cleaning up data. Dataiku DSS has a dedicated tool to make this task more user-friendly.
Let’s get started with data manipulation: click on Lab.
The Lab window opens. The lab is where you will iteratively work on your dataset to get further into it. In this tutorial, we are going to use the Visual analysis part of the Lab.
Click on the New button below Visual analysis. You will be prompted to specify a name for your analysis. Let’s leave the default name for now:
You are now in the Visual Analysis, one of the central places of DSS where you can:
- Clean and enrich your data
- Build charts on your enriched data
- Build machine learning models on your enriched data.
You will usually need to get a sense of the content of new datasets. DSS provides a set of useful features for this. For instance, if you click on “Analyse” in the menu available from the column headers, you will get instant access to summary statistics:
Since analyzing each column one by one can get tedious with large datasets, DSS has a very handy Quick Columns view mode (last button on the right) that lets you display these statistics for several columns at a time:
Note that the statistics depend on the nature of the data:
- average, min, max… for numerical and date columns
- frequencies for textual columns
You can switch between columns using the chevrons at the top of the Analyze modal.
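For illustration only, the kind of per-column statistics the Analyze modal shows could be approximated in pandas, where the aggregation also depends on the column's nature (the toy data below is invented):

```python
import pandas as pd

# Toy sample mimicking two columns of the orders dataset.
df = pd.DataFrame({
    "nb_tshirts": [2, 1, 3, 1],
    "category": ["Black T-Shirt M", "White T-Shirt M",
                 "Black T-Shirt M", "Hoodie"],
})

# Numerical column: average, min, max, as in the Analyze modal.
stats = df["nb_tshirts"].agg(["mean", "min", "max"])

# Textual column: value frequencies instead.
freqs = df["category"].value_counts()
```

`stats` holds the summary numbers for the numerical column, while `freqs` counts occurrences of each category.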
While exploring your dataset, you may have noticed some potential issues. For instance, we saw earlier that the department column has some missing values.
These missing values can pollute the analysis, so you may then want to remove them. Click on the menu from the header of the column, click on “More”, then select “Remove rows with no value”:
Notice that the removal action has appeared in the Script tab on the left. You just added the first Step to your script.
At this point, the rows are still visible but appear struck out. You are in the “Step preview” mode. In this mode, you can see what the step does.
If you want to really see your dataset as it will appear after processing, click on the Disable preview button in the top bar.
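The “Remove rows with no value” step is, conceptually, a null filter on one column. As a hedged sketch in pandas (toy data, not the real file):

```python
import pandas as pd

# Toy sample with a missing department value.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "department": ["75", None, "2A"],
})

# "Remove rows with no value" on the department column is
# equivalent to dropping rows where that column is null.
cleaned = df.dropna(subset=["department"])
```

The row with the missing department is gone from `cleaned`; the original `df` is untouched, just as the Analysis never modifies your source data.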
Some rows also have their department marked as invalid (the red part of the gauge).
Let’s find out which values are considered as invalid. Click on the header of the column department again and on Filter.
A Column filter has been opened. Note that nothing has changed in the Script tab. These filters are not steps: they do not change the data, they only filter what is displayed in the tabular view, letting you focus on interesting things.
Click on the “NOK” checkbox to filter on invalid rows.
Since we now only see invalid rows, the histogram disappears. Click anywhere to close the filter overlay. We can now see what data exactly is invalid. A red filter icon on the department column header reminds you that the displayed data is currently filtered on this column.
You can also see that the “matching rows” counter on the right has updated, indicating that out of the 10,000 rows in the table, only 14 match the filter.
In our case, a department is a French administrative division. Its code is typically a 2-digit number, but Corsica’s two departments have the codes 2A and 2B. It therefore makes sense for these values to look invalid for an Integer, but we will change the meaning to indicate that this column is really textual. Now that we have found out the nature of the invalid values, let us remove the filter on the department column by clicking either on the red filter icon in the header or on the remove cross in the Filter tab.
Let’s force the meaning of the column as Text. Click on the blue Integer display, and in the popup, select Text.
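To see why the Text meaning fits better, here is an illustrative pandas sketch of what happens when Corsica's codes meet an integer interpretation (values invented for the example):

```python
import pandas as pd

departments = pd.Series(["75", "2A", "2B", "13"])

# Reading the column as numbers flags Corsica's 2A and 2B
# as invalid (they become NaN)...
as_numbers = pd.to_numeric(departments, errors="coerce")
n_invalid = int(as_numbers.isna().sum())

# ...so the column is best kept with a textual dtype,
# the equivalent of forcing the meaning to Text in DSS.
as_text = departments.astype("string")
```

Two of the four codes fail numeric parsing, while the textual version keeps every value intact.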
Let’s now check out the category column, the one furthest to the right. Click on its column header and select Analyse to get an overview of the distribution of the column's values.
A popup opens containing information about the proportion of each category.
It seems that one of the categories has been misspelled several times. Surely Bk TShirt M stands for Black T-Shirt M. Let’s merge these two categories into one. Select the two categories by clicking on the checkboxes on their left, then click on Mass Actions > Merge selected.
The label offered for the newly merged category is Black T-Shirt M, which is perfect. Let’s confirm this operation by clicking on the Merge button.
The two categories have been merged together as expected! Let’s now close the analysis window. Our merge operation now appears as a processing step.
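Under the hood, merging two category labels boils down to a value replacement. A minimal pandas sketch of the same idea (toy values):

```python
import pandas as pd

categories = pd.Series(
    ["Black T-Shirt M", "Bk TShirt M", "White T-Shirt M", "Bk TShirt M"]
)

# Merging the misspelled label into the canonical one is a
# simple value replacement, like the Merge action in DSS.
merged = categories.replace({"Bk TShirt M": "Black T-Shirt M"})
```

After the replacement, every occurrence of the misspelled label has been folded into Black T-Shirt M.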
Script and Processors
When using the Script tab of the Analysis, we are building a sequence of actions that are registered in the script. Each step in the sequence is called a processor and reflects a single data transformation.
The original data is never modified in the Analyse module; you are only visualizing the changes on a sample of your data (10,000 lines by default).
To apply the transformations to your whole dataset and create an output dataset with the cleaned data, you'll have to deploy the script as we will see later on.
The Script tab has many benefits:
- First, it is like an undo history on steroids. You can modify or delete a step that you added earlier.
- Second, it is a history of actions that tells you how a dataset is being modified for future reference.
- Last but not least, the script can then be used to create a new dataset with all the data transformations.
You will learn more about the power of processors in Tutorial 102: Clean and enrich.
Because visualization is often the key to exploring your data and getting valuable business insights, we will now build a chart on top of the cleaned data we will have after deploying our script.
In our example, we want to compare the success of the t-shirts sales with their prices.
The visualization screen to build charts is available by clicking on the Charts tab.
First, we’re going to plot a histogram of the number of items sold per category. Drag the box named category from the left pane to the By drop area, and drag the box for the number of t-shirts sold from the left pane to the Show drop area.
Our first chart appears, but that’s not exactly what we aimed for. This plot shows the average number of items sold per purchase (indeed, the Y axis summary says Avg of nb_tshirts). We would like the total number of items sold per category. To correct this, click on the nb_tshirts (AVG) label and change the aggregation function from AVG to SUM.
By default, categories are ordered alphabetically. You can change this by clicking on category in the X axis zone and selecting, for instance, ordering by decreasing number of sold t-shirts:
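The corrected chart computes a grouped sum ordered by decreasing total, which could be sketched in pandas like this (invented toy data):

```python
import pandas as pd

# Toy order log: one row per purchase.
df = pd.DataFrame({
    "category": ["Black T-Shirt M", "White T-Shirt M",
                 "Black T-Shirt M", "Hoodie"],
    "nb_tshirts": [2, 1, 3, 4],
})

# SUM of items per category, ordered by decreasing total,
# like the corrected bar chart.
totals = (
    df.groupby("category")["nb_tshirts"]
    .sum()
    .sort_values(ascending=False)
)
```

`totals` lists each category with its total item count, largest first, which is exactly what each bar of the chart represents.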
Since we want to focus on t-shirts only, let us remove the hoodies from the chart. Drag the category variable from the left side to the Filter zone on the right side and uncheck the hoodies value.
We are now going to split each bar depending on the t-shirt price. Drag the tshirt_price variable from the left side onto the And label; tshirt_price is added to the Color zone.
Our resulting plot isn’t easy to interpret. Let us change the plot type: click on the chart type selector at the top left and choose Stacked.
We’re almost done. Instead of using ranges of prices, let us use the raw prices of the t-shirts. Click on tshirt_price in the Color zone and set Binning to “None, use raw values”.
We now have the graph we were aiming for!
Charts help you further explore your data and uncover potential correlations, a first step before building your predictive model. They give you an alternative to the tabular view of a script's output.
While you work on charts in the Analyse module, keep in mind that they are built on a sample of your data. If you want to extract business insights to share with your team, you will first need to deploy your script; then check out the documentation on graphs.
Apply transformations to the whole original dataset
Now that we’ve finished cleaning our data and exploring it with charts on a sample, let’s apply these changes to the whole dataset and create a cleaned output dataset.
To do this, go to the top right corner of the screen and click on Deploy Script.
A popup appears to deploy the script as a Preparation Recipe. In Dataiku DSS, the process that takes an input dataset, applies a script to it, and produces a new dataset is called a Recipe. Here you should tick the box to build a new dataset from the transformations made on your original data. Then click on Deploy and Build to build your new dataset.
Once you’ve done that, you’re taken to your project Flow while the new dataset is built. This is where you can visualize the different steps of your project. You can see the Preparation Recipe you just created, represented by a broom, and the new output dataset.
Congratulations, you have completed your first Dataiku DSS tutorial!
Proceed to Tutorial 102: Clean and enrich to work on your flow and learn more about the power of preparation scripts and processors, or Tutorial 103: Your first machine learning model to see how Dataiku DSS can help you quickly build predictive analytics!
You can also explore the User Guide for a deeper dive.