This is the first in a series of Dataiku Data Science Studio (DSS) tutorials, designed to provide a hands-on overview of the main Dataiku DSS concepts.
Throughout the tutorials, we will work with a fictional online t-shirt retailer called Haiku T-Shirt, and use their enterprise data to illustrate the basic steps to working with Dataiku.
In this tutorial, we will load Haiku T-Shirt’s order log into Dataiku DSS, perform a cursory review of the data, and apply the screening rules identified in review to create an updated dataset.
In this tutorial, we’ll upload a file from your own computer to Dataiku DSS.
Start by downloading the orders CSV file.
Create your project
The first step is to create a new Dataiku DSS Project. Click on the Tutorials button in the left pane, and select Tutorial: Basics. This creates the new project and lands you on the project home page.
Key concept: project
A Dataiku DSS project is a container for all your work on a particular activity. The project home acts as the command center from which you can see the overall status of a project, view recent activity, and collaborate through comments, tags, and a project to-do list.
Create the dataset
Let’s load our first Dataset! In the project home, click Import your first dataset.
Dataiku DSS lets you connect to a wide variety of data sources. For this tutorial, we’ll upload to DSS the orders file you just downloaded from Dataiku’s website. Click on Upload your files.
You are presented with the Upload dialog. Click on Add a file, select the orders.csv file you downloaded, and validate. The file is uploaded.
Let’s now check if Dataiku DSS detected our CSV format correctly by clicking on the Preview button.
Dataiku DSS automatically detects that the file has delimiter-separated values with a comma separator. You can see the data is in a tabular format, with columns (features) and rows (records or observations).
Dataiku DSS provides the default dataset name orders in the field on top of the screen. Since that’s OK for us, save your work by either hitting the Create button or using the shortcut Ctrl-S. This creates the new dataset and lands you on the Explore page for the orders dataset.
Key concept: dataset
In Dataiku DSS, a Dataset is any piece of data that you have, and which is of a tabular nature. A CSV file like
orders.csv is a dataset. A sheet in an Excel file is also a dataset.
More generally, companies (and people) have systems to store all their data. They can store it in an Excel file, or in a relational database or distributed storage systems if they have larger amounts of data.
Most of the time, creating a dataset means that you merely inform Dataiku DSS of how it can access the data. These external or source datasets remember the location of the original data. The data is not copied into DSS. The dataset in DSS is a view of the data in the original system.
The uploaded files dataset we just created is a bit specific, because in that precise case, the data is copied into DSS (since we don't have another database to host it yet).
For more information about datasets, check out the concepts.
Explore your data
The Explore page of a dataset provides a tabular view of your data where you can start to examine it.
When looking at or working interactively with your data, Dataiku DSS only shows a sample of it. This sample makes it possible to display your dataset, even if the data itself is huge. You control the size of the sample and the sampling method (should Dataiku DSS simply grab the first records, which is fast but can result in a biased sample, or should it extract a random sample?). To see these settings, click on Configure sample or on the lapel that appears on the left.
Storage type and meaning
Each column is identified by name, and has a storage type that indicates how the original data is stored (a string, an integer, a decimal number, a date…). For a CSV file, all of the storage types are string (because it’s just text).
Dataiku DSS also detects a meaning, in blue, which is a “semantic” type that reflects the nature of the values of the column. Dataiku DSS has many built-in meanings such as integer, US states, IP addresses, emails, etc. Here, Dataiku DSS has inferred that the tshirt_price column values are of meaning Decimal and that the tshirt_quantity column values are of meaning Integer. The meaning of a column affects whether Dataiku DSS considers a value to be valid for that column. Below the meaning is a gauge that indicates the proportion of column values that do not seem to match the meaning (in red) or are completely missing (in grey).
Dataiku DSS detects a meaning of Integer for customer_id, based upon the fact that most of values of customer_id are integers. The gauge shows red for the few values that do not match this meaning, which allows us to determine whether these values are truly invalid customer ID’s, or, as is the case here, Integer is too restrictive a meaning for customer_id. Click on the meaning and select Text to update it. Now the gauge for customer_id is entirely green.
You can use Charts to explore the input dataset; for example, we might want to know how often each type of t-shirt is ordered:
- Click on the Charts tab.
- Drag and drop Count of records as the Y variable.
- Drag and drop tshirt_category as the X variable.
DSS shows a column chart of count of records by tshirt_category.
The chart reveals that the values of tshirt_category are not consistently recorded. Sometimes black shirt color is recorded as “Black”, and sometimes as “Bl”. Similarly, white shirts are sometimes recorded as “White” and sometimes as “Wh”.
Let’s handle these issues by preparing the data with a recipe.
Prepare data with a visual recipe
A video below goes through the steps described in the following section.
Key concept: recipe
A Dataiku DSS recipe is a set of actions to perform on one or more input datasets, resulting in one or more output datasets. A recipe can be visual or code.
A visual recipe allows a quick and interactive transformation of the input dataset through a number of prepackaged operations available in a visual interface. A code recipe allows a user with coding skills to go beyond visual recipe functionality to take control of the transformation using any supported language (SQL, Python, R, etc).
Dataiku allows "coders" and "clickers" to seamlessly collaborate on the same project through code and visual recipes.
Click on the Actions button at the top right of the orders dataset and choose Prepare.
The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset. The types of steps you can add to a Prepare recipe are wide-ranging and powerful.
When creating a recipe, you need to provide the input dataset (the one with the initial columns and values) and an output dataset (the one with the prepared columns and values). You can also set the “Store into” to decide where the output data will live. In this example, the output is written on the local filesystem, but the output could be written to a relational database or a distributed filesystem, if the infrastructure exists. Dataiku DSS provides a default output dataset name of orders_prepared. Click Create recipe.
In order to standardize the coding of tshirt_category, we will recode the values. Click on the column name, which opens a dropdown and select Analyze. The analyze box opens. It allows you to view a quick summary of the data in the column (for the sample of data that is being displayed by DSS). The analyze box also gives you the ability to perform various data cleansing actions.
Select White T-Shirt M and Wh Tshirt M, and from the Mass Actions dropdown choose Merge selected. Choose to replace the values with “White T-Shirt M” and click Merge. Repeat this process for other categories as needed.
Close the Analyze dialog and see that a Replace step has been added to the Prepare script. You just added the first Step to your script.
The Replace step affects 517 rows in the sample, and replaces values “Wh Tshirt M” with “White T-Shirt M”, “Bl Tshirt M” with “Black T-Shirt M”, “Wh Tshirt F” with “White T-Shirt F”, and “Bl Tshirt F” with “Black T-Shirt F” of tshirt_category in place. You can create this step explicitly in the script, but the Analyze dialog provides a quick and intuitive shortcut to build the step.
Key concept: preparation script
When using a preparation recipe, you are building a sequence of actions, or steps, that are registered in the script. Each step in the sequence is called a processor and reflects a single data transformation.
The original data is never modified, but you are visualizing the changes on a sample of your data (10 000 lines by default).
To apply the transformations to your whole dataset and create the output dataset with the cleaned data, you'll have to Run the recipe as we will see later on.
A preparation script has many benefits:
- First, it is like a Cancel menu on steroids. You can modify/delete any step that you added earlier.
- Second, it is a history of actions that tells you how a dataset is being modified for future reference.
You will learn more about the power of processors in the Tutorial: from Lab to Flow
At this point, you will see some values in yellow. This is because you are in the “Step preview” mode. In this mode, you can see what the step changes. The values in yellow are the ones that were modified by our Replace step.
If you want to see your data as it will appear after processing, click on the Disable preview (green eye) button in the top bar.
Now, let’s deal with the order_date. Input dates are unparsed, and need to be processed (“parsed”) if we want to use them. Dates are notoriously difficult to work with, but DSS makes it very easy.
Open the column dropdown and select Parse date. The Smart Date dialog opens, and shows you the most likely formats for your dates and how the dates would look like once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format. Select this format, click OK and see that a Parse date step has been added to the Prepare script.
By default, Parse date creates a new column order_date_parsed. Since we’d rather want the parsed date to stay in the order_date column, let’s clear the Output column text box in the script step, which has the effect of parsing dates in place.
Once your date column is parsed, the column dropdown offers new operations like:
- Extract date elements: year, month, day, day of week, week of year…
- Compute time since a date, another columns, today.
- Flag holidays.
Finally, let’s compute the value of each order. The orders dataset includes the number of t-shirts in each order and the price per t-shirt, but we need to compute the total of each order so that we can analyze the value of each order. We are going to use a Formula step for this. DSS formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.
This time, we will not add the step by clicking on a column header, but using the processors library which references all 80+ data preparation processors:
- Click the yellow Add a New Step button in the left bar
- Select Formula (you can search for it)
totalas the name of the new column.
- In the expression, type
tshirt_price * tshirt_quantity(you can also click Edit to bring up the advanced formula editor, which will autocomplete column names)
- Click anywhere and see the new total column appearing
We no longer need the price and quantity, so we can delete those columns.
- Click on the column header
- Choose Delete
Recall that the data visible in the recipe is merely a sample, meant to give you immediate visual feedback on the design of your Prepare script. With the Prepare script complete, you must now Run the recipe on the whole input dataset. Click Run in the lower left. Dataiku DSS uses its own engine for this recipe runtime, but depending upon your infrastructure and the type of recipe, you can choose where the computation takes place.
You will also be asked to update the schema. A dataset’s schema is a list of the columns, plus their storage type and meaning. When you initially created this recipe, the default output schema was identical to the input schema because there were no steps in the recipe. You have since created the column total and removed the columns tshirt_price and tshirt_quantity, so you need to allow Dataiku DSS to update the schema.
When the job completes, click Explore dataset orders_prepared to visit the output dataset.
In the prepare recipe, DSS automatically selected the storage type of columns according to their detected meanings; thus pages_visited was written as a bigint and total as a double in the output dataset.
A natural question at this point is which categories of t-shirt are most popular? We can begin to answer this with the cleaned data and a simple bar chart. Click the Charts tab. Select total as the Y variable and tshirt_category as the X variable.
The resulting bar chart shows that, on average, orders for tennis shirts are larger than orders for other t-shirts. However, what about overall customer spending?
Open the total dropdown and choose SUM as the aggregation. This changes our perspective. While each order for tennis shirts is relatively valuable, there are relatively few orders for tennis shirts. Hoodies and male t-shirts are more commonly ordered, and the total value of those orders exceeds that of tennis shirts.
Our ultimate goal is to understand Haiku T-Shirts’ customers; to that end we want to group all past orders by customer, aggregating their past interactions.
From the actions menu of the orders_prepared dataset, choose Group in the list of Visual recipes. The Group recipe allows you to aggregate the values of some columns by the values of one or more “keys”.
Choose to group by customer_id. The default name of the new dataset is not very descriptive, so change it to
orders_by_customer. Click Create Recipe.
The Group recipe has several steps (on the left). The core step is the Group step, where you choose which columns to aggregate and what aggregations you want performed.
Some columns, like order_id, we won’t need in the new dataset. For the others, make the following selections:
- order_date: Min
- pages_visited: Avg
- total: Sum
This will give us the first date at which a customer ordered (minimum of order_date), the average number of visited pages by visit, and the sum of all orders of the customer.
In the Output step, you can select to rename the output columns. Let’s rename order_date_min to
Click Run to create the new grouped dataset, then Explore the new dataset once run is complete.
Explore the Flow
Click on the flow icon next to the project name to go to the project Flow. The flow is the visual representation of your project pipeline. You can see the Preparation recipe, represented by a broom, followed by its output dataset, followed by the Group recipe, followed by its output dataset.
Congratulations, you have completed your first Dataiku DSS tutorial!
Proceed to Tutorial: From Lab to Flow to work on your flow and learn more about the power of preparation scripts and processors!