Welcome to the second Dataiku DSS tutorial!
In this tutorial, we will learn
- how to enrich your data
- how to manage multiple datasets and combine their data
- how to create a new dataset based on script transformations
As we go through this project, we will also touch on the following concepts of Dataiku DSS:
- advanced processors
- recipes (a process to transform data from one or several datasets to create new datasets)
- updating datasets when your original source data has changed
Let's get started!
We recommend that you first complete the Tutorial 101 before beginning this one, to be sure you know the basics of how Dataiku DSS works.
If you have not already done it, start the DSS 102 tutorial directly from your Studio by clicking on the "Tutorials & Samples” button in the popup menu on the left. Then click on Tutorial 102. This will automatically create a new project and upload the data into Dataiku DSS!
To get started with your project, go ahead and check out the flow.
Remember? It’s this icon here in your navigation bar:
The DSS 102 tutorial project is preloaded with the haikushirtsales dataset that you recognize from the 101 Tutorial, as well as another dataset, customers, containing customer information.
Double click on the dataset to explore it.
It has three columns:
- department (french district)
In this part of the project, we’re going to enrich this dataset by:
- adding information about departments
- computing the age of each customer.
To start preparing your data, from the customers dataset screen, click on LAB to create a new analysis.
You can now go ahead and create a new Visual analysis by clicking on new, then Create (by leaving the default name as it is).
Now Let’s start cleaning our data!
Just like we did in our first tutorial, we need to change the meaning of the department column to Text, so we don’t have incorrect values for departments with letter codes.
Click on the data type under the column header (integer), and change it to text.
Next, we are going to use Open Data to enrich information about customers departments. Click on the add button in the Script tab.
The list of all available data manipulation processors appears. Click on the Enrich category and choose Enrich information about a French department.
A new processor has appeared in the Script tab. Define the parameters of the processor by choosing the department column and checking the box to get demography data. New columns with a yellow background should be visible now.
We only want to keep the population of the department in year 2009. Let's remove all the other extra columns. Switch to the Columns view and select the columns to be dropped:
Under the Actions menu, select Delete to actually delete the columns.
We are now going to compute the age of each customer.
Switch back to the tabular view. You can see the new step appear in the script tab to the left. Now let's continue by parsing the birth date of each customer. Click on the header of the column and select Parse date.
The studio automatically detects which date format fits the best for the values in your dataset. Strangely enough, the format is an English date format! Date processing is always full of surprises. Click OK.
A birth_parsed column has appeared with the standard timestamp format. We can now compute customers ages. Click on the column header and choose Compute time since.
Set that processor to compare date to now in years and set the output column to age. You're done!
We are pretty happy with our data enrichment so far. However, remember that what you work on in Lab is a visualisation of the effect of the different steps on a sample. To make it permanent and create a new dataset with your enriched data, you need to Deploy it now!
That's why, in this section, we are going to
- Compute the department population and age for all customers.
- Actually create a dataset with both information about customers and information about sales.
Working in RAM is crucial to have instantaneous visual feedback in the Explore screen as we are building up the data preparation script. For this reason, Dataiku DSS only loads a sample of the dataset...
But don't worry, when we run our recipe for good, Dataiku DSS will run your script on gigabytes of data in seconds with a very small memory footprint.
By default, only the first 10,000 records of your dataset are loaded. This can be configured in the Sampling tab.
Now, let’s create the actual dataset that comes out of the enrichment script we've just built.
The Recipe will enable us to compute enriched values for all customers and save the results in a new dataset.
Click on the button Deploy Script.
Fill the form by giving the name customers_prepared to the new dataset and click the Add button.
Like cooking, you need ingredients (our input dataset customers) and if you're a chef you'll produce out of these a proper meal! The meal is the output dataset customers_prepared.
Before we run our recipe, we can click on the cogwheel icon at the right of our Run button.
Here, you’ll see that we have a choice, when we change a recipe for instance, to only run that recipe, or rebuild missing dependencies for instance. Since we don’t have dependent datasets, we can stay with Run only this recipe.
Now, let's run our recipe by clicking on the Run button.
Your new dataset has been successfully built! (you know this because a notification popped up to tell you the status of your job) You can explore it by clicking on the link Explore dataset customers_prepared.
Your data has been enriched following the steps of your script. For each customer, we now have the population of his department as well as his age. We can now use this data elsewhere (which we will do in a minute).
For now, let’s have a look at what happened behind the recipe. Click on the Flow icon in the navigation bar:
This is a representations of the Datasets that are available in your current Project (Tutorial 102) within Dataiku DSS. It allows you to visualize how your datasets are processed and modified, from input datasets to output datasets. This Flow gives you an overview of your full project, with the datasets, and the recipes that modify them.
Click on the dataset haiku_shirt_sales. A list of possible actions will show up in the right panel. Click on LAB to create a new Visual Analysis recipe.
You can now analyze the dataset, and build a recipe on it.
Remember, this is the dataset from Tutorial 101. We are going to go through the same cleanup steps as we did in the previous tutorial. This is a great opportunity to exercise what you learned in the previous Tutorial.
Start by removing rows where the department feature is empty, and setting the meaning of department to Text. (If you need help, you can sneak a peak at Tutorial 101)
Okay, now we’re going to add more information to this dataset with our customer data. Click on the Add new steps button in the script tab, filter to find all the Join processors and select the Join with other dataset (memory-based) processor.
You can checkout what a join recipe is over here:
We are going to set the processor so that for each line in our sales dataset, we get the customer information (their age and department population) when the user_id matches.
Fill the form:
- Join column (here) is user_id,
- Dataset to Join with is customers_prepared,
- Join column (in other dataset) is also user_id,
- Columns to retrieve are department_population_2009 and age,
- Prefix is customer_.
When you scroll to the right, you’ll find the columns customer_department_population_2009 and customer_age have been added on the right.
Now that we’re happy with this preparation, let's export this script to a Recipe so we can compute the final dataset with sales information and customers information. Click on the Deploy Script button.
... and set the name of the New dataset to sales_and_customers and click Deploy.
A new recipe has been created. Now you can click on the Run button to build the new dataset!
When the building job is completed, click on the Actions button and choose View in Flow to see how this new step appears in the Flow.
You can see that the final output dataset sales_and_customers was built out of the two other datasets haiku_shirt_sales and customers_prepared.
Data-driven Flow reconstruction
Data flow is a very handy way to see how data has been processed in Dataiku DSS.
More specifically, the graph of the Flow has another benefit. When data changes over time, Dataiku DSS can update the datasets that would be affected by these changes. Because the flow reflects these dependencies, updates will only be run when necessary.
In other words, the Dataiku Science Studio is data driven: depending on the settings, the Studio will monitor the data changes and dependencies to keep everything up to date.
Congratulations, you have completed this second tutorial of Dataiku Data Science Studio.
Proceed to Tutorial 103 to see how Dataiku DSS can help you build quick predictive analytics!
If you want to dive deeper, you can also start exploring the concepts