Pivoting values, sometimes called unstacking, is the action of reshaping a dataset from long format to wide format.
Suppose we have a dataset that contains an aggregated order log from a retail company, and each row represents the total orders per country, per type of item sold.
If we want to examine the amount spent on each item, across countries, we could reorganize the dataset by pivoting the values as follows:
We’ll work with the Haiku T-shirt company’s Orders_by_Country_Category dataset, which you can download and import into a new project.
The original dataset contains a separate row per country, per t-shirt category.
We are interested in examining how sales of categories of shirts compare across countries; however, the current dataset is hard to read for this purpose. Instead, we want to pivot the dataset on the country, so that each row corresponds to a different t-shirt category, and each column records the amount spent in each country.
From the Actions menu of the Orders_by_Country_Category dataset, choose Pivot. Choose ip_address_country as the column to pivot by. Give the output dataset a descriptive name like
Pivot_by_Country and click Create Recipe.
In the Pivot recipe:
After clicking Run, the resulting dataset is ready to go. Note that because all categories of shirts have been sold in many countries, more than 20 countries are the “most frequently occurring”, and the Pivot recipe chooses 20 from among the ties.
Unlike most other recipes, the schema of the output dataset of a Pivot recipe is affected by the values of the input dataset. In this case, the number of columns in the output dataset was determined by the fact that there are more than 20 countries, so the number of columns is limited to the 20 most frequent. But what happens as new orders come in? Let’s go back to the Output step of the Pivot recipe.
The Pivot recipe is a two-pass algorithm that determines the values to pivot on in the first pass in order to construct the schema of the output dataset, and then uses the second pass to do the pivot. NOTE: If there are a lot of unique values in the column to pivot by, the number of columns in the output dataset can grow out of control, so take care to Analyze the column to pivot by and choose the number of values to pivot carefully. By default, the maximum number of columns created by the Pivot recipe is 200.
The schema of the Pivot_by_Country dataset was established the first time we ran the Pivot recipe. Rebuilding the dataset continues to use the same schema, performing a single pass on the input of the Pivot recipe – and ignoring any changes to which 20 countries are the most frequently occuring, unless we update the schema. There are two options for updating the schema:
If there are particular countries that we want to examine, we can provide an explicit list to the Pivot recipe.
Let’s go back to the Pivot step of the Pivot recipe. We can specify the list of values we want to pivot by selecting explicit list from the pivoted values dropdown and then adding each individual value. (Alternatively, if we want to include most of the column’s values, we can +Add Values from Dataset and then remove the undesired values from the list.)
Click Run, and the resulting dataset has the number of pivoted columns limited to those explicitly listed in the Pivot recipe. By providing an explicit list, the schema of the output dataset is also determined in advance, and only a single data pass is required to do the pivot.
You can examine a completed version of this Pivot recipe on the Dataiku gallery.
The Pivot recipe can also be used to create pivot tables. Pivot tables are similar to pivoting values, but populates the pivoted columns with aggregated, rather than individual, values.