Window functions are one of the most powerful features of SQL databases (and SQL-like systems: Hive, Impala, Spark, ...). They are also one of the least known and trickiest.
The Window visual recipe in DSS makes it extremely easy to leverage these features. Thanks to this, you'll be able to do things like:
- Filter rows by order of appearance within a group
- Compute moving averages, cumulative sums, ...
- Compute the number of events that occurred in the 7 days prior to another event
The PostgreSQL documentation has a very good introduction to the concept of Window analytics. Quoting:
Definition of a window function
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
In other words, unlike a grouping recipe, a window recipe does not reduce the number of rows in a dataset. It creates new columns in a dataset that are the results of computations that use all rows in a "window", i.e. a subset of all rows of the dataset.
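To make the distinction concrete, here is a minimal sketch in plain SQL, run through Python's built-in sqlite3 module (SQLite supports window functions since version 3.25). The table and values are illustrative toys, not the tutorial's data:

```python
import sqlite3

# Toy sales table; names are illustrative, not the tutorial's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER, grp TEXT, price REAL);
INSERT INTO sales VALUES
  (1, 'A', 100), (2, 'A', 200), (3, 'B', 50), (4, 'B', 150);
""")

# GROUP BY collapses rows: one output row per group.
grouped = conn.execute(
    "SELECT grp, AVG(price) FROM sales GROUP BY grp ORDER BY grp").fetchall()
# -> [('A', 150.0), ('B', 100.0)]

# A window function keeps every row and adds the group average alongside it.
windowed = conn.execute("""
    SELECT id, grp, price, AVG(price) OVER (PARTITION BY grp) AS grp_avg
    FROM sales ORDER BY id
""").fetchall()
# -> [(1, 'A', 100.0, 150.0), (2, 'A', 200.0, 150.0),
#     (3, 'B', 50.0, 100.0), (4, 'B', 150.0, 100.0)]
```

Note that the windowed query returns all four rows; the aggregate query returns only two.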
The window can be:
- partitioned: you define one or several columns, and one window is created for each combination of values of these columns.
- bounded: for each row, we can restrict the window to start:
  - at the beginning of the dataset (and, respectively, to end at its end)
  - at a given number of rows before the current row
  - within a given range of values before the current row's value in a given column
  - at the current row
The same bounding can be applied to the end of the window. In addition, a window can be ordered (this is actually required for bounding it).
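In SQL, ordering and bounding are expressed with an ORDER BY and a frame clause inside OVER. A small sketch via sqlite3 (toy data), bounding the window to "1 row before the current row .. current row":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (day INTEGER, amount REAL);
INSERT INTO t VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")

# Ordered window, bounded to at most two rows (the previous one and this one):
rows = conn.execute("""
    SELECT day,
           SUM(amount) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS pair_sum
    FROM t ORDER BY day
""").fetchall()
# -> [(1, 10.0), (2, 30.0), (3, 50.0), (4, 70.0)]
```

Each row's sum covers only itself and its predecessor, which is what "bounded" means in practice.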
Enough theory, let's see some examples!
The window recipe needs a compatible engine to operate. Compatible engines are:
- A SQL database among: PostgreSQL, Greenplum, Vertica, Oracle, SQLServer, Redshift
- Hadoop Hive
- Cloudera Impala
The rest of this tutorial will assume that you have a PostgreSQL dataset, but the instructions are similar for the other kinds of supported engines.
Therefore, before following this tutorial, please make sure that you have a working connection to PostgreSQL (or another supported database) defined in your DSS.
Please see our howto for that.
Starting the tutorial
This tutorial will guide you through window recipe using two datasets:
- an extract from the "Blue Book for Bulldozers" Kaggle competition data, where each row represents the sale of a used bulldozer
- an extract of a web log, containing events where users switch from one page to another
To follow the tutorial, open it from the DSS home page > Tutorials > Visual recipes.
The datasets are already available in the project, so you can go ahead and explore them.
As already mentioned, the window recipe only operates on SQL or HDFS datasets. Let's take both of our datasets and send them to the PostgreSQL connection.
The weblogs dataset
The weblogs dataset is already "clean" and properly typed, so we only need to Sync it: select it in the Flow, choose "Sync", make sure to select your PostgreSQL database in "Store into", Create the recipe and run it.
The bulldozer dataset
On this dataset, we've left you with a bit of work! We'll be interested in the "saledate" field, but it is not properly formatted. You'll need to use a Prepare recipe (with output in PostgreSQL), parse the saledate field, and run it. This will also have the effect of properly setting the type for all fields.
Your flow should look like this.
The bulldozer dataset contains one line per sale, with information about the sale, information about the bulldozer, the sale date and the sale price.
For this first example, we want to add a column to each line: the average price of all bulldozers of the same category "ProductGroup".
Let's create a new Window recipe from the bulldozer (in PostgreSQL) dataset.
We arrive on the first screen of the window recipe: the screen where you define the windows.
Conceptually, there are two steps when defining a window recipe:
- Defining the window(s): on which rows of the dataset will the computations apply?
- Defining the aggregation(s): what do we compute on this window?
Here, we want to add, for each row, information about all rows with the same ProductGroup, so we want to partition the window by ProductGroup, and neither bound it nor order it.
- Activate "Partitioning columns"
- Select the ProductGroup column
Let's now move on to Aggregations.
We want to Retrieve all columns of the dataset (ie, we keep all existing data), and add another column: the Avg of Saleprice. Check the Saleprice > Avg box.
You can now Run your recipe! Accept the schema change warning, and run. You can then explore the output dataset and go to the far right. You'll see your new SalePrice_avg column.
All rows in the beginning of the dataset have the same value. That's because, to compute your windowing, PostgreSQL has actually sorted the dataset by ProductGroup, so you see all bulldozers of a same ProductGroup sorted together, and see the same price everywhere.
You can verify that each time the ProductGroup changes, the average price changes.
Thanks to this new column, we could now compute, for example, the difference between the Average price of the ProductGroup and the SalePrice of an actual bulldozer, to get a notion of whether this particular bulldozer seems "overpriced".
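The SQL behind this step, and the "overpriced" difference it enables, can be sketched with sqlite3 on toy bulldozer data (column names mirror the tutorial's fields, values are made up):

```python
import sqlite3

# Toy bulldozer sales; values are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bulldozers (SalesID INTEGER, ProductGroup TEXT, SalePrice REAL);
INSERT INTO bulldozers VALUES
  (1, 'TTT', 60000), (2, 'TTT', 40000), (3, 'BL', 20000), (4, 'BL', 30000);
""")

# Average price per ProductGroup, plus how far each sale sits from it.
rows = conn.execute("""
    SELECT SalesID, ProductGroup, SalePrice,
           AVG(SalePrice) OVER (PARTITION BY ProductGroup) AS SalePrice_avg,
           SalePrice - AVG(SalePrice) OVER (PARTITION BY ProductGroup)
               AS over_avg
    FROM bulldozers ORDER BY SalesID
""").fetchall()
# SalesID 1 sells 10000 above its group's average,
# SalesID 3 sells 5000 below its group's average.
```

A positive `over_avg` suggests the bulldozer is "overpriced" relative to its category.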
Let's continue toying with this "overpriced" notion. I would like to know which bulldozers are the most expensive of their ProductGroup. In other words, for each ProductGroup, I would like to sort by SalePrice DESC and get only the first one. Window functions make it very easy!
Go back to your recipe, in the Window definition. This time, we'll want to order the Window by SalePrice decreasing.
Now go back to Aggregations. First, let's disable "Retrieve" on most columns to keep the output dataset small. Only keep Retrieve on "SalesID", "SalePrice" and "ProductGroup".
Let's now look at the top of the screen. In addition to computing aggregations based on the fields in the window, we have a few options to compute, for each row, some additional information that does not depend on the fields. The main one is "RANK".
RANK is the "index" of the current row within its window. In other words, since we have a window, partitioned by ProductGroup, and ordered by decreasing SalePrice, the most expensive bulldozer of each ProductGroup will have rank=1, the second most expensive of each group will have rank=2, ...
Enable "Rank" and run the recipe. As we can see, we now have a dataset sorted by ProductGroup and SalePrice, and the most expensive "BL" bulldozer has rank = 1.
At this point, we notice two important things:
First of all, the ranks are not contiguous. That's because of how ties are broken. There are three variants of rank:
- Rank will output 1,1,3,3,3,6
- Dense rank will output 1,1,2,2,2,3
- Row number will output 1,2,3,4,5,6 (ties are broken randomly)
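The three variants can be checked side by side with sqlite3 on a toy column containing two ties at the top and a three-way tie below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (price INTEGER);
INSERT INTO t VALUES (100), (100), (80), (80), (80), (60);
""")

# One named window shared by the three ranking functions.
rows = conn.execute("""
    SELECT price,
           RANK()       OVER w AS rnk,
           DENSE_RANK() OVER w AS dense_rnk,
           ROW_NUMBER() OVER w AS row_num
    FROM t
    WINDOW w AS (ORDER BY price DESC)
    ORDER BY row_num
""").fetchall()
# ranks:       1, 1, 3, 3, 3, 6
# dense ranks: 1, 1, 2, 2, 2, 3
# row numbers: 1, 2, 3, 4, 5, 6
```

RANK skips values after a tie, DENSE_RANK does not, and ROW_NUMBER ignores ties entirely.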
Strangely, the Average price now changes within a group! That's because as soon as you Sort a window, the definition of the window automatically changes and it becomes "bounded" between "beginning" and "current row". In other words, now, the Average price is computed for each bulldozer by only using the data of the more expensive ones (since the window is ordered by decreasing price).
We could cancel this effect by forcing an Unbounded/Unbounded window in the DSS UI. But for the moment, we are actually going to use this to compute cumulative sums.
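This frame-change behavior is standard SQL, not a DSS quirk, and can be demonstrated with sqlite3 on toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (price REAL);
INSERT INTO t VALUES (40), (30), (20), (10);
""")

rows = conn.execute("""
    SELECT price,
           -- ORDER BY alone: the frame defaults to "beginning .. current row",
           -- so this is a running average over the more expensive rows.
           AVG(price) OVER (ORDER BY price DESC) AS running_avg,
           -- An explicit unbounded frame restores the whole-window average.
           AVG(price) OVER (ORDER BY price DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
               AS full_avg
    FROM t ORDER BY price DESC
""").fetchall()
# running_avg: 40.0, 35.0, 30.0, 25.0 -- changes row by row
# full_avg:    25.0 on every row
```

The explicit Unbounded/Unbounded frame is exactly what the DSS UI option mentioned above generates.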
Go back to the recipe. We'd now like to know the proportion of the total sales made as time passes. In other words, we want to compute the cumulative distribution of the summed price.
We're no longer interested in ProductGroup; we want global information. So let's remove "Partitioning" from our window, and instead order the window by increasing saledate (i.e., in chronological order).
Now, let's check the Sum on SalePrice. Since the window is ordered, for each bulldozer this will compute the sum of the prices of all previous bulldozers plus the current one. In other words, it's a cumulative sum, exactly what we wanted.
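A sketch of the cumulative sum in sqlite3, with an extra column (not in the tutorial's recipe) dividing by the grand total to get the cumulative proportion of sales:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (saledate TEXT, SalePrice REAL);
INSERT INTO sales VALUES
  ('2001-01-01', 10), ('2001-02-01', 20),
  ('2001-03-01', 30), ('2001-04-01', 40);
""")

rows = conn.execute("""
    SELECT saledate,
           -- ordered window => frame defaults to beginning..current row
           SUM(SalePrice) OVER (ORDER BY saledate) AS cumulative,
           -- divide by the grand total (empty OVER () = whole dataset)
           SUM(SalePrice) OVER (ORDER BY saledate)
             / SUM(SalePrice) OVER () AS proportion
    FROM sales ORDER BY saledate
""").fetchall()
# cumulative: 10, 30, 60, 100 ; proportion reaches 1.0 on the last row
```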
To better see what's going on, also check "Retrieve" on the saledate column.
Run your recipe.
Here is our cumulative sum. We can have fun and graph this:
- In blue the cumulative amount
- In green the non-windowed sum on the same period (in other words, the increment on this period)
Let's do one final thing with this dataset. For each bulldozer, we want to have a moving average of the 10 previous sales of this ProductGroup. This will allow us to have a vision of how the market is moving.
We'll be using the "Window frame" feature to limit the window to the 10 rows preceding the current row (up to the current row).
Configure your window this way:
Activate AVG on SalePrice and Run. As we could expect, this produces a smoothed average, also very useful to have nice charts and a vision of long-term evolution!
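The moving average corresponds to a partitioned, ordered window with a row-based frame. A sqlite3 sketch on toy data (using 2 PRECEDING instead of the tutorial's 10, to keep the example readable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (ProductGroup TEXT, saledate TEXT, SalePrice REAL);
INSERT INTO sales VALUES
  ('BL', '2001-01-01', 10), ('BL', '2001-02-01', 20),
  ('BL', '2001-03-01', 30), ('BL', '2001-04-01', 40),
  ('TTT', '2001-01-15', 100);
""")

# Frame limited to the 2 rows before the current row, within each group.
rows = conn.execute("""
    SELECT ProductGroup, saledate, SalePrice,
           AVG(SalePrice) OVER (
               PARTITION BY ProductGroup ORDER BY saledate
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM sales ORDER BY ProductGroup, saledate
""").fetchall()
# BL moving averages: 10.0, 15.0, 20.0, 30.0
# TTT restarts with its own window: 100.0
```

Note how the moving average restarts for each ProductGroup, because the partition clause isolates the groups from each other.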
Enough with the bulldozers, let's switch to our other dataset: a web log. It contains events on a website. For each event, we have:
- its date
- visitor_id: the unique identifier of the visitor
- session_id: automatically reset each time a visitor comes back on the site
- page_id: the name of the page on the site
Notice these cool page_ids? Star Wars planets! They're all made with our Anonymizer plugin.
There can be several events on the same page (for example because the user clicked a button on a page, and that triggered an event).
We want to only keep the first event of each page for each session (regardless of whether the user went back to the page). Window functions can do that!
Create a window recipe, and let's make a window with the following settings:
- Partitioned by session_id and page_id
- Ordered by server_ts (Ascending)
- Aggregations: RANK (global), and keep all columns on Retrieve for the moment.
If you run the recipe, you'll see that we have now added the index (or rank) of each event within its (session, page) group.
What we want now is to only keep the rows where rank=1. We could do that with a prepare recipe, but that feature is actually built in. Go back to the recipe, in the Post-Filter section, and activate a filter on "rank == 1".
Our output dataset now only contains the events that are the first in the (session, page) couple. In other words, the first time that the user interacted with a page within a given session.
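In SQL terms, the recipe's post-filter wraps the windowed query in a subquery and filters on the rank. A sqlite3 sketch on toy events (page names are made up in the spirit of the tutorial's data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (session_id TEXT, page_id TEXT, server_ts TEXT);
INSERT INTO events VALUES
  ('s1', 'tatooine_1', '2015-01-01 10:00:00'),
  ('s1', 'tatooine_1', '2015-01-01 10:00:05'),
  ('s1', 'hoth_2',     '2015-01-01 10:00:11');
""")

# Rank within each (session, page) window, then keep only rank = 1.
# A window function cannot appear in WHERE directly, hence the subquery.
rows = conn.execute("""
    SELECT session_id, page_id, server_ts FROM (
        SELECT *, RANK() OVER (
            PARTITION BY session_id, page_id
            ORDER BY server_ts) AS rnk
        FROM events)
    WHERE rnk = 1 ORDER BY server_ts
""").fetchall()
# -> the 10:00:05 duplicate on tatooine_1 is gone; 2 rows remain
```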
Now it would be great to know the time that the user needed between each first-page-interaction in its session. In other words, we want to compute the difference in date between the rank=1 items that we extracted, within each session.
There's a Window function for that: LAG. LAG retrieves the value of a column in the previous (or a previous) row and inserts it in the current row in another column. Furthermore, DSS adds the "Lagdiff" function to automatically compute the difference between this previous value and the current value.
For this use case, we'll create a new window recipe based on the dataset that we just computed (with rank==1). In other words, we chain two Window recipes.
We want to:
- partition by session only, ordered by ascending date
- Retrieve the columns we want to keep
- Compute "Lag" and "Lagdiff" on the server_ts column
The "Lag offset" is 1 because we want to retrieve the value of the 1st previous row. For LagDiff we also need to specify the unit in which the time difference will be expressed. Let's choose seconds.
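A sketch of LAG and the time difference in seconds with sqlite3 (DSS's "Lagdiff" handles the date arithmetic for you; here it is emulated with strftime('%s') epoch seconds on toy timestamps):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE firsts (session_id TEXT, page_id TEXT, server_ts TEXT);
INSERT INTO firsts VALUES
  ('s1', 'tatooine_1', '2015-01-01 10:00:00'),
  ('s1', 'hoth_2',     '2015-01-01 10:00:11'),
  ('s1', 'endor_3',    '2015-01-01 10:00:56');
""")

# LAG(col, 1) fetches the previous row's timestamp within the session;
# subtracting epoch seconds gives the equivalent of Lagdiff in seconds.
rows = conn.execute("""
    SELECT page_id,
           LAG(server_ts, 1) OVER w AS prev_ts,
           strftime('%s', server_ts)
             - strftime('%s', LAG(server_ts, 1) OVER w) AS lag_diff_seconds
    FROM firsts
    WINDOW w AS (PARTITION BY session_id ORDER BY server_ts)
    ORDER BY server_ts
""").fetchall()
# -> first row has no previous event (NULL); then 11 s and 45 s
```

The first row of each session gets NULL, since there is no previous row to lag from, which matches the "no info" rows you'll see in the output below.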
Run your recipe and let's look at a few answers.
- 001516e1b8f688b only visited one page, malastare_2, so there is no lag info.
- trandosha_2: the first event being the first one, there is no info. But on the second one, we see that it came 11 seconds after.
- The following few sessions only have one page each, so no lag info.
- 008615e12f23064 visited quite a bunch of pages, staying 5 seconds on the first, then 45 seconds, then 21, ...
Window functions are an incredibly powerful tool. With its visual recipe, DSS puts them in everyone's hands!
We have only scratched the surface. The DSS window recipe can do much more:
- Bounding windows using ranges of values. Think: compute the sum of all items bought in the 7 previous days
- Multiple windows computed at once: the same, on the 7, 15, 30, 60 previous days, in a single recipe
- Custom aggregations: compute things like the "overpricing ratio" within the Window recipe
- Pre filters (like a WHERE in SQL)
- Conversion to a SQL, Hive or SparkSQL recipe to go further in your analysis