Sessionization boils down to combining discrete events into sessions, a unit of measurement widely used when dealing with time series data.
A session can often be viewed as the ordered list of a user's actions in completing a task. This is why it is widely used in several domains. Here are a few examples:
The most common use is in web analytics. A session is composed of all the actions performed by a user during one particular visit to the website. You can think of it as a buying session on an e-business website, for example. Using sessions enables you to answer questions such as: what are the most frequent paths to purchase on my website? How do my users reach a specific page? When and why do they leave? Are some acquisition funnels more efficient than others?
Given the GPS coordinate history of a vehicle, it is useful to compute sessions in order to extract its different trips. Each trip can then be labelled distinctly (user going to work, on holiday, ...).
In predictive maintenance, a session can group all the information related to a machine's behaviour (machine working, idle, etc.) until a change in its assignment.
So given a "user id" or a "machine id", the question is: how can we recreate sessions? Or more precisely: how do we choose the boundaries of a session? Among the above examples, the third one is easy to deal with: a new session starts whenever there is a change of assignment. If we look at the first two examples however, we see that a new session can be defined when the user has been inactive for a certain amount of time T. In this case, a session is therefore defined by two things: a user id and a threshold time T such that, if the time elapsed before the next action is greater than T, that action marks the start of a new session.
This threshold can depend on the constraints of the website itself: for instance, banks close their users' navigation sessions after around 15 minutes of inactivity, and so do many e-business companies.
Note that the first two examples can be seen as a generalisation of the third one, where we do not have clear boundaries. In the rest of this article, we will therefore focus on this type of session creation. For now, we will use this toy dataset as a support for understanding. Once imported into DSS, the dataset should look like this.
Note that this dataset is already sorted by time and that there are two different users.
We are going to assume that your SQL database system supports window functions (otherwise, you will have a hard time doing advanced analytics!). In this example, we will be using PostgreSQL.
The objective is to detect on which line a new session starts in the ordered dataset. To do that, we need to calculate the interval between two consecutive lines of the same user. The window function lag() does it for us:
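A query along these lines would do it (a sketch assuming the toy dataset is a table named `toy_data` with a `user_id` column and a `mytimestamp` timestamp column; adjust the names to match your data):

```sql
SELECT
    user_id,
    mytimestamp,
    -- Difference between this event and the previous event of the same user
    mytimestamp - LAG(mytimestamp)
        OVER (PARTITION BY user_id ORDER BY mytimestamp) AS time_interval
FROM toy_data
ORDER BY user_id, mytimestamp;
```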
This SQL query outputs a new column containing the difference between the timestamp of the current row and that of the previous row for the same user_id. Notice that the first row of each user contains a NULL time interval, since there is no previous value to compute it from.
Based on this interval, we are now going to flag the start of each session for a given user. Assuming a new session is defined after 30 minutes of inactivity, we can slightly transform the previous expression into this one:
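A possible version of the flagging query, under the same assumptions about table and column names as before:

```sql
SELECT
    user_id,
    mytimestamp,
    -- 1 when the gap since the user's previous event is at least 30 minutes,
    -- 0 otherwise (including the user's first row, where LAG is NULL)
    CASE
        WHEN mytimestamp - LAG(mytimestamp)
             OVER (PARTITION BY user_id ORDER BY mytimestamp)
             >= INTERVAL '30 minutes'
        THEN 1
        ELSE 0
    END AS new_session
FROM toy_data;
```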
This query creates a boolean column, with a value of 1 indicating the start of a new session and 0 otherwise.
Finally, we can just compute a cumulative sum over this boolean column to create an incremental session number. To make it easier to visualize, we concatenate it with the user_id, and then build our final "session_id" column:
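Putting the pieces together, a sketch of the full query (same hypothetical table and column names as above):

```sql
SELECT
    user_id,
    mytimestamp,
    -- Running count of session starts per user, prefixed with the user id
    user_id || '_' || SUM(new_session)
        OVER (PARTITION BY user_id ORDER BY mytimestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM (
    SELECT
        user_id,
        mytimestamp,
        CASE
            WHEN mytimestamp - LAG(mytimestamp)
                 OVER (PARTITION BY user_id ORDER BY mytimestamp)
                 >= INTERVAL '30 minutes'
            THEN 1
            ELSE 0
        END AS new_session
    FROM toy_data
) flagged;
```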
We finally have our sessionized data! We can now move on to more advanced analytics and derive new KPIs.
Note that this code is enough for this simple example. However, if we had a large partitioned Hive table, a simple increment may not be enough (we would get collisions across partitions). In that case, it is possible to concatenate the user id with the epoch timestamp of the first row of the session. This can be done by calculating the first timestamp of each session and joining it back on the session_id previously calculated.
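One way to sketch this in PostgreSQL syntax, assuming the result of the previous query has been stored in a hypothetical `sessionized` table:

```sql
SELECT
    s.user_id,
    s.mytimestamp,
    -- Collision-free id: user id + epoch of the session's first event
    s.user_id || '_' || CAST(EXTRACT(EPOCH FROM m.session_start) AS bigint)
        AS global_session_id
FROM sessionized s
JOIN (
    -- First timestamp of each session
    SELECT session_id, MIN(mytimestamp) AS session_start
    FROM sessionized
    GROUP BY session_id
) m
  ON s.session_id = m.session_id;
```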
If your data is stored in Hadoop (HDFS) and you can use Hive (version >= 0.11, where window functions were introduced), creating sessions is very similar to the previous example in PostgreSQL. To make it work properly, just make sure that your timestamp column has been serialized as "Hive" in your DSS dataset (in Dataset => Settings => Preview).
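A Hive version of the same logic might look like this (a sketch with the same hypothetical table and column names; Hive lacks interval arithmetic, so we compare epoch seconds instead):

```sql
SELECT
    user_id,
    mytimestamp,
    CONCAT(user_id, '_',
           CAST(SUM(new_session)
                OVER (PARTITION BY user_id ORDER BY mytimestamp) AS STRING)
    ) AS session_id
FROM (
    SELECT
        user_id,
        mytimestamp,
        -- 30 minutes = 1800 seconds of inactivity starts a new session
        CASE
            WHEN unix_timestamp(mytimestamp)
                 - LAG(unix_timestamp(mytimestamp))
                   OVER (PARTITION BY user_id ORDER BY mytimestamp) >= 1800
            THEN 1
            ELSE 0
        END AS new_session
    FROM toy_data
) flagged;
```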
In DSS, do not forget to set your dataset's quoting style to "No escaping nor quoting". To get rid of the date parsing issues in Pig, make sure to set the Date Serialization Format (in Dataset, Settings, Preview) to "ISO".
The Pig script below reproduces the previous examples, including setting the limit for new sessions to 30 minutes:
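One way to write it is with the Sessionize UDF from the Apache DataFu library, which appends a UUID-based session id to each event given an inactivity timeout. This is a sketch: the jar path, file location, and column names are assumptions to adapt to your setup, and Sessionize expects the ISO-formatted timestamp as the first field of each tuple:

```pig
-- Register the Apache DataFu jar (path is a placeholder; adjust to your install)
REGISTER '/path/to/datafu-pig.jar';

-- 30 minutes of inactivity defines a new session
DEFINE Sessionize datafu.pig.sessions.Sessionize('30m');

-- Hypothetical input: ISO timestamp first, then the user id
events = LOAD 'toy_data.csv' USING PigStorage(',')
         AS (mytimestamp:chararray, user_id:chararray);

events_grouped = GROUP events BY user_id;

sessionized = FOREACH events_grouped {
    ordered = ORDER events BY mytimestamp;
    -- Sessionize appends a session_id column to each event
    GENERATE FLATTEN(Sessionize(ordered))
             AS (mytimestamp, user_id, session_id);
};
```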
The final result is different here: there is no increment, but a session_id built from a hash. As before, it is possible to use this id to create a new one as the concatenation of the user_id and the first unix timestamp of the session.
Finally, let's build sessions using Python. There are several ways to do it, from a pure Python implementation to reproducing the logic above using the well-known pandas library, which is what we are going to show here.
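A pandas version of the logic might look like the following sketch. The column names mirror the hypothetical ones used in the SQL examples, and a small inlined DataFrame stands in for the toy dataset:

```python
import pandas as pd

# Stand-in for the toy dataset (hypothetical values and column names)
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "mytimestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:10:00",
        "2024-01-01 11:00:00", "2024-01-01 09:00:00",
        "2024-01-01 09:05:00",
    ]),
})

df = df.sort_values(["user_id", "mytimestamp"])

# Interval since the user's previous event: shift() is the lag() equivalent
df["time_interval"] = df["mytimestamp"] - df.groupby("user_id")["mytimestamp"].shift()

# Flag session starts: gap of at least 30 minutes
# (the NaT on a user's first row compares as False, so it stays 0)
df["new_session"] = (df["time_interval"] >= pd.Timedelta(minutes=30)).astype(int)

# Cumulative sum per user gives an incremental session number,
# concatenated with user_id to build the final session_id
df["session_id"] = (
    df["user_id"] + "_"
    + df.groupby("user_id")["new_session"].transform("cumsum").astype(str)
)

print(df[["user_id", "mytimestamp", "session_id"]])
```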
The output of this code should give you a sessionized dataset similar to the Hive/PostgreSQL examples. We can see that the shift function is the equivalent of lag in PostgreSQL, while groupby combined with transform plays the role of a window function.
By now, you should be able to sessionize your data in your favorite language (we didn't show how to do it in R; if you are interested, please ping us) and derive the KPIs you need, whether you have weblogs, telematics, or other kinds of data.