Dataiku Web Logs Analysis (Web Logs and Hadoop Tutorial)

Your website's logs contain a wealth of information about your clients, even though that is not their primary purpose. By following this tutorial, you will be able to clean and enrich your web logs, recreate user sessions, derive session KPIs, and start a basic analysis of the customer path.

We will use an extract of the dataiku.com web logs, available [here]. The file uses the standard Apache log format, whose specification can be found here.
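For reference, a line in the Apache combined log format looks like this (the example given in the Apache httpd documentation):

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The fields are the client IP address, identd and userid (often '-'), the request timestamp, the request line, the HTTP status code, the response size in bytes, the referer, and the user agent.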

This tutorial being a bit long, it is divided into three main parts:

- Part 1: Import, Clean and Enrich Data
- Part 2: Recreating Sessions and Deriving a Few KPIs
- Part 3: Basic Customer Path Analysis

Prerequisites

We assume that you are already familiar with the basic concepts behind DSS. If not, please start here. This tutorial requires a DSS instance on a server connected to an up-and-running Hadoop cluster. We are also going to use the Pig Latin and Hive languages. To recreate user sessions we will use the DataFu UDFs, so you need to download the corresponding jar and put it on the DSS server.

Part 1: Import, Clean and Enrich Data

Uploading Data

Let's start by uploading the web log dataset into a new project. On the preview screen, you should see that the detected type is Apache Combined log format.



This means that you do not have to parse the lines by hand. However, if you are working with a different web log format, you may need to use the regexp extractor instead.
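If you do need to fall back on the regexp extractor, a pattern along the following lines captures the fields of the combined format (a sketch only; adapt it to your actual format, and note that malformed request lines will not match):

    ^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$

The eleven capture groups correspond to the IP address, identd, userid, timestamp, HTTP method, requested path, protocol, status code, response size, referer, and user agent.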

Cleaning Data

Let's create a new preparation script. You can immediately see that the referer column contains the value '-', which encodes a missing value. Simply click on one of these cells and use the "Clear cells with this value" processor. You can also use the "Remove rows with this value" processor on the IP address 127.0.0.1, which corresponds to the server itself.
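If you would rather perform this cleaning in a code recipe, the same logic can be sketched in Pig Latin (the relation and column names here are illustrative assumptions):

    -- Drop the server's own requests and turn '-' referers into nulls (sketch)
    cleaned = FOREACH (FILTER logs BY ip != '127.0.0.1')
              GENERATE ip, apachetime, request,
                       ((referer == '-') ? NULL : referer) AS referer;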

In this tutorial we are only interested in the user path, so the only requests we want to keep are the ones for actual web pages. A good way to do that is to remove all requests containing a "." such as ".png", ".jpg" or ".js" (see the pattern sketched below). To do so:
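One way to express this filter is a regular expression on the request column, for instance in a regex-based row filter (a sketch; the exact processor and column names depend on your setup). The pattern below targets common static-asset extensions; removing every request containing "." is a cruder but simpler alternative:

    \.(png|jpe?g|gif|css|js|ico)

Rows whose request matches this pattern correspond to static assets and can be removed.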





Enriching Data

Now that the data is a bit cleaner, let's enrich it with geographic, time, and user-agent information. Start by clicking on the ip column name and select the "Resolve GeoIP" suggestion from the contextual menu.



DSS enables you to geolocate each IP address. Select at least the country code and latitude/longitude.

Now let's add a time component analysis. Click on the "apachetime" column and select "Parse date"; Apache timestamps look like 10/Oct/2000:13:55:36 -0700. Call the newly created column "apachetime". For more information on the parsing of dates in DSS, refer to this how-to.

Now let's add some information about the referer. The referer identifies the address of the web page that linked to the resource being requested. Click on the referer column and use the suggested "Split into host, port ..." processor.

Finally, let's enrich the data with human-readable information from the user agent. Click on the userAgent column name and use the "Classify User-Agent" processor.

Part 2: Recreating Sessions and Deriving a Few KPIs
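To give an idea of what this part involves, here is a minimal Pig Latin sketch of sessionization with the DataFu Sessionize UDF. The jar path, dataset name, schema, and 30-minute inactivity timeout are illustrative assumptions; the actual recipe will depend on your DSS and cluster setup:

    -- Register DataFu and define a 30-minute session timeout (assumed values)
    REGISTER '/path/to/datafu.jar';
    DEFINE Sessionize datafu.pig.sessions.Sessionize('30m');

    -- Sessionize expects the first field of each tuple to be an
    -- ISO-8601 timestamp, and the tuples to be ordered by time
    logs = LOAD 'weblogs_prepared'
           AS (apachetime:chararray, ip:chararray, request:chararray);

    sessions = FOREACH (GROUP logs BY ip) {
        ordered = ORDER logs BY apachetime;
        -- Sessionize appends a session_id to every tuple
        GENERATE FLATTEN(Sessionize(ordered))
                 AS (apachetime, ip, request, session_id);
    };

Grouping by IP is a simple proxy for a visitor; a session is then cut whenever two consecutive hits from the same IP are more than 30 minutes apart.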

Part 3: Basic Customer Path Analysis