Enriching Web Logs

August 05, 2015

Logs can be hard to parse. Even for standard formats like Apache log files, one often needs to write several regular expressions to extract all the needed information.
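To give an idea of what parsing by hand involves, here is a minimal Python sketch for the combined format (the regex and the sample line below are illustrative, not what DSS uses internally):

    import re

    # One regular expression for the Apache combined log format:
    # ip, identd, user, timestamp, request, status, size, referer, user agent.
    COMBINED_RE = re.compile(
        r'(?P<ip>\S+) (?P<identd>\S+) (?P<user>\S+) '
        r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
    )

    line = ('1.2.3.4 - - [05/Aug/2015:10:12:42 +0000] "GET /blog HTTP/1.1" '
            '200 2326 "http://www.example.com/products" "Mozilla/5.0"')
    match = COMBINED_RE.match(line)
    if match:
        print(match.group('ip'), match.group('status'), match.group('referer'))

One pattern like this only covers one log format; DSS spares you from maintaining it.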

Importing the data

As a toy example, we provide this randomly generated dataset.

If you upload the data and go to the preview page, you can see that Dataiku DSS automatically detects the "Apache combined log format" and does the log parsing for you. The common fields are retrieved: the IP address, the user (not available in this example), the timestamp, the requested URL, the request status code, the size of the object returned, the referer (where the user made the request from), and the user agent (the device used by the user).

Let's create the dataset and a new Analysis on top of it. We will see that DSS enables you to clean or enrich all the newly created fields.

GeoIP enrichment

Let's enrich the IP address first with geolocation information. To do so, click on the ip column and select "Resolve GeoIP".

Select only the information of interest. For example: the country and the GeoPoint.
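Outside of DSS, a comparable enrichment can be sketched in Python with the geoip2 package and a MaxMind GeoLite2 database (the database file and example IP below are assumptions, and the results depend on the database version):

    import geoip2.database

    # Assumes a GeoLite2 City database has been downloaded locally
    # (the path is hypothetical).
    reader = geoip2.database.Reader('GeoLite2-City.mmdb')

    response = reader.city('128.101.101.101')    # example IP
    country = response.country.name              # e.g. 'United States'
    geopoint = (response.location.latitude,      # lat/lon pair, the
                response.location.longitude)     # equivalent of a GeoPoint
    reader.close()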

At this point, we have already enriched our logs with country and position information without having to write any code. The country (as well as other geographic dimensions) can be used for aggregations. If you have installed the right plugins, you can also easily recreate this chart.

User agent parsing

The user agent information is paramount in any web log analysis. Is the user on a computer, a mobile phone, or a tablet? Which browser is the most used on my website? Which one generates the most error statuses? Is there a correlation between the device used and the probability of a sale? These are questions you can answer once the user agent has been interpreted.

Again, this field is not easy to parse by hand. Luckily, DSS has a dedicated processor. Let's click on the "user_Agent" column and on the "Classify User-agent" suggestion.
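As a point of comparison, the same kind of classification can be sketched in Python with the user_agents package (not the processor DSS uses internally; the commented outputs are illustrative and depend on the library version):

    from user_agents import parse  # pip install user-agents

    ua = parse('Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko')
    print(ua.is_bot)          # False -> a browser, not a crawler
    print(ua.browser.family)  # e.g. 'IE'
    print(ua.os.family)       # e.g. 'Windows'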

We can remove the less interesting columns and keep only the user agent type, brand, and operating system. It seems that we only have bots and browsers here, most of them using Internet Explorer on Windows.

Referer URL parsing

The referer tells you where your customers are coming from. DSS detects the referer field as a URL, though a small percentage of the values seem to be invalid (the red part of the quality bar). Let's filter on invalid values: click on the "referer" column, and select "On validity" / "Invalid rows only" in the "Filter" part of the menu.

Now, only one line is left. It seems that the invalid field contains the string "-". A contextual menu pops up when you select this value. Select "Clear cells with this value" to get a clean referer, ready to be parsed.
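In a script outside DSS, the same cleanup could be sketched with pandas (the toy DataFrame below is illustrative):

    import pandas as pd

    df = pd.DataFrame({'referer': ['http://www.example.com/blog', '-']})
    # "-" is Apache's placeholder for a missing referer; treat it as missing
    df.loc[df['referer'] == '-', 'referer'] = None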

Let's open the contextual menu again by clicking on the referer column, and choose "Split into host, port ... "

Several new fields appear:

  • the scheme contains the communication protocol (http, https)
  • the host field contains the website from which the request was made. A quick analysis (click on the column name and "Analysis") reveals which website most queries come from
  • the referer_path is not as easy to analyse as-is, but we can use its hierarchical structure to make more sense of it.
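For reference, Python's standard urllib.parse performs the same decomposition, which makes these fields easy to verify (the URL below is illustrative):

    from urllib.parse import urlsplit

    parts = urlsplit('http://www.example.com:80/products/shoes?color=red')
    print(parts.scheme)    # 'http' -> communication protocol
    print(parts.hostname)  # 'www.example.com' -> the host field
    print(parts.port)      # 80
    print(parts.path)      # '/products/shoes' -> the referer_path equivalent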

Manually select a '/' in one referer_path cell and click on "Split column on '/'". As this could create a lot of columns, note that the processor's "Truncate" option lets you keep only the root levels of the path. Now, the analysis module reveals that most requests come from the products, blog, and applications parts of the website.
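The split-and-truncate logic itself is simple; here is a minimal Python sketch (the paths and the truncation depth are illustrative):

    def split_path(path, truncate=2):
        """Split a URL path on '/' and keep only the first `truncate` levels."""
        return path.strip('/').split('/')[:truncate]

    print(split_path('/products/shoes/sneakers/42'))  # ['products', 'shoes']
    print(split_path('/blog/2015/08/web-logs', truncate=1))  # ['blog']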

Other transformations

Some transformations are not log-specific but make sense here.

  • We could extract interesting information by parsing the date of the log. Is the hour of the day relevant for your traffic analysis? Probably, and so are the day of the week and bank holidays (see the sketch after this list).
  • We could further parse the request the same way we did for the referer_path. To do so, select the space character between the HTTP method (GET, POST, HEAD, etc.) and the path. Three new columns are generated. The second one corresponds to the path requested by the user. Together with the referer path, this is the column that lets you analyse and understand the paths your clients take on your website.
  • In our example, there is no HTTP query string, but it could contain valuable information from dynamic, user-supplied data. Feel free to try the corresponding DSS processor on your own data.
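Here is a plain-Python sketch of these three transformations (the log values below are illustrative):

    from datetime import datetime
    from urllib.parse import parse_qs

    # Parse the Apache timestamp and derive the hour of day and day of week
    ts = datetime.strptime('05/Aug/2015:10:12:42 +0000', '%d/%b/%Y:%H:%M:%S %z')
    hour, weekday = ts.hour, ts.strftime('%A')  # 10, 'Wednesday'

    # Split the request line into method, path, and protocol
    method, path, protocol = 'GET /products/shoes?color=red HTTP/1.1'.split(' ')

    # Parse the query string, if any
    query = parse_qs(path.partition('?')[2])    # {'color': ['red']}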

That's it! Your preparation script is ready to be deployed. You can do this by clicking on the dedicated button at the top right. If you have a big filesystem to synchronize to HDFS, do not forget to choose HDFS as the output connection. Note that these transformations could also be run partition by partition if needed.

Now that your web logs are cleaned and enriched, it may be time to sessionize them. To do this, follow this how-to. Alternatively, it may be better to sessionize the data first and then enrich it once aggregated (by session or user), with geolocation and user agent information for instance. This could save a lot of space.