Logs can be hard to parse. Even for standard format like apache logs files, one often need to code several regular expressions to be able to extract all the information he needs.
As a toy example, we provide this randomly generated dataset.
If you upload the data and go to the preview page, you can see that Dataiku DSS detects automatically the format "Apache combined log format" and does the log parsing for you. The common fields are retrieved: the ip address, the user (not available in this example), the timestamp, the requested url, the request status code , the size of the object returned, the referer (from where the user made the request), and the user agent (device used by the user).
Let's create the dataset and a new Analysis on top of it. We will see that DSS enables you to clean or enrich all the newly created fields.
Let's enrich the ip address first with geolocalization information. To do so, click on the ip column and select "Resolve GeoIP".
Select only the information of interest. For example: the country and the GeoPoint.
At this point, we already have enriched our logs with a country and position information without having to code anything. The country (as well as other geographic dimensions) could be used for aggregations. If you have installed the right plugins, you can also easily recreate this chart.
The user agent information is paramount in any web log analysis. Is the user using a computer, a mobile phone or a tablet? Which browser is the most used on my website? Which one generates the more error status? Is there a correlation between the device used and the probability of a sale? These are questions you can answer when the user agent has been interpreted.
Again, this field is not easy to parse by hand. Luckily DSS has a dedicated processor. Let's click on the "user_Agent" column and on the "Classify User-agent" suggestion.
We can remove the columns that are less interesting, and keep only the user agent type, brand and operating system. It seems that we only have bot or browser here, most of them used via Internet Explorer on Windows.
The referer tells you where your customers are coming from. The referer field is detected by DSS as being a URL, though a small percentage of them seems to be invalid (red part of the quality assessing bar). Let's filter by invalid values: click on the "referer" column, and select the "On validity" / "Invalid rows only" in the "Filter" part of the menu.
Now, we have only one line left. It seems that the invalid field contains the string "-". A contextual menu pops up when selecting this value. Select "Clear cells with this value" to have a cleaned referer, ready to be parsed.
Let's open again the contextual menu by clicking on the referer column, and choose "Split into host, port ... "
Several new fields appear:
Select manually a '/' in one referer_path cell and click on "Split column on '/'". As this could create a lot of columns, notice that it is possible to use the "Truncate" option in the processor to keep only the root columns of the path. Now, the analysis module reveals that most requests come from the products, blog and applications part of the website.
Some transformations are not logs specific but make sense here.
That's it! Your preparation script is ready to be deployed. You can do this by clicking on the dedicated button on the top right. If you were to have a big filesystem to be synchronised in hdfs, do not forget to choose hdfs as the output connection. Note that these transformations could also be run partition by partition if needed.
Now that your web logs are cleaned and enriched, it may be time to try to sessionize them. Do to this, follow this how to. Otherwise, it may be better to sessionize the data first and then to enrich it once aggregated (on session or user), with geolocalization and user agent information for instance. This could save a lot of space.