As the name suggests, Dataiku Data Science Studio (DSS) is a tool that helps data teams increase their efficiency when prototyping and deploying data-driven applications.
From what we are seeing on the market, building a successful data-centric organization is usually (at least) a two-part effort:
a top-down strategy, creating a horizontal data science team responsible for solving business challenges through data products - the Data Lab
a bottom-up strategy, empowering analysts and BI specialists at every level and in every business unit with the ability to conduct advanced analyses on ever more complex data
Through its Visual Analysis layer, Dataiku DSS offers a number of features that help analysts explore, prepare, enrich and visualise various types of structured and unstructured data.
The philosophy behind this is threefold:
with more volume and more complexity in incoming data, traditional analytics tools (like Excel) are showing some limitations when addressing advanced analytics use cases
part of the efficiency of a data team comes from being able to organise collaboration between different profiles (read: less data cleaning by data scientists, more advanced analyses done directly by analysts)
business analysts are usually the ones with the best understanding of the business challenges and the data, so empowering them with advanced self-service analytics tools can lead to high-impact recommendations
In this post I will take you through some of my favorite code-free features.
Usually, the first step to working with data is... getting data (duh). To help with that, Dataiku DSS has a number of helpful features and connectors that will let you upload datasets and work with XLSX and CSV files, or connect easily to various types of data sources (databases, server-hosted files, connections to business applications…).
Setting up data connections in Dataiku DSS
For instance, when you upload a CSV or Excel file, DSS automatically recognises separators and character encoding (UTF-8 headaches, anyone?), displays a preview of the dataset so you can check that everything is in order, indicates the number of lines, and lets you change some parameters (skipping lines, handling column headers…) to make sure that the data is consistent. One feature that I find really useful is the ability to automatically stack identical files into one dataset by dragging and dropping them into the interface, and to merge Excel tabs into one dataset.
Merging Excel tabs in Dataiku DSS
Stacking Excel files in Dataiku DSS
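You never have to write this yourself in DSS, but for readers curious about what separator detection and file stacking involve, here is a minimal Python sketch using only the standard library. It is not how DSS does it internally. The helper names (`read_delimited`, `stack`) and the sample data are made up for illustration.

```python
import csv
import io

def read_delimited(text):
    """Guess the delimiter of a text sample and parse it into dict rows."""
    # csv.Sniffer inspects the sample and infers the most likely delimiter.
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    return dialect.delimiter, list(reader)

def stack(tables):
    """Stack tables that share the same columns into one list of rows."""
    stacked = []
    columns = None
    for rows in tables:
        for row in rows:
            if columns is None:
                columns = list(row.keys())
            elif list(row.keys()) != columns:
                raise ValueError("files do not share the same columns")
            stacked.append(row)
    return stacked

# Hypothetical monthly exports with identical structure.
january = "store;revenue\nParis;100\nLyon;80\n"
february = "store;revenue\nParis;120\nLyon;95\n"

delimiter, jan_rows = read_delimited(january)
_, feb_rows = read_delimited(february)
all_rows = stack([jan_rows, feb_rows])
```

The column check in `stack` is the code equivalent of the preview step: it refuses to merge files that do not really have the same layout.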
You can also rename columns and set data types directly in the schema panel to force DSS to store / handle certain columns in specific formats.
Time-based features are commonplace in data-driven use cases, and they can be a real pain to work with. Depending on the original format, you might have to do some heavy recoding in order to parse the dates into a recognised date feature. In DSS, the “Smart Date” processor will recognise probable date formats and suggest different parsing options, showing you how well each option performs.
If the automatic parsing doesn’t work, you’ll also be able to show DSS what the original format looks like so that it can transform it.
Smart Date Processor
Working with dates in Dataiku DSS
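To give an idea of what "suggesting parsing options and showing how well each performs" can mean, here is a rough Python sketch: it tries a few candidate formats and reports the share of values each one parses. This is an assumption about the general idea, not the Smart Date processor's actual logic, and the candidate list and sample values are invented.

```python
from datetime import datetime

# A small, illustrative list of candidate formats (not exhaustive).
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"]

def score_formats(values, formats=CANDIDATE_FORMATS):
    """Return (format, share of values parsed) pairs, best format first."""
    if not values:
        return []
    scores = []
    for fmt in formats:
        parsed = 0
        for value in values:
            try:
                datetime.strptime(value, fmt)
                parsed += 1
            except ValueError:
                # This value does not match the candidate format.
                pass
        scores.append((fmt, parsed / len(values)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

raw_dates = ["25/12/2015", "03/01/2016", "14/02/2016", "07/04/2016"]
ranking = score_formats(raw_dates)
best_format, best_share = ranking[0]
```

On this sample, the day-first format parses every value while the month-first format only parses half of them, which is exactly the kind of ambiguity that makes dates painful.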
Another very useful thing when preparing for modelling: you can enrich your data and create time-based features in a few clicks, such as extracting date components (month, hour, day of week, week of year…), calculating differences between date columns, flagging national holidays, etc.
Time and date processors in Dataiku DSS
Creating time-based features automatically in Dataiku DSS
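For comparison, here is roughly what those few clicks replace if you were writing the features by hand in Python. This is a sketch only: the `date_features` function, the column names and the tiny holiday calendar are hypothetical.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; a real one would come from a reference table.
HOLIDAYS = {date(2016, 1, 1), date(2016, 12, 25)}

def date_features(ts, reference):
    """Derive simple time-based features from a timestamp."""
    return {
        "month": ts.month,
        "hour": ts.hour,
        "day_of_week": ts.isoweekday(),        # 1 = Monday ... 7 = Sunday
        "week_of_year": ts.isocalendar()[1],   # ISO 8601 week number
        "days_since_reference": (ts - reference).days,
        "is_holiday": ts.date() in HOLIDAYS,
    }

order = datetime(2016, 1, 1, 14, 30)
signup = datetime(2015, 12, 25, 9, 0)
features = date_features(order, signup)
```

Note that 1 January 2016 falls in ISO week 53 of 2015, a good reminder that even "week of year" has its traps.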
It looks bad, and short of using complex Regular Expressions, there is no easy way to clean and structure that data. In DSS, using a combination of Text Cleaning processors (Split, Find and Replace, Truncate), I was able to quickly extract and create new columns containing the information I wanted.
Dataiku DSS visual data preparation
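As an illustration of the Split, Find and Replace, and Truncate combination, here is a small Python sketch on an invented raw string. The listing format, the field names and the `clean_listing` helper are all hypothetical, and the real dataset was different.

```python
def clean_listing(raw, max_len=20):
    """Turn one messy 'name | price | city' string into structured fields."""
    # Split: break the raw string into fields on the separator.
    name, price, city = [part.strip() for part in raw.split("|")]
    # Find and replace: drop the currency and normalise the decimal mark.
    price = price.replace("EUR", "").replace(",", ".").strip()
    # Truncate: keep only the first max_len characters of the name.
    name = name[:max_len]
    return {"name": name, "price": float(price), "city": city}

raw = "  Vintage road bike, great condition | 149,90 EUR | Paris "
row = clean_listing(raw)
```

Each line maps to one visual step, which is why chaining processors feels natural even without writing any code.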
This is a good example of how data team collaboration can work: I needed the help of a data scientist to crawl the data (he used Python), but I was able to clean the data, create the output dataset and use it for analysis on my own.
I have seen several occasions where a good crawl of all the tweets related to a given theme gave me (or the client I was working for) an edge. Although there are a lot of specific tools to do this and conduct advanced analysis on the tweets, DSS features a simple connector that will allow you to call the REST API and retrieve tweets and related information (user handle, location, hashtags…) based on keywords or hashtags.
Setting up Twitter keywords in Dataiku DSS (1/2)
Setting up Twitter keywords in Dataiku DSS (2/2)
And working with the output Dataset
Working with Twitter data in Dataiku DSS
Once you have this, you can use Text Analysis features to cluster similar tweets, split in words or n-grams, simplify and remove stop words, all of which can be a good first step towards doing some sentiment mining.
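For the curious, the word splitting, stop word removal and n-gram steps boil down to something like the Python sketch below. It is deliberately simplistic: the stop word list is a tiny illustrative sample, the tweet is invented, and the DSS processors do more than this.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "in", "for"}

def tokenize(text):
    """Lowercase the text, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    # Keep a leading # or @ so hashtags and handles stay recognisable.
    return re.findall(r"[#@]?\w+", text)

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Build sequences of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "The new release of #dataiku is great for analysts http://example.com"
tokens = remove_stop_words(tokenize(tweet))
bigrams = ngrams(tokens, 2)
```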
Who said you needed to be a data scientist to use machine learning techniques? DSS allows you to train and deploy algorithms without writing a single line of code, and start making predictions, identifying clusters, or extracting useful information about the features that are correlated in your data. Even without a thorough understanding of the underlying math, the tools for model diagnostics and refinement can be used to improve models and share recommendations.
Choosing a machine learning task in Dataiku DSS
I often use it myself to participate in data science competitions for fun (Kaggle, datascience.net), and even though I have no hope of being in the top half of the contestants, I consistently beat the algorithmic benchmarks.
Model diagnostic in Dataiku DSS - feature importance
Once again, in a data team environment, this can allow a business analyst to benchmark different types of algorithms, get a quick baseline model and, in some cases, issue recommendations (such as the impact of adding new data sources, need for additional data preparation / feature engineering…).
Quite often, enriching data can be done through joining datasets - essentially, retrieving columns from one dataset or tab into a reference dataset (vlookuping). This is a key element of any analysis but can quickly become a nightmare when you have several sources (both in terms of computation time and joining criterion).
A data science workflow in Dataiku DSS
In Dataiku DSS, blending data sources is simplified, either in Visual Analysis, using the “Join” or “Fuzzy Join” processors to retrieve data from other datasets, or with the dedicated Join recipe, where joining keys and criteria can be fine-tuned.
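To make the difference between an exact join and a fuzzy join concrete, here is a minimal Python sketch of a fuzzy left join built on the standard library's `difflib`. It only illustrates the concept and is not the algorithm behind the DSS processors. The `fuzzy_left_join` helper, the 0.8 similarity cutoff and the sample rows are assumptions.

```python
import difflib

def fuzzy_left_join(left, right, key, cutoff=0.8):
    """Left join two lists of dicts, tolerating small typos in the key."""
    # Index the reference rows by lowercased key for lookup.
    index = {row[key].lower(): row for row in right}
    joined = []
    for row in left:
        # Find the closest reference key above the similarity cutoff, if any.
        match = difflib.get_close_matches(
            row[key].lower(), list(index), n=1, cutoff=cutoff
        )
        extra = index[match[0]] if match else {}
        merged = dict(row)
        # Bring in the reference columns, keeping the left-hand key as-is.
        merged.update({k: v for k, v in extra.items() if k != key})
        joined.append(merged)
    return joined

# Hypothetical data: one city name is misspelled, one has no match.
sales = [
    {"city": "Marseille", "revenue": 120},
    {"city": "Marseile", "revenue": 80},
    {"city": "Lille", "revenue": 60},
]
regions = [
    {"city": "Marseille", "region": "PACA"},
    {"city": "Lyon", "region": "ARA"},
]
joined = fuzzy_left_join(sales, regions, key="city")
```

With an exact join, the misspelled "Marseile" row would come back empty; here it still picks up its region, while "Lille" correctly stays unmatched.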
Use cases that require geospatial analysis are numerous: optimising a network of rental agencies, mapping your competition, sizing a target market, etc. Dataiku DSS features a few processors that facilitate working with locations, most notably:
Retrieving a latitude and longitude from an address with the OpenStreetMap or Bing Maps API (requires token)
Enriching a Latitude/Longitude with administrative information (city, state, department…)
Geographic data visualization in Dataiku DSS (1/2)
The Dataiku DSS chart engine also packages the ability to draw scatter maps and heatmaps with various levels of aggregation possible.
Geographic data visualization in Dataiku DSS (2/2)
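One way to picture the "levels of aggregation" behind a heatmap is to bin points into grid cells and count them. The Python sketch below does that with a configurable cell size. It is a simplified assumption about the idea, not the DSS chart engine's implementation, and the coordinates are made up.

```python
import math
from collections import Counter

def grid_counts(points, cell_size_deg=0.1):
    """Count (lat, lon) points per square grid cell of the given size."""
    counts = Counter()
    for lat, lon in points:
        # Each cell is identified by its integer row and column index.
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        counts[cell] += 1
    return counts

# Hypothetical points: two in the same area, one far away.
points = [(48.853, 2.349), (48.861, 2.336), (45.764, 4.835)]
fine = grid_counts(points, cell_size_deg=0.1)
coarse = grid_counts(points, cell_size_deg=1.0)
```

A larger cell size gives a coarser map with fewer, denser cells; a smaller one shows more local detail.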
As a bonus, one of our data scientists recently created a plugin based on the Mapbox API that can calculate driving times between two points and draw isochrones around a location.
Extending Dataiku DSS features with plugins
If you are still reading, I'm pretty sure you are now eager to try Dataiku DSS. It's your lucky day! Click here to try Dataiku DSS for free.