This article is partially based on one of our blog posts.
Dataiku DSS covers a wide variety of data transformation and manipulation tasks.
It offers a map of the data science world built from simple, recognizable visual elements.
Let's look at the everyday job of data workers:
- They load their data into various databases
- They create processing operations that overlap each other
- They create displays and share them with business experts for validation
- They create models and put them into production
- They monitor the predictive application
The entire visual grammar of DSS is designed to reflect these business operations through structured graphics.
For example, let's look at the following application that predicts a churn score for telecom users. It is represented by a pipeline of datasets:
In yellow, we have the visual transformation steps. The icons illustrate the nature of this transformation (join, group, preparation, etc.). In this case, they are all the same: preparation.
In blue, we have the datasets. The icons illustrate the nature of the storage in which the associated data lives (Hadoop, SQL database, filesystem, remote FTP, etc.).
In green, we have the machine learning elements. The icons represent the step (model training, prediction or scoring).
Note that transformations are drawn as circles, while persistable elements (data, models) are drawn as squares (or diamonds, like Diamond Shreddies?).
Here's another pipeline for a log analysis application:
This time the dominant color is orange: the color of code. The icons of these transformations tell us that the languages used are Pig (the little pig's head) and Hive (the beehive). The datasets live on Hadoop (the elephant).
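To make the conventions above concrete, here is a minimal sketch of the legend as a lookup table. It only restates the colors and shapes described in this article. The names `COLOR_FAMILY`, `SHAPE_KIND` and `describe` are hypothetical and are not part of any DSS API.

```python
# Illustrative legend for the pipeline conventions described in the article.
# All names are hypothetical; this is not DSS code.

COLOR_FAMILY = {
    "yellow": "visual",
    "orange": "code",
    "blue": "data",
    "green": "machine learning",
}

SHAPE_KIND = {
    "circle": "transformation",
    "square": "persistable element",
}


def describe(color: str, shape: str) -> str:
    """Read a pipeline item from its color and its shape."""
    return f"{SHAPE_KIND[shape]} [{COLOR_FAMILY[color]}]"


print(describe("yellow", "circle"))
print(describe("blue", "square"))
```

Reading an item then takes two lookups: the shape says whether it transforms or stores something, and the color says which family it belongs to.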
Now let's select one of the datasets in the pipeline. A bar populated with icons appears on the right.
Here, there are several colors:
In yellow are the data recipes called visual recipes. They perform the most common transformations through a graphical user interface, without any coding.
In orange are the different languages that can be used within the studio. Python and R are shown in dark orange, which means they are active. The selected dataset stores its data on the filesystem, so we cannot use SQL, Hive or Pig unless we copy this data into an appropriate database. Had the data been in Hadoop, all languages would have been active.
In red are the items that can be used for communication, such as charts that can be put on a dashboard.
At the top, the gray icons present the most standard actions related to this dataset: explore, download, (re)build, etc.
The lab icon is for visual or code-based exploration and analysis.
The bar on the right is available throughout the studio and guides users by listing the actions available for the item they are viewing. It makes everyday work easier by filtering out choices that would be inappropriate, and it guides new users while they build their first predictive application.
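The filtering behavior described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article (filesystem data activates Python and R, Hadoop data activates every language). The names `ALL_LANGUAGES`, `ACTIVE_BY_STORAGE` and `language_states` are hypothetical, and the real logic inside DSS may differ.

```python
# Hypothetical model of the "active language" rule described in the article.
# Only the two storage types the article mentions are covered; not DSS code.

ALL_LANGUAGES = ["Python", "R", "SQL", "Hive", "Pig"]

ACTIVE_BY_STORAGE = {
    "filesystem": {"Python", "R"},
    "hadoop": {"Python", "R", "SQL", "Hive", "Pig"},
}


def language_states(storage: str) -> dict:
    """Map each language to True (active) or False (grayed out)."""
    if storage not in ACTIVE_BY_STORAGE:
        raise ValueError(f"storage type not covered by this sketch: {storage}")
    active = ACTIVE_BY_STORAGE[storage]
    return {lang: lang in active for lang in ALL_LANGUAGES}


for storage in ("filesystem", "hadoop"):
    print(storage, language_states(storage))
```

The point of the design is that the user never has to remember this table: the bar applies it and shows only what makes sense for the selected dataset.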
Beyond the data pipeline, the objects and tools of Dataiku DSS are organized into worlds that follow the same color conventions. Each world is represented by an icon in the top bar, next to the project name:
Here, we find:
- The world of the data pipeline with transformation recipes.
- The world of datasets.
- The world of visual analyses, in which predictive models are built.
- The world of code notebooks (Python, R, SQL).
- The world of application monitoring (a new color!).
- The world of dashboards, where the items shared between team members for communication are stored.
And there you go. It's just as colorful as a subway map, but everyone can find their way without having the station list in front of them!
Happy travels with Dataiku DSS! A free download is available here.