The Lab and the Flow

February 05, 2016

In DSS, data analysts work with datasets inside projects.

Within a project, there are two distinct parts:

  • The Lab
  • The Flow

The Lab

The Lab is where you experiment, iterate, analyze, and explore.

The lab icon

Work done in the Lab is “not in production”. It is iterative by nature and corresponds to the exploratory part of a data scientist's work.

DSS offers two kinds of environments to work in the Lab:

  • Visual analysis, a visual workspace where you can interactively prepare your data, visualize it, and create machine learning models
  • Code-based notebooks, for exploration and analysis of your data using Python, R, SQL, Hive or Impala.

Starting work in the Lab

To work in the Lab, click on the “Lab” button. The Lab button is available:

  • From the explore screen of a dataset
  • From the Flow, when you click on a dataset: the Lab button appears in the Actions sidebar
  • From the Flow, when you right-click on a dataset: the Lab button appears in the contextual menu

The Lab window opens:

The lab window

You can then either start a new visual analysis, start a new code notebook, or go back to an element you have already created.

Visual analysis

The Visual analysis is the main workspace for visual work in the Lab. It has three main functions:

The visual analysis tabs

  • Interactive data preparation, to prepare, clean, and enrich your data. For more information, see our Portal on data preparation.

  • Data visualization, based directly on the prepared data.

  • Training machine learning models, based directly on the prepared data. For more information, see our Portal on machine learning.

Code notebooks

Code notebooks let you explore and analyze your datasets through interactive code environments.

Python and R

Python and R support is provided via an integrated Jupyter Notebook within DSS.

The Flow

In DSS, persistent data manipulations (cleansing, aggregation, joining, etc.) are performed by recipes, which take datasets as inputs and produce datasets as outputs.

The lineage of a dataset (or a model) is thus defined by the inputs and outputs of its ancestor recipes. The overall dependency structure of a project is shown in the Flow tab:

The Flow
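To make the idea of lineage concrete, here is a minimal sketch in plain Python. It is not the DSS API: the `Recipe` class, the `lineage` function, and all dataset and recipe names are hypothetical and exist only for illustration. The sketch walks backwards from a dataset through the recipes that produce it, collecting every upstream dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Illustrative stand-in for a recipe: named inputs and outputs."""
    name: str
    inputs: tuple
    outputs: tuple


# Hypothetical project with three chained recipes.
RECIPES = [
    Recipe("prepare_orders", ("orders_raw",), ("orders_clean",)),
    Recipe("join_customers", ("orders_clean", "customers"), ("orders_enriched",)),
    Recipe("aggregate", ("orders_enriched",), ("revenue_by_customer",)),
]


def lineage(dataset, recipes):
    """Return the set of all datasets upstream of `dataset`."""
    # Map each output dataset to the recipe that produces it.
    producer = {out: r for r in recipes for out in r.outputs}
    upstream = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        recipe = producer.get(current)
        if recipe is None:
            continue  # a source dataset: nothing produces it
        for inp in recipe.inputs:
            if inp not in upstream:
                upstream.add(inp)
                stack.append(inp)
    return upstream


print(sorted(lineage("revenue_by_customer", RECIPES)))
```

In this toy project, the lineage of `revenue_by_customer` contains all four other datasets, while a source dataset such as `orders_raw` has an empty lineage.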

Knowing these dependencies lets the DSS engine minimize the number of processing jobs it has to launch when (re)building a dataset.
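The following toy sketch shows why dependency knowledge reduces work. It is an illustration only, not the algorithm DSS actually uses: the `plan_build` function, the `FLOW` dictionary, and the notion of an explicit `up_to_date` set are all assumptions made for this example. Given a target dataset, it returns only the recipes that must run, in order, and stops exploring upstream as soon as it reaches a dataset that is already up to date.

```python
# Hypothetical flow: recipe name -> (input datasets, output datasets).
FLOW = {
    "prepare_orders": (["orders_raw"], ["orders_clean"]),
    "join_customers": (["orders_clean", "customers"], ["orders_enriched"]),
    "aggregate": (["orders_enriched"], ["revenue_by_customer"]),
}


def plan_build(target, recipes, up_to_date):
    """Return the ordered list of recipe names needed to build `target`.

    `up_to_date` is the set of datasets assumed not to need rebuilding.
    """
    # Map each output dataset to the name of the recipe that produces it.
    producer = {
        out: name for name, (_, outputs) in recipes.items() for out in outputs
    }
    plan = []

    def visit(dataset):
        name = producer.get(dataset)
        # Skip source datasets, fresh datasets, and recipes already planned.
        if name is None or dataset in up_to_date or name in plan:
            return
        for inp in recipes[name][0]:
            visit(inp)
        plan.append(name)  # inputs are scheduled before the recipe itself

    visit(target)
    return plan


# Nothing is fresh: every recipe in the chain has to run.
print(plan_build("revenue_by_customer", FLOW, set()))
# orders_clean is fresh: prepare_orders is skipped.
print(plan_build("revenue_by_customer", FLOW, {"orders_clean"}))
```

Without the dependency graph, the only safe option would be to rerun every recipe; with it, the second call schedules two recipes instead of three.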

The Flow can be thought of as everything that is “in production” or “active” in DSS, as opposed to the “experimentation” done in the Lab.

The main building blocks of the Flow are:

  • Datasets
  • Recipes
  • Managed folders
  • Saved models