In DSS, data analysts manipulate and interact with datasets within projects.
Within a project, there are two distinct parts:
- The Lab
- The Flow
The Lab is where you experiment, iterate, analyze, explore.
Work done in the Lab is "not in production". It is iterative by nature and represents the exploratory part of a data scientist's work.
DSS offers two kinds of environments to work in the Lab:
- Visual analysis, a visual workspace where you can interactively prepare your data, visualize it, and create machine learning models
- Code-based notebooks, for exploration and analysis of your data using Python, R, SQL, Hive or Impala.
Starting work in the Lab
To work in the Lab, click on the "Lab" button. The Lab button is available:
- From the explore screen of a dataset
- From the Flow, when you click on a dataset: the Lab button appears in the Actions sidebar
- From the Flow, when you right-click on a dataset: the Lab button appears in the contextual menu
The Lab window opens.
You can then either start a new visual analysis, start a new code-based notebook, or go back to an already-created element.
The visual analysis is the main workspace for visual work in the Lab. It has three main functions:
- Interactive data preparation, to prepare, clean, and enrich your data. For more information, see our Quickstart on data preparation.
- Data visualization, based directly on the prepared data.
- Training machine learning models, based directly on the prepared data. For more information, see our Quickstart on machine learning.
Code notebooks let you explore and analyze your datasets through interactive code environments.
Python and R
Python and R support is provided via an integrated Jupyter Notebook within DSS.
In DSS, persistent data manipulations (cleansing, aggregation, joining, etc.) are performed within recipes, which take datasets as inputs and outputs.
The lineage of a dataset (or a model) is thus defined by the inputs and outputs of its ancestor recipes. The overall view of a project's dependency structure is accessible in the Flow tab.
Knowing these dependencies helps the DSS engine minimize the number of data processing jobs it has to launch when (re)building a dataset.
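To make the idea of lineage and minimal rebuilds concrete, here is a small, self-contained Python sketch. It is not the DSS engine and does not use any DSS API: the `Recipe` class, the `lineage` and `recipes_to_run` functions, the dataset names, and the `out_of_date` set are all hypothetical, and the rule "run a recipe only if one of its outputs is out of date or an upstream recipe had to run" is a simplifying assumption. The Flow is also assumed to be acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical stand-in for a Flow recipe: datasets in, datasets out."""
    name: str
    inputs: tuple
    outputs: tuple


def producer_of(recipes):
    """Map each output dataset to the recipe that produces it."""
    return {out: recipe for recipe in recipes for out in recipe.outputs}


def lineage(dataset, recipes):
    """Return every upstream dataset of `dataset`, found via ancestor recipes."""
    producers = producer_of(recipes)
    ancestors = set()
    stack = [dataset]
    while stack:
        recipe = producers.get(stack.pop())
        if recipe is None:
            continue  # source dataset: no recipe produces it
        for parent in recipe.inputs:
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors


def recipes_to_run(target, recipes, out_of_date):
    """List the recipe names to run, upstream first, to rebuild `target`.

    Assumption for this sketch: a recipe runs only if one of its outputs
    is in `out_of_date` or if one of its upstream recipes had to run.
    """
    producers = producer_of(recipes)
    decided = {}  # recipe name -> whether it must run
    plan = []

    def visit(dataset):
        recipe = producers.get(dataset)
        if recipe is None:
            return False  # source datasets are never built by a recipe
        if recipe.name not in decided:
            # Build a list (not a generator) so every input is visited.
            upstream_ran = [visit(parent) for parent in recipe.inputs]
            stale = any(out in out_of_date for out in recipe.outputs)
            decided[recipe.name] = stale or any(upstream_ran)
            if decided[recipe.name]:
                plan.append(recipe.name)  # appended after its ancestors
        return decided[recipe.name]

    visit(target)
    return plan


# A made-up Flow with four recipes.
FLOW = [
    Recipe("prepare_orders", ("orders_raw",), ("orders_clean",)),
    Recipe("prepare_customers", ("customers_raw",), ("customers_clean",)),
    Recipe("join", ("orders_clean", "customers_clean"), ("orders_enriched",)),
    Recipe("aggregate", ("orders_enriched",), ("revenue_by_country",)),
]

if __name__ == "__main__":
    print(sorted(lineage("orders_enriched", FLOW)))
    print(recipes_to_run("revenue_by_country", FLOW, {"customers_clean"}))
```

In this made-up Flow, only `customers_clean` is marked out of date, so the plan for `revenue_by_country` skips `prepare_orders` and runs just `prepare_customers`, `join`, and `aggregate`, in that order.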
The Flow can be considered as everything that is "in production" or "active" in DSS, as opposed to the "experimentation" done in the Lab.
The main building blocks of the Flow are thus:
- Datasets
- Recipes
- Managed folders
- Saved models