
The main Dataiku DSS concepts

May 22, 2015

Projects

In Dataiku DSS, you organize datasets and associated tasks into separate projects:

Projects overview page in DSS

Projects help you to manage:

  • the functional separation of data assets and associated tasks
  • organizational security, thanks to per-project user access management.

Within a project, the items you work with are grouped into six “universes” that map to the main Dataiku concepts:

  • Within each universe, related items are organized graphically or in lists.
  • You access each universe by clicking the corresponding top-level icon in the navigation bar, right next to the project title.

A project home page in DSS, with the navigation bar marked out

Datasets

A Dataset (in the Dataiku sense) is a series of rows with the same data structure. The underlying data:

  • can reside on various storage systems (file system, SQL database, Hadoop, etc.) to which DSS is connected,
  • can have an associated file format (CSV, JSON, Hadoop file formats, etc.).

The Dataiku DSS Dataset Abstraction Layer allows users to access, visualize, and write the data in a unified way, regardless of the storage system.
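
To make the idea of an abstraction layer concrete, here is a minimal Python sketch. It is not the DSS API: the `Dataset` protocol, the `CsvDataset` and `SqlDataset` classes, and the `column_values` helper are all hypothetical names invented for illustration. It only shows how one consumer function can read rows the same way from a file-like source and from a SQL table.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator, List, Protocol


class Dataset(Protocol):
    """Hypothetical unified interface: every dataset yields rows as dicts."""

    def iter_rows(self) -> Iterator[Dict[str, str]]: ...


class CsvDataset:
    """Rows stored as CSV text (stands in for a file-system connection)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def iter_rows(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._text))


class SqlDataset:
    """Rows stored in a SQL table (stands in for a database connection)."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self._conn = conn
        self._table = table

    def iter_rows(self) -> Iterator[Dict[str, str]]:
        cursor = self._conn.execute(f'SELECT * FROM "{self._table}"')
        columns = [d[0] for d in cursor.description]
        for record in cursor:
            # Cast values to str so both backends expose the same row shape.
            yield {c: str(v) for c, v in zip(columns, record)}


def column_values(dataset: Dataset, column: str) -> List[str]:
    """Storage-agnostic consumer: works on any object with iter_rows()."""
    return [row[column] for row in dataset.iter_rows()]


# The same two customers, stored once as CSV and once in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "FR"), (2, "US")])

csv_ds = CsvDataset("id,country\n1,FR\n2,US\n")
sql_ds = SqlDataset(conn, "customers")
```

Calling `column_values(csv_ds, "country")` and `column_values(sql_ds, "country")` goes through the same code path even though the storage differs, which is the property the abstraction layer is meant to provide.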

Creating your first DSS dataset and learning how to cleanse it is the subject of the Basics Tutorial.

Flow & Recipes

In DSS, data manipulation (cleansing, aggregation, joining, etc.) is performed within Recipes, which take datasets as inputs and produce datasets as outputs.

The lineage of a dataset (or a model) is thus defined by the inputs and outputs of its ancestor recipes. The overall view of the dependency structure of a project is accessible in the Flow tab:

Predicting churn project flow

Knowing these dependencies lets the DSS engine minimize the number of data processes it launches when (re)building a dataset.
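
The role of the dependency graph can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the algorithm DSS actually uses: the `Recipe` tuple, the `FLOW` list, the dataset names, and the `build_plan` function are all hypothetical. Given a target dataset and the set of datasets already known to be up to date, it walks the lineage backwards and returns only the recipes that still need to run, in an order where inputs are built before outputs.

```python
from typing import List, NamedTuple, Set, Tuple


class Recipe(NamedTuple):
    name: str
    inputs: Tuple[str, ...]
    output: str


# Hypothetical flow, loosely inspired by a churn-prediction project.
FLOW = [
    Recipe("prepare_customers", ("customers_raw",), "customers_clean"),
    Recipe("prepare_orders", ("orders_raw",), "orders_clean"),
    Recipe("join", ("customers_clean", "orders_clean"), "customers_orders"),
    Recipe("train", ("customers_orders",), "churn_scores"),
]


def build_plan(target: str, flow: List[Recipe], up_to_date: Set[str]) -> List[str]:
    """Return the recipes to run, in dependency order, to (re)build `target`.

    Datasets listed in `up_to_date` are reused as-is, so their ancestor
    recipes are skipped.
    """
    producer = {recipe.output: recipe for recipe in flow}
    plan: List[str] = []
    visited: Set[str] = set()

    def visit(dataset: str) -> None:
        if dataset in visited or dataset in up_to_date:
            return
        visited.add(dataset)
        recipe = producer.get(dataset)
        if recipe is None:  # source dataset: no recipe produces it
            return
        for parent in recipe.inputs:
            visit(parent)
        plan.append(recipe.name)  # post-order: inputs are built first

    visit(target)
    return plan
```

With nothing up to date, `build_plan("churn_scores", FLOW, set())` schedules all four recipes; if `customers_clean` is already up to date, `prepare_customers` is left out of the plan.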

Lab - Visual Analysis

Data Science in real life is full of dirty data. Data tacklers’ daily tasks include:

  • cleansing the data,
  • creating features,
  • building visualizations,
  • creating and assessing multiple ML models.

Within the Lab, Dataiku DSS provides a dedicated module called Visual Analysis to quickly iterate over these tasks. Visual Analysis lets you experiment with your data in a code-free environment before deploying your work to the Flow, so that you can build your data-driven applications efficiently.

We strongly encourage you to follow the Basics Tutorial to see how fluid tackling data problems can be in a DSS Visual Analysis.

Lab - Notebooks

Some people prefer to do their analyses using code. Dataiku DSS ships with interactive development environments called Notebooks.

These can be used either:

  • to draft code in Python, R, Scala (Spark), or SQL (including Hive and Impala), or
  • to create advanced reports mixing text and complex visualizations using Python or R.

Users with web development skills can create advanced custom Web Apps using our dedicated editor and REST API. Templates and code samples are provided to help you get started. Head to the dedicated how-tos to learn more!

Jobs & Scenarios

The secret to efficiently taking advantage of your data assets lies in the ability to (re)play the full pipeline of an analysis and to always have up-to-date predictive scoring.

You monitor the associated tasks in the Jobs tab. Every time you build a dataset, Dataiku DSS creates a new Job with all the build dependency information defined in the Flow.

Scenarios help you automate these reconstruction tasks; for example, running daily updates to your models. Reports on scenarios that ran previously and their results are shown in Monitoring.

Dashboard & Insights

The DSS Dashboard is a communication tool to organize, share, or deliver the Insights on your data (charts, datasets, static reports, etc.).

Dashboard example: Descriptive Statistics on the Customers

On the Dashboard, the team structures their findings and the final data consumers get their updated summary.