
The Main Dataiku DSS Concepts

May 04, 2018

Data Stack

Dataiku DSS is an on-premises/on-cloud software product (not SaaS) that operates as part of your data stack’s existing infrastructure.

A data stack typically includes development, production, and deployment environments; to work across these environments, a separate Dataiku DSS instance is installed in each.

Instances

A Dataiku DSS instance is an installation of the product to serve the needs of a particular environment:

  • The Design node instance, in the development environment, is used to create the pipelines that turn data into outputs: dashboards and reports, prepared data (used to build reports), and models
  • The Automation node instance, in the production environment, puts pipelines from the Design node into production to turn your enterprise data into the final outputs
  • The API node instance, in the deployment environment, makes model outputs from the Automation node available for use in real-time scoring
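As an illustration of the API node's role, here is a minimal sketch of querying a prediction endpoint with the dataikuapi Python client; the host URL, service ID, endpoint ID, and feature names are hypothetical placeholders, and the exact response format may vary with your DSS version.

    import dataikuapi

    # Hypothetical API node URL and service ID -- replace with your own deployment.
    client = dataikuapi.APINodeClient("https://api-node.example.com:12000", "fraud_service")

    # Score a single record in real time; endpoint ID and features are placeholders.
    prediction = client.predict_record(
        "fraud_endpoint",
        {"amount": 129.90, "country": "FR", "n_past_orders": 4},
    )
    print(prediction)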

Pipelines in the Design and Automation nodes are organized into projects, which can be accessed from the main page after logging in to the Dataiku DSS instance.

Projects overview page in Dataiku DSS

Projects

A Dataiku DSS project is a container for all your work on a particular activity. The project home acts as the command center from which you can see the overall status of a project, view recent activity, and collaborate through comments, tags, and a project to-do list.

Project home page

Datasets

A Dataiku DSS dataset is a tabular view into your data that allows you to access, visualize, and write data in the same way, regardless of the underlying storage system. You can connect to a variety of storage systems (file system, SQL databases, Hadoop, etc.) and file formats (CSV, JSON, Hadoop file formats, etc.).

Creating your first DSS dataset and learning how to cleanse it is the subject of the Basics Tutorial.
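To illustrate this uniform access, below is a minimal sketch using the dataiku Python package from a notebook or recipe running on a DSS instance; the dataset name "customers" is a hypothetical placeholder.

    import dataiku

    # "customers" is a hypothetical dataset name; the same call works whether the
    # dataset is stored as CSV files, in a SQL database, or on Hadoop.
    customers = dataiku.Dataset("customers")
    df = customers.get_dataframe()  # load the data as a pandas DataFrame

    print(df.shape)
    print(df.head())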

Flow & Recipes

A Dataiku DSS recipe is a set of actions to perform on one or more input datasets, resulting in one or more output datasets. Whenever you prepare, join, group, or otherwise transform your datasets, you do so through a recipe. A recipe can be either visual or code.

  • A visual recipe allows a quick and interactive transformation of the input dataset through a number of prepackaged operations available in a visual interface.
  • A code recipe allows a user with coding skills to go beyond visual recipe functionality to take control of the transformation using any supported language (SQL, Python, R, etc.).

Dataiku allows “coders” and “clickers” to seamlessly collaborate on the same project through code and visual recipes.
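For example, a Python code recipe might look like the following sketch, which reads one input dataset, transforms it with pandas, and writes one output dataset; the dataset and column names are hypothetical.

    import dataiku

    # Hypothetical input dataset declared in the recipe's settings.
    orders = dataiku.Dataset("orders").get_dataframe()

    # Any pandas transformation can go here; this aggregation is just an illustration.
    orders_by_customer = (
        orders.groupby("customer_id", as_index=False)
              .agg(total_spent=("amount", "sum"), n_orders=("order_id", "count"))
    )

    # Write the result (and its schema) to the recipe's output dataset.
    dataiku.Dataset("orders_by_customer").write_with_schema(orders_by_customer)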

The lineage of a dataset (or a model) is thus defined by the inputs and outputs of its ancestor recipes. The Flow is a visual representation of your work as a set of dependencies between datasets and the recipes used to produce them.

Project flow

Knowing these dependencies helps the Dataiku DSS engine minimize the number of data processes launched when (re)building a dataset.

Lab - Visual Analysis

The Visual Analysis lab allows you to experiment with your data in a code-free environment where you can:

  • Perform interactive analysis with built-in charts and data preparation (cleaning, filtering, enriching). These steps can be deployed to the Flow as Prepare recipes.
  • Use machine learning algorithms (unsupervised and supervised training) to generate insights and build predictive models. These models can be deployed to the Flow.

We strongly encourage you to follow the Basics Tutorial to discover how fluidly you can tackle data problems within a DSS visual analysis.

Lab - Code

The code lab allows you to experiment with your data in Jupyter notebooks (for Python or R) or SQL notebooks (for SQL databases, Hive, or Impala). You can perform interactive analysis in these notebooks and then deploy them to the Flow as code recipes.

You can also use the code lab to create interactive web apps with R Shiny, Python Bokeh, or JavaScript. Templates and code samples are provided to help you get started. Web apps can be shared on dashboards.
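As an illustration, the body of a small Bokeh-backed web app could look like the sketch below; it uses only standard Bokeh calls plus the dataiku package, and the dataset and column names are hypothetical placeholders.

    import dataiku
    from bokeh.plotting import figure
    from bokeh.io import curdoc

    # "orders_by_customer" is a hypothetical dataset name.
    df = dataiku.Dataset("orders_by_customer").get_dataframe()

    p = figure(title="Orders per customer",
               x_axis_label="n_orders", y_axis_label="total_spent")
    p.scatter(df["n_orders"], df["total_spent"])

    # In a Bokeh web app, the root of the current document is what gets displayed.
    curdoc().add_root(p)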

The code lab also lets you create advanced R Markdown reports that mix text, code, and complex visualizations using Python and R. These reports can be shared on dashboards or distributed in various printable formats.

Jobs & Scenarios

Jobs are created when you build a dataset. Dataiku DSS provides a full job log to let you monitor what works and what does not, along with the ability to debug potential errors.

Scenarios help you automate reconstruction tasks; for example, running daily updates of your models so that predictive scores stay up to date. Reports on previously run scenarios and their results are available in Monitoring.
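As a sketch of how such automation can also be driven programmatically, the dataikuapi client can trigger a scenario from outside DSS; the host, API key, project key, and scenario ID below are hypothetical placeholders, and the exact client methods may vary with your DSS version.

    import dataikuapi

    # Hypothetical connection details -- replace with your Automation node URL and an API key.
    client = dataikuapi.DSSClient("https://automation-node.example.com:11200", "my-api-key")

    project = client.get_project("CUSTOMER_SCORING")   # hypothetical project key
    scenario = project.get_scenario("daily_rebuild")    # hypothetical scenario ID

    # Start the scenario; inside DSS, a time-based trigger would typically do this daily.
    scenario.run()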

Dashboard & Insights

The Dashboard is a communication tool to organize, share, and deliver the Insights of your data project. Insights can include any Dataiku DSS object, such as charts, datasets, web apps, and reports.

Dashboard example: Descriptive Statistics on the Customers