Big Data Architecture

Dataiku DSS Architecture

Pushing computation to your data

Any Dataiku DSS tool, whether a visual data manipulation recipe, a code recipe, guided machine learning, or a data visualization, can run on an in-cluster engine. Dataiku DSS leverages various technologies (Hive, Impala, Spark, MLlib, H2O…) to achieve this.

Machine learning engines

Machine learning algorithms can run distributed, for both training and scoring, using these engines:

Spark Machine Learning Library (MLlib)


Vertica Advanced Analytics

Data manipulation and visualization engines

Day-to-Day Usage

For everyone

It’s all magic: prepare your data and create machine learning models just as you would for in-memory processing. Dataiku takes care of the plumbing to make it happen in-cluster.

For coders

Create recipes using your favorite language: Hive, Impala, Pig, SparkR, PySpark, SparkSQL, or Spark Scala. You can also code in a notebook.

Optimize your Spark jobs
  • Reduce Spark engine overhead and unnecessary intermediate dataset writes thanks to Spark pipelines
  • Become a Spark master by learning Spark tips and troubleshooting
Working with partitions

Optimize the speed of your computation with partitions on HDFS.
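On HDFS, each partition of a dataset maps to its own subdirectory, so a job can read or rebuild a single partition without touching the rest. A hypothetical helper sketching that layout (the path pattern is illustrative only, not DSS's actual partitioning configuration syntax):

```python
from datetime import date

def partition_path(root: str, day: date) -> str:
    """Map a day-partitioned dataset to an HDFS-style subdirectory.

    Illustrative layout only; in DSS the mapping is configured with a
    partitioning pattern on the dataset, not with hand-built paths.
    """
    return f"{root}/{day:%Y/%m/%d}"

# Each day's data lives in its own directory, so recomputing one day
# rewrites only that directory.
p = partition_path("/data/events", date(2024, 1, 15))
# p == "/data/events/2024/01/15"
```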


Connect to Hadoop

The host running DSS should have client access to the cluster. Learn how to set up the Hadoop integration.

Spark integration

To set up your Spark environment and enable its integration in DSS, please refer to this page of our documentation.