
Big Data Architecture

Dataiku DSS Architecture

Pushing computation to your data

Any Dataiku DSS tool, whether a visual data manipulation recipe, a code recipe, guided machine learning, or a data visualization, can run on an in-cluster engine. Dataiku DSS leverages various technologies (Hive, Impala, Spark, MLlib, H2O…) to achieve this.


Machine learning engines

Machine learning algorithms can be trained and scored in a distributed fashion using these engines:

- Spark MLlib
- H2O
- Vertica Advanced Analytics
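
As an illustration of what distributed training and scoring look like with one of these engines, here is a minimal PySpark sketch using Spark MLlib. The input path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Load a dataset that already lives on the cluster (hypothetical HDFS path).
df = spark.read.parquet("hdfs:///data/training_set")

# Assemble feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

# Training runs distributed across the cluster's executors.
model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)

# Scoring is distributed as well.
predictions = model.transform(train)
predictions.select("label", "prediction").show(5)
```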

Data manipulation and visualization engines

Day-to-day usage

For everyone

It’s all magic: prepare your data and create machine learning models as you would for in-memory processing. Dataiku takes care of the plumbing to make it happen.


For coders

Create recipes using your favorite language: Hive, Impala, Pig, SparkR, PySpark, SparkSQL, or Spark Scala. You can also code in a notebook.
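
For example, a PySpark recipe typically reads its input dataset as a Spark DataFrame and writes a DataFrame back to its output dataset, so all computation runs in the cluster. This is a minimal sketch; the dataset and column names are hypothetical, and the exact API may vary with your DSS and Spark versions:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input dataset as a Spark DataFrame (names are hypothetical).
transactions = dataiku.Dataset("transactions")
df = dkuspark.get_dataframe(sqlContext, transactions)

# Any transformation here is executed by Spark, in the cluster.
per_customer = df.groupBy("customer_id").count()

# Write the result back to the output dataset.
output = dataiku.Dataset("transactions_per_customer")
dkuspark.write_with_schema(output, per_customer)
```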


Optimize your Spark jobs


Working with partitions

Optimize the speed of your computation with partitions on HDFS.
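
The underlying mechanism is directory-level partitioning: each value of the partitioning dimension maps to its own HDFS directory, so a job that filters on that dimension reads only the matching directories instead of scanning everything. A plain Spark sketch of the idea, with hypothetical paths and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-sketch").getOrCreate()

events = spark.read.json("hdfs:///raw/events")

# Write one HDFS directory per day: .../events_by_day/day=2024-01-01/...
events.write.partitionBy("day").parquet("hdfs:///data/events_by_day")

# A filter on the partition column prunes directories at read time.
one_day = (spark.read.parquet("hdfs:///data/events_by_day")
                .where("day = '2024-01-01'"))
print(one_day.count())
```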

Setup

Connect to Hadoop

The host running DSS should have client access to the cluster. Learn how to set up the Hadoop integration.
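
One quick way to verify that client access works is to run an HDFS command from the DSS host. The sketch below simply shells out to the `hdfs` CLI and assumes it is on the PATH:

```python
import subprocess

# List the HDFS root from the DSS host; a zero return code means the
# Hadoop client binaries and configuration are reachable from here.
result = subprocess.run(["hdfs", "dfs", "-ls", "/"],
                        capture_output=True, text=True)
if result.returncode == 0:
    print("HDFS client access OK")
    print(result.stdout)
else:
    print("HDFS client access failed")
    print(result.stderr)
```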


Spark integration

To set up your Spark environment and enable its integration in DSS, refer to the Spark setup section of the DSS documentation.