Any Dataiku DSS tool, whether a visual data manipulation recipe, a code recipe, guided machine learning, or a data visualization, can run on an in-cluster engine. Dataiku DSS leverages several technologies (Hive, Impala, Spark, MLlib, H2O…) to achieve this.
Big Data Architecture
Dataiku DSS Architecture
Pushing computation to your data
Machine learning engines
Machine learning algorithms can be run distributed both for training and for scoring using these engines:
Spark MLlib
H2O
Vertica Advanced Analytics
Data manipulation and visualization engines
- Visual and SQL recipes can use the Hive, Impala, and Spark engines.
- Data visualization can use Impala.
Day-to-Day Usage
For everyone
It’s all magic: prepare your data and create machine learning models just as you would for in-memory processing; Dataiku DSS takes care of the plumbing to make it happen.
For coders
Optimize your Spark jobs
- Reduce Spark engine overhead and avoid writing unnecessary intermediate datasets by using Spark pipelines
- Master Spark by reading the Spark tips and troubleshooting guides
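The benefit of pipelining can be sketched in plain Python. This is a conceptual analogy only (not the DSS or Spark API): chaining lazy transformations lets each row flow through every step without materializing the intermediate results, which is what a Spark pipeline does with intermediate datasets.

```python
# Conceptual analogy: lazy, fused transformations (as in a Spark pipeline)
# versus eagerly materializing every intermediate result.
# Plain Python, not the DSS or Spark API.

def eager(rows):
    # Each step materializes a full intermediate list ("dataset write").
    cleaned = [r.strip() for r in rows]
    upper = [r.upper() for r in cleaned]
    return [r for r in upper if r]

def pipelined(rows):
    # Generators fuse the steps: rows stream through all
    # transformations with no intermediate materialization.
    cleaned = (r.strip() for r in rows)
    upper = (r.upper() for r in cleaned)
    return [r for r in upper if r]

rows = ["  a ", "b", "  "]
assert eager(rows) == pipelined(rows) == ["A", "B"]
```

Both functions produce the same result; the pipelined version simply skips the intermediate copies, just as a Spark pipeline skips intermediate dataset writes.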
Working with partitions
Speed up your computations by partitioning your datasets on HDFS.
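On HDFS, a partitioned dataset is laid out as one directory per partition value, so a job can read only the partitions it needs instead of scanning everything. A small sketch of that Hive-style layout convention (pure Python; the dataset root and column names are hypothetical):

```python
# Sketch of the Hive-style partition layout used on HDFS:
# each partition value maps to a "column=value" directory, which
# lets engines prune the partitions a job does not need.
# The dataset root and column names here are hypothetical.

def partition_path(root, **partition_values):
    """Build the directory path for one partition of a dataset."""
    parts = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"{root}/{parts}"

path = partition_path("/data/events", year=2024, month="03")
# e.g. "/data/events/year=2024/month=03"
```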
Setup
Connect to Hadoop
The host running DSS should have client access to the cluster. Learn how to set up the Hadoop integration.
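Once client access is configured, a typical sequence for enabling the integration uses the dssadmin script shipped with DSS (assuming a standard DSS data directory; check the documentation for the exact options for your distribution):

```shell
# Run from the DSS data directory; DSS must be stopped first.
./bin/dss stop
./bin/dssadmin install-hadoop-integration
./bin/dss start
```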
Spark integration
To set up your Spark environment and enable its integration in DSS, please refer to this page of our documentation.
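As with the Hadoop integration, enabling Spark typically goes through the dssadmin script (a sketch assuming a standard DSS data directory; the documentation covers how to point it at your Spark distribution):

```shell
# Run from the DSS data directory; DSS must be stopped first.
./bin/dss stop
./bin/dssadmin install-spark-integration
./bin/dss start
```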