Connect To Data

Connect to your existing infrastructure

SQL databases

The list of supported SQL databases is available from our documentation on SQL datasets.

Detailed guides exist for some of these:

Hadoop HDFS

To connect, you will first need to configure Hadoop on your instance.

Detailed guides on specific Hadoop distributions and managed services:


Accessing cloud storage and databases

Cloud File Storage

Cloud Databases

  • Working with Redshift for running large-scale analyses is described in this howto. Syncing from S3 to Redshift is the most efficient route, and Dataiku DSS takes it whenever possible; see this page for details.
  • Google BigQuery is available through a JDBC driver developed by Simba.
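To illustrate why the S3 route is the efficient one: Redshift's documented bulk-load path from S3 is the COPY command, which loads files in parallel instead of inserting rows one at a time. The sketch below only builds such a statement as a string. The source does not say which statement DSS issues internally, and the function name, table, bucket and IAM role are all hypothetical placeholders.

```python
import re


def redshift_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that bulk-loads CSV files from S3.

    Illustrative sketch only: all argument values are assumed to be
    supplied by the caller; nothing here connects to Redshift.
    """
    # Accept only plain (optionally schema-qualified) identifiers.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?", table):
        raise ValueError(f"unsupported table name: {table!r}")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    # Reject single quotes rather than trying to escape them.
    if "'" in s3_uri or "'" in iam_role:
        raise ValueError("single quotes are not allowed in literals")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Hypothetical values, for illustration only.
statement = redshift_copy_sql(
    "analytics.events",
    "s3://example-bucket/exports/events/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftLoad",
)
```

The resulting statement would be executed through whatever SQL client is connected to the cluster; `IGNOREHEADER 1` assumes the exported CSV files carry a header row.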

Fetching data from remote sources

Data can be fetched over various protocols, with the resulting dataset cached on the filesystem.
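The fetch-then-cache idea can be sketched in a few lines of standard-library Python. This is not how DSS implements it; the function name `fetch_cached` and the cache layout (one file per URL, named by a hash of the URL) are assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: Path) -> Path:
    """Download `url` once and return the path of the local cached copy."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Derive a stable file name from the URL so repeated calls hit the cache.
    cached = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not cached.exists():
        # urlopen handles http://, https://, ftp:// and file:// URLs.
        with urllib.request.urlopen(url) as response, \
                tempfile.NamedTemporaryFile(dir=cache_dir, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
        # Move the finished download into place so that an interrupted
        # download never appears under the final cache name.
        Path(tmp.name).replace(cached)
    return cached
```

A second call with the same URL returns the cached file without touching the remote source, so refreshing the data means deleting the cached file first.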

File formats

Dataiku can read and write various file formats on file-based connections: filesystem, HDFS, Amazon S3, HTTP, FTP, SSH… The list of readable file formats also includes shapefiles.

Accessing data through plugins

Many applications, such as Google Sheets, Salesforce, Slack…, expose their data through APIs.

Dataiku DSS plugins can add custom connections that use these APIs, making it easy to define datasets that fetch data from a wide variety of applications.

See the available plugins or create your own plugin.
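As a rough picture of what an API-backed dataset does, the sketch below turns a paginated API into a stream of rows. It is plain Python and deliberately does not use the DSS plugin API; `iter_api_rows`, `fetch_page` and the cursor-based paging scheme are hypothetical names and assumptions, and `fake_fetch_page` stands in for a real HTTP call.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]
# A page is a list of rows plus the cursor of the next page (None when done).
Page = Tuple[List[Row], Optional[str]]


def iter_api_rows(
    fetch_page: Callable[[Optional[str]], Page],
    records_limit: int = -1,
) -> Iterator[Row]:
    """Yield rows page by page until the API reports no further page.

    A negative `records_limit` means "no limit".
    """
    cursor: Optional[str] = None
    emitted = 0
    while True:
        records, cursor = fetch_page(cursor)
        for record in records:
            if 0 <= records_limit <= emitted:
                return
            yield record
            emitted += 1
        if cursor is None:
            return


def fake_fetch_page(cursor: Optional[str]) -> Page:
    """Stand-in for a real API call: two pages of made-up records."""
    pages: Dict[Optional[str], Page] = {
        None: ([{"id": 1}, {"id": 2}], "p2"),
        "p2": ([{"id": 3}], None),
    }
    return pages[cursor]


all_rows = list(iter_api_rows(fake_fetch_page))
first_two = list(iter_api_rows(fake_fetch_page, records_limit=2))
```

Keeping the paging logic separate from the HTTP call makes it possible to test the row generation without network access; a real connector would replace `fake_fetch_page` with a function that calls the application's API.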