
Coding Python in Dataiku DSS

February 01, 2017

Python code can be written in a code recipe or a notebook.

Data processing

In-memory

You can read a dataset as a pandas DataFrame, or use an iterator to stream its rows. Similarly, you can write a complete DataFrame at once or push one row at a time. See the documentation on how to process data in-memory.
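The two access styles can be sketched as follows. This is a minimal sketch under stated assumptions, not reference code: the dataset names `orders` and `orders_with_total`, the columns `price` and `quantity`, and the helpers `add_total`, `stream_rows` and `run_in_dss` are all hypothetical. The `dataiku` package is only importable inside DSS, so it is imported lazily and the row logic lives in plain functions that can be tried anywhere.

```python
def add_total(row):
    """Per-row transformation on a dict-like row (hypothetical columns)."""
    price = float(row["price"])
    quantity = int(row["quantity"])
    return {"price": price, "quantity": quantity, "total": price * quantity}


def stream_rows(rows, write_row):
    """Push rows one at a time, so the full dataset never sits in memory."""
    count = 0
    for row in rows:
        write_row(add_total(row))
        count += 1
    return count


def run_in_dss(input_name="orders", output_name="orders_with_total",
               streaming=False):
    """Run inside DSS only; the dataset names are assumptions."""
    import dataiku  # available in DSS recipes and notebooks

    source = dataiku.Dataset(input_name)
    target = dataiku.Dataset(output_name)

    if not streaming:
        # Whole dataset as a pandas DataFrame, written back in one call.
        df = source.get_dataframe()
        df["total"] = df["price"] * df["quantity"]
        target.write_with_schema(df)
        return len(df)

    # Row by row: declare the output schema, then stream through a writer.
    target.write_schema([
        {"name": "price", "type": "double"},
        {"name": "quantity", "type": "bigint"},
        {"name": "total", "type": "double"},
    ])
    with target.get_writer() as writer:
        return stream_rows(source.iter_rows(), writer.write_row_dict)
```

The DataFrame path is usually the simplest choice when the dataset fits in memory; the streaming path trades convenience for a bounded memory footprint.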

In-database

You can generate complex SQL queries in Python and then execute them in the database. If necessary, you can also retrieve the results as a pandas DataFrame. See the documentation on how to use SQL from Python.
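As an illustration of generating a query in Python and then running it, here is a hedged sketch. The table and column names (`orders`, `country`, `amount`) and the helpers `build_total_query` and `run_in_dss` are hypothetical, and the sketch assumes the `SQLExecutor2` class from the `dataiku` package with its `query_to_df` method. The physical table name behind a dataset may differ from the dataset name, so check it before reusing this.

```python
def build_total_query(table, group_col, value_col, min_total=None):
    """Build an aggregate query as a string; all names are hypothetical."""
    for name in (table, group_col, value_col):
        # Identifiers cannot be bound as parameters, so validate them strictly.
        if not name.replace("_", "").isalnum():
            raise ValueError("unsafe identifier: %r" % name)
    sql = 'SELECT "{g}", SUM("{v}") AS total FROM "{t}" GROUP BY "{g}"'.format(
        g=group_col, v=value_col, t=table
    )
    if min_total is not None:
        # float() guarantees that only a number is interpolated here.
        sql += ' HAVING SUM("{v}") >= {m}'.format(v=value_col, m=float(min_total))
    return sql


def run_in_dss(dataset_name="orders"):
    """Run inside DSS only; assumes an SQL dataset named `orders`."""
    import dataiku
    from dataiku import SQLExecutor2

    dataset = dataiku.Dataset(dataset_name)
    # The executor runs queries on the connection that hosts the dataset.
    executor = SQLExecutor2(dataset=dataset)
    query = build_total_query("orders", "country", "amount", min_total=100)
    # The aggregation happens in the database; only the result is fetched.
    return executor.query_to_df(query)
```

Keeping query construction in a separate pure function makes the generated SQL easy to inspect and test before it ever reaches the database.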

In-cluster

To process your data on a Hadoop cluster, you can use Spark through Python (PySpark). Take a look at how to code within a PySpark recipe. Note that you can also use PySpark in a notebook.
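A PySpark recipe typically reads a dataset as a Spark DataFrame, transforms it, and writes the result back. The sketch below assumes the `dataiku.spark` helper module with `get_dataframe` and `write_with_schema`; the dataset names `orders` and `large_orders`, the `amount` column, and the helpers `min_amount_filter` and `run_in_dss` are hypothetical. Both `dataiku` and `pyspark` are imported lazily because they are only available in a Spark-enabled DSS environment.

```python
def min_amount_filter(column, threshold):
    """Build a Spark SQL filter expression (hypothetical helper)."""
    if not column.replace("_", "").isalnum():
        raise ValueError("unsafe column name: %r" % column)
    return "`{c}` >= {t}".format(c=column, t=float(threshold))


def run_in_dss(input_name="orders", output_name="large_orders"):
    """Run inside a DSS PySpark recipe or notebook only."""
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sql_context = SQLContext(sc)

    # Read the input dataset as a Spark DataFrame (distributed, lazy).
    df = dkuspark.get_dataframe(sql_context, dataiku.Dataset(input_name))

    # DataFrame.filter accepts a SQL expression string.
    result = df.filter(min_amount_filter("amount", 100))

    # Write the result and let DSS derive the output schema from it.
    dkuspark.write_with_schema(dataiku.Dataset(output_name), result)
```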

Starter code

Dataiku provides many code snippets to start from:

  • Whenever you are coding, there is a Sample Code button on the top right of the editor with a list of code snippets. You can also add your own!
  • When you create a notebook, you can start from a predefined notebook with template code or create your own.
  • Within the guided visual machine learning, you can generate a notebook from a model. This gives you starter code that you can tweak further.

Advanced topics

More on Data Access and Processing

Read the full internal Python API documentation

Environment

DSS public API

The DSS public API allows you to interact with DSS from any external system. It allows you to perform a large variety of administration and maintenance operations, in addition to accessing datasets and other data managed by DSS.

For example, you can administer Dataiku DSS using the Python client.
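As a rough illustration of the Python client, the sketch below collects a small inventory of an instance. It assumes the `dataikuapi` package and its `DSSClient` class with the `list_project_keys` and `list_users` methods; the helpers `connect` and `describe_instance` are hypothetical, and the host and API key are placeholders you would replace with your own. Listing users generally requires an API key with administration rights.

```python
def connect(host, api_key):
    """Create a client for a DSS instance; both arguments are placeholders."""
    import dataikuapi  # installed separately on the external system

    return dataikuapi.DSSClient(host, api_key)


def describe_instance(client):
    """Collect project keys and user logins from a DSS client object."""
    return {
        "projects": sorted(client.list_project_keys()),
        # list_users() is assumed to return dicts with a "login" entry.
        "logins": sorted(user["login"] for user in client.list_users()),
    }
```

From an external machine, a call such as `describe_instance(connect(host, api_key))` would return a dictionary with the project keys and user logins of the instance. Passing the client in as an argument keeps the inventory logic testable without a live DSS instance.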