howto

Prepare recipe - using a custom Python function as a processor

June 27, 2017

The Python function processor allows you to perform complex row-wise operations on your dataset. The output of the function depends on the mode used.

Cell mode

In cell mode, Dataiku expects a single column as output. This is useful for finding the largest value across columns, or the sum of a set of columns whose names follow some pattern.

In the example below, the process function finds the longest string value across the columns in the dataset.

def process(row):
    # In 'cell' mode, the process function must return
    # a single cell value for each row
    # The 'row' argument is a dictionary of columns of the row
    max_len = -1
    longest_str = None
    for val in row.values():
        if val is None:
            continue
        if len(val) > max_len:
            max_len = len(val)
            longest_str = val
    return longest_str

Row mode

In row mode, Dataiku expects a row as output, which replaces the existing row. This is useful when you need to perform operations in place on the rows (as in the image below), or when you need to add multiple rows to the dataset (such as if you compute the mean and standard deviation of values in a row across columns).

In the example below, the process function adds 5 to the integer-valued columns. Note that the row parameter passed to the process function is a dictionary of the string values, so we use the ast package to evaluate whether they are integer-valued.

import ast
def process(row):
    # In 'row' mode, the process function must return the full row.
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    for i in row.keys():
        try:
            isint = type(ast.literal_eval(row[i])) is int
        except:
            isint = False
        if isint:
            row[i] = int(row[i]) + 5
    return row

Rows mode

In rows mode, Dataiku expects one or more rows as output, which replace the existing row. This is especially useful when you need to transform your data from wide to long format.

In the example below, the process function splits each row into two. Where before a single row contained both a person’s work and home zip code and state, now each row contains either the home or work information, along with a new column that indicates whether it is the home or work information.

# Modify the process function to fit your needs
def process(row):
    # In 'multi rows' mode, the process function
    # must return an iterable list of rows.

    ret = []

    home = {"name": row["name"],
            "type": "home",
            "zip": row["home_zip"],
            "state": row["home_state"]}

    work = {"name": row["name"],
            "type": "work",
            "zip": row["work zipcode"],
            "state": row["work state"]}

    ret.append(home)
    ret.append(work)

    return ret