howto

Become a Master of Dataiku DSS Formulas

Applies to DSS 2.3 and above | February 10, 2016

DSS includes a language to write formulas, much like a spreadsheet.

This How-To provides concrete examples of formulas that can help you in addition to the Reference documentation.

Formulas can be used:

• In data preparation, to create new columns, filter rows, or flag rows

• More generally, to filter rows in many places in DSS:

• In the Filtering recipe, to filter rows
• In Machine Learning, to define the extracts to use for train and test set
• In the Python and Javascript APIs, to obtain partial extracts from the datasets
• In the Public API, to obtain partial extracts from the datasets
• In the Group, Window, Join and Stack recipes, to perform pre and post filtering

Basic usage

Formulas define an expression that is applied row by row.

Assuming that you have a dataset with columns N1 (numeric), N2 (numeric), and S (string), here are a few example formulas:

• 2 + 2
• N1 + N2
• min(N1, N2) # Returns the smaller of N1 and N2
• replace(S, 'old', 'new') # Returns the value of S with ‘old’ replaced by ‘new’
• if (N1 > N2, 'big', 'small') # Returns big is N1 > N2, small otherwise

We recommend that you start by reading the introduction paragraphs of the reference documentation before continuing this How-To.

The reference guide has information about typing, arrays, objects, …

Fun with arrays and objects

DSS lets you manipulate columns containing arrays ([0,1,2]) or objects ({"firstname": "John", "lastname" :"Smith"}).

Arrays and objects are represented using JSON notation. The formula language includes a lot of tools to manipulate arrays and objects.

In the following section, we’ll assume we have an input that looks like that:

client_detailsrequest_detailsstranuma
"visitor_id" : "bdc456ef"}
{"timings" : [1232, 2034, 2351],
"page_data":
[{"url" :"/u1", "rank": 1},
{"url" : "/u2", "rank" :2},
{"url" : "/u3", "rank" : 3},
]
["Big", "Medium", "Small"]
[0, 1, 2, 5]

Access elements

Formulas support accessing array elements and object keys using the traditional Python/Javascript syntax:

• array
• object["key"]
• object.key (only valid if ‘key’ is a valid identifier, i.e. matches [A-Za-z0-9_]*)

Important notes: array and object columns are not parsed by default in DSS. They are only automatically parsed in functions that use arrays. However, index and key access are not functions, so we’ll need to use parseJson to convert our columns to objects.

You’ll find more information in the Reference documentation for formulas

Thus:

• parseJson(request_details)["page_data"]: Extract the page_data array
• parseJson(request_details)["page_data"]: Extract the second element of the page_data array
• parseJson(request_details)["page_data"]["url"]: Extract the URL of the second element of the page_data array

Use ‘with’ to simplify an expression

The with control lets you bind a variable to an expression, to simplify an expression. For example, if we wanted to sum the first two ranks, we’d need: parseJson(request_details)["page_data"]["rank"] + parseJson(request_details)["page_data"]["rank"].

But using with, it simply becomes:

# Alias parseJson(request_details)["page_data"] as "pages"
with(parseJson(request_details)["page_data"], pages,
pages["rank"] + pages["rank"]
)

Filter an array

The filter control allows you to filter array elements based on a predicate:

The syntax is : filter(array a, variable v, expression e) (returns: array)

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.

Examples:

• filter(stra, v, v.length == 3): Returns [“Big”]
• filter(numa, v, v < 3): Returns [0, 1, 2]

Since filter is an array-expecting function, we don’t need to parseJson here.

A bigger example: we want expression that returns the number of pages with rank >= 2.

# Here we need parseJson because we access an item of the object
length(filter(parseJson(request_details)["page_data"], v, v.rank >= 2))

# Using object notation (but not for filter)
filter(parseJson(request_details)["page_data"], v, v.rank >= 2).length()

Apply a function

The formula language includes the forEach and forEachIndex functions to return a new array after applying a function to each element (like the Javascript “map”).

The syntax is: forEeach(array a, variable v, expression e) (returns: array)

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.

An example use case: if you have an array A ["1", "5", "12"] and want the sum. The sum() function only accepts numbers. We can use forEach this way:

sum(forEach(A, v, v.toNumber()))

# With object notation. Remember: filter and forEach can't use object notation
forEach(A, v, v.toNumber()).sum()

A complete example

Let’s say we want to return the length of the concatenation with ‘–’ of all URLs for which rank is 2

(ie, the length of “/u2–/u3”)

We therefore need to:

• Access the page data array
• Filter the array
• Apply a function to get an array of URLs
• Join the array
• Get its length
# First way to write, using functions
# We choose to use 'with' to reduce a bit the size of the formula

with(parseJson(request_details)["page_data"], pd,

length( join( forEach( filter(pd, f, f.rank == 2), e, e["url"] ) , "--") )

)

# Using object notation:
# Remember: filter and forEach can't use object notation

with(parseJson(request_details)["page_data"], pd,
forEach(
filter(pd, f, f.rank == 2),
e, e.url
)
).join("--").length()

Dates handling

In addition to the vast date handling features of DSS data preparation, formulas add additional date processing facilities.

First of all, a note about typing:

• Date-typed columns are not automatically parsed as “date” types in formulas (i.e., they are kept as strings)
• However, all functions that expect a date argument will automatically try to convert from the regular Date format in DSS (2015-01-02T03:04:05.234Z)

In other words, if your column is already recognized as a valid Date meaning, you should generally not need to use any specific parsing. If you happen to need one, use the asDate() function

Other useful formulas

Use formula to ‘fill blanks from another column’

namealt_name
Z
A1
E2 C3

To fill the “blanks” of “name” with the value of “alt_name”, create a Formula processor in data preparation.

Set the output column to be “name” (to overwrite the “name” column) and use this formula:

if (isBlank(name), alt_name, name)