DSS includes a language to write formulas, much like a spreadsheet.
This How-To provides concrete examples of formulas that can help you in addition to the Reference documentation.
Formulas can be used:
In data preparation, to create new columns, filter rows, or flag rows
More generally, to filter rows in many places in DSS:
Formulas define an expression that is applied row by row.
Assuming that you have a dataset with columns N1 (numeric), N2 (numeric), and S (string), here are a few example formulas:
2 + 2
N1 + N2
min(N1, N2)
# Returns the smaller of N1 and N2replace(S, 'old', 'new')
# Returns the value of S with ‘old’ replaced by ‘new’if (N1 > N2, 'big', 'small')
# Returns big is N1 > N2, small otherwiseWe recommend that you start by reading the introduction paragraphs of the reference documentation before continuing this How-To.
The reference guide has information about typing, arrays, objects, …
DSS lets you manipulate columns containing arrays ([0,1,2]
) or objects ({"firstname": "John", "lastname" :"Smith"}
).
Arrays and objects are represented using JSON notation. The formula language includes a lot of tools to manipulate arrays and objects.
In the following section, we’ll assume we have an input that looks like that:
client_details | request_details | stra | numa |
---|---|---|---|
{"ip_address" : "1.2.3.4", "visitor_id" : "bdc456ef"} |
{"timings" : [1232, 2034, 2351], "page_data": [{"url" :"/u1", "rank": 1}, {"url" : "/u2", "rank" :2}, {"url" : "/u3", "rank" : 3}, ] |
["Big", "Medium", "Small"] |
[0, 1, 2, 5] |
Formulas support accessing array elements and object keys using the traditional Python/Javascript syntax:
array[0]
object["key"]
object.key
(only valid if ‘key’ is a valid identifier, i.e. matches [A-Za-z0-9_]*
)Important notes: array and object columns are not parsed by default in DSS. They are only automatically parsed in functions that use arrays.
However, index and key access are not functions, so we’ll need to use parseJson
to convert our columns to objects.
You’ll find more information in the Reference documentation for formulas
Thus:
parseJson(request_details)["page_data"]
: Extract the page_data arrayparseJson(request_details)["page_data"][1]
: Extract the second element of the page_data arrayparseJson(request_details)["page_data"][1]["url"]
: Extract the URL of the second element of the page_data arrayThe with
control lets you bind a variable to an expression, to simplify an expression. For example, if we wanted to sum the first two ranks, we’d need: parseJson(request_details)["page_data"][0]["rank"] + parseJson(request_details)["page_data"][1]["rank"]
.
But using with
, it simply becomes:
# Alias parseJson(request_details)["page_data"] as "pages"
with(parseJson(request_details)["page_data"], pages,
pages[0]["rank"] + pages[1]["rank"]
)
The filter
control allows you to filter array elements based on a predicate:
The syntax is : filter(array a, variable v, expression e) (returns: array)
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.
Examples:
filter(stra, v, v.length == 3)
: Returns [“Big”]filter(numa, v, v < 3)
: Returns [0, 1, 2]Since filter is an array-expecting function, we don’t need to parseJson here.
A bigger example: we want expression that returns the number of pages with rank >= 2.
# Here we need parseJson because we access an item of the object
length(filter(parseJson(request_details)["page_data"], v, v.rank >= 2))
# Using object notation (but not for filter)
filter(parseJson(request_details)["page_data"], v, v.rank >= 2).length()
The formula language includes the forEach
and forEachIndex
functions to return a new array after applying a function to each element (like the Javascript “map”).
The syntax is: forEeach(array a, variable v, expression e) (returns: array)
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.
An example use case: if you have an array A ["1", "5", "12"]
and want the sum. The sum()
function only accepts numbers. We can use forEach this way:
sum(forEach(A, v, v.toNumber()))
# With object notation. Remember: filter and forEach can't use object notation
forEach(A, v, v.toNumber()).sum()
Let’s say we want to return the length of the concatenation with ‘–’ of all URLs for which rank is 2
(ie, the length of “/u2–/u3”)
We therefore need to:
# First way to write, using functions
# We choose to use 'with' to reduce a bit the size of the formula
with(parseJson(request_details)["page_data"], pd,
length( join( forEach( filter(pd, f, f.rank == 2), e, e["url"] ) , "--") )
)
# Using object notation:
# Remember: filter and forEach can't use object notation
with(parseJson(request_details)["page_data"], pd,
forEach(
filter(pd, f, f.rank == 2),
e, e.url
)
).join("--").length()
In addition to the vast date handling features of DSS data preparation, formulas add additional date processing facilities.
First of all, a note about typing:
2015-01-02T03:04:05.234Z
)In other words, if your column is already recognized as a valid Date
meaning, you should generally not need to use any specific parsing. If you happen to need one, use the asDate()
function
name | alt_name |
---|---|
Z | |
A1 | |
E2 | C3 |
To fill the “blanks” of “name” with the value of “alt_name”, create a Formula processor in data preparation.
Set the output column to be “name” (to overwrite the “name” column) and use this formula:
if (isBlank(name), alt_name, name)