howto

Prepare recipe - applying a step to multiple columns

June 27, 2017

There will be times when you need to process multiple columns with the same operation. There are a few ways to accomplish this within a visual preparation script.

From the column view

In the Column View, you can select multiple columns and choose an operation from the Actions menu to apply to the selected columns. This creates a new step in the Script, and is essentially a shortcut to manually creating the step. It is especially useful for processors that cannot be extended to multiple columns in the step editor.

For example, you may have several columns that relate to a customer’s home contact information. In order to help differentiate this from their work contact information, you can go to the Column View, select the columns, select Rename from the Actions menu, and then add home_ as a prefix to the columns.

The new step in the script is a Rename processor that renames each column individually.

From the step editor

Some processors can be extended to multiple columns within the step editor. These processors have several modes for specifying the column.

For example, you may want to fill empty cells of a column with the value Unknown. With the mode set to:

  • Single, specify a single column by name, and it fills empty cells in that column. In the example below, home_zip is the selected column.
  • Multiple, specify a list of columns by name, and it fills empty cells in each of those columns. In the example below, home_zip and work zipcode are the selected columns.
  • Pattern, specify a regular expression. The processor operates on all columns that match the pattern. In the example below, the pattern .*zip.* selects home_zip and work zipcode.
  • All, the processor operates on all columns.

Within a Python function

It is also possible to process multiple columns using a Python function. This is especially useful for processors that cannot be extended to multiple columns within the step editor.

In the example below, the process function adds 5 to the integer-valued columns.

import ast
def process(row):
    # In 'row' mode, the process function must return the full row.
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    for i in row.keys():
        try:
            isint = type(ast.literal_eval(row[i])) is int
        except:
            isint = False
        if isint:
            row[i] = int(row[i]) + 5
    return row