Prepare recipe - data exploration

June 27, 2017

The Prepare recipe has many features for exploring a dataset as you enrich and transform it. A typical progression is to gain an understanding of the columns in your dataset, the distribution of values within columns of interest, and then to explore values and patterns of values within the dataset.

Exploring columns

The Columns view is useful when working with many columns. Using filtering and sorting, you can discover columns that are similar and find columns you’re looking for.

You can filter columns in three ways:

  • Text filters update as you type to show only columns whose names contain the typed text.
  • Meaning filters show only columns with the selected meaning. When a meaning encompasses a sub-meaning, such as Text, which includes Natural Language, columns with the sub-meaning are included in the filter.
  • Status filters show only columns with all valid values, with at least one invalid value, or with at least one missing value. This lets you quickly identify which columns are clean and which have issues.

You can sort columns by various criteria, some of which apply only to columns with a numeric meaning. It’s generally useful to display the sort criteria in the column under the sort menu.

Any filtering and sorting you apply is cumulative.
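To make the cumulative behavior concrete, here is a minimal Python sketch of how the three column filters and a sort could combine. It is only an illustration of the logic described above, not how the Columns view is implemented: the column summaries, the field names, the `SUB_MEANINGS` mapping, the status labels, and the case-insensitive text match are all assumptions made for the example.

```python
# Hypothetical column summaries; the field names are assumptions for illustration.
columns = [
    {"name": "customer_id", "meaning": "Integer", "invalid": 0, "missing": 0},
    {"name": "customer_comment", "meaning": "Natural Language", "invalid": 0, "missing": 12},
    {"name": "order_total", "meaning": "Decimal", "invalid": 3, "missing": 0},
    {"name": "customer_city", "meaning": "Text", "invalid": 0, "missing": 0},
]

# Assumed parent -> sub-meaning mapping, mirroring the Text / Natural Language example.
SUB_MEANINGS = {"Text": {"Natural Language"}}


def matches_text(col, text):
    # Assumption: the name match is case-insensitive.
    return text.lower() in col["name"].lower()


def matches_meaning(col, meaning):
    # A column matches if it has the meaning itself or one of its sub-meanings.
    return col["meaning"] == meaning or col["meaning"] in SUB_MEANINGS.get(meaning, set())


def matches_status(col, status):
    # Assumption: a column with missing values is not counted as "all_valid".
    if status == "all_valid":
        return col["invalid"] == 0 and col["missing"] == 0
    if status == "has_invalid":
        return col["invalid"] > 0
    if status == "has_missing":
        return col["missing"] > 0
    raise ValueError(f"unknown status: {status}")


def filter_columns(cols, text=None, meaning=None, status=None, sort_key="name"):
    """Apply every filter that was provided, then sort the remaining columns."""
    result = cols
    if text is not None:
        result = [c for c in result if matches_text(c, text)]
    if meaning is not None:
        result = [c for c in result if matches_meaning(c, meaning)]
    if status is not None:
        result = [c for c in result if matches_status(c, status)]
    return sorted(result, key=lambda c: c[sort_key])


names = [c["name"] for c in filter_columns(columns, text="customer", meaning="Text")]
print(names)
```

Each filter narrows the result of the previous one, so adding a status filter to the call above would further reduce the list rather than replace the earlier filters.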

The Table view lets you jump to a column by typing c and then typing part of the column's name. The dropdown updates as you type to show the columns whose names contain the typed text. You can also choose to display only a selection of columns.

Distributions of values in columns

There are two ways to explore the distributions of values in columns:

  • Quick column stats shows histograms for each column and gives you a quick view of the distributions of several columns at once.
  • The Analyze dialog provides greater detail than the quick column stats, along with the ability to take actions based on your findings. Note: in the Prepare recipe, the summaries provided by the Analyze dialog are always based on the design sample. To see results for the whole dataset, open the Analyze dialog from a dataset, not from a recipe.

Exploring values

Using coloring, filtering, and highlighting, you can zero in on values of interest in the Table view.

By default, cells are colored by meaning validity, with red for cells that don’t match the column's meaning. You can also color cells by column values:

  • Numeric column values are binned and colored with increasing intensity from low to high values.
  • Categorical column values are colored with a different color for each of the most commonly occurring categories, and no color (white) for all other categories.
  • Columns with mostly unique values are shaded light grey for all values.
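The three coloring rules above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the product's actual algorithm: equal-width binning, the number of bins, the number of colored categories (`top_n`), and the uniqueness threshold are all hypothetical choices made for the example.

```python
from collections import Counter


def color_numeric(values, bins=5):
    """Return a bin index per value; a higher index stands for a more intense color.

    Assumption: equal-width bins between the minimum and maximum value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def color_categorical(values, top_n=2):
    """Return a color index for the most common categories, None (no color) otherwise."""
    top = [category for category, _ in Counter(values).most_common(top_n)]
    return [top.index(v) if v in top else None for v in values]


def is_mostly_unique(values, threshold=0.9):
    """Assumption: 'mostly unique' means the share of distinct values meets a threshold."""
    return len(set(values)) / len(values) >= threshold


print(color_numeric([0, 5, 10]))
print(color_categorical(["a", "b", "a", "c", "a", "b"]))
print(is_mostly_unique([1, 2, 3, 4]))
```

In this sketch, a column for which `is_mostly_unique` returns true would get a single light grey shade instead of per-value colors.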

Using a combination of color shading and column selection, you can visually scan for patterns of values across columns of interest.

You can filter values:

  • Globally, using the search bar, or
  • By column.

Any coloring, filtering and sorting you apply is cumulative.

When a value is very long, you can select Show complete value, or use the Shift + v shortcut, to display the full cell contents so that it is easier to copy. Note: triple-clicking on a cell also selects the full cell contents, even if the contents are not entirely displayed.

You can also highlight a row of interest by selecting Toggle row highlight, or using the Shift + h shortcut.