Dataset Audit

This plugin provides a recipe that takes a SQL-based or HDFS-based dataset as input and outputs an audit report of the data it contains.

The output is a dataset with one row per column of the input dataset. For each column, the recipe outputs:

  • Type
  • Cardinality (number of distinct values)
  • Number of missing/empty values
  • Most frequent value and most frequent value count
  • For numerical columns: min, max, avg
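To make the metrics above concrete, here is a minimal sketch in plain Python of how such a per-column audit could be computed in memory. It is not the plugin's code: the function names `audit_column` and `audit_dataset`, the output keys, and the rule that `None` and empty strings count as missing are all assumptions made for illustration.

```python
from collections import Counter


def audit_column(name, values):
    """Compute illustrative audit statistics for one column (a list of values)."""
    # Assumption: None and the empty string both count as missing/empty.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)

    row = {
        "column": name,
        # Assumption: the type is taken from the first non-missing value.
        "type": type(present[0]).__name__ if present else "unknown",
        "cardinality": len(counts),              # number of distinct values
        "missing": len(values) - len(present),   # number of missing/empty values
        "top_value": top_value,
        "top_count": top_count,
    }

    # min, max and avg are only reported when every present value is numeric.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        row.update(
            min=min(numeric),
            max=max(numeric),
            avg=sum(numeric) / len(numeric),
        )
    return row


def audit_dataset(columns):
    """Return one audit row per column of a {column name: values} mapping."""
    return [audit_column(name, values) for name, values in columns.items()]


report = audit_dataset({
    "age": [31, 45, None, 31],
    "city": ["Paris", "", "Lyon", "Paris"],
})
for row in report:
    print(row)
```

Each dictionary in `report` corresponds to one row of the audit output; the `city` row has no `min`, `max`, or `avg` keys because the column is not numerical.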

The recipe uses in-database or in-Hadoop processing, as appropriate for the input dataset.
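For a SQL input, pushing the work to the database means the metrics are computed by aggregate queries rather than by pulling rows out. The sketch below illustrates that idea with Python's built-in `sqlite3` module. The helper `build_audit_query`, the table name `people`, and the result column names are hypothetical, and the queries the plugin actually generates may differ.

```python
import sqlite3


def build_audit_query(table, column, numeric=False):
    """Build one SQL statement that aggregates audit metrics for a column."""
    # Quote identifiers so unusual column names do not break the statement.
    col = '"' + column.replace('"', '""') + '"'
    tbl = '"' + table.replace('"', '""') + '"'
    metrics = [
        f"COUNT(DISTINCT {col}) AS cardinality",
        f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS missing",
    ]
    if numeric:
        metrics += [
            f"MIN({col}) AS min_value",
            f"MAX({col}) AS max_value",
            f"AVG({col}) AS avg_value",
        ]
    return f"SELECT {', '.join(metrics)} FROM {tbl}"


# Hypothetical in-memory table standing in for a SQL dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER)")
conn.executemany(
    "INSERT INTO people (age) VALUES (?)",
    [(31,), (45,), (None,), (31,)],
)

query = build_audit_query("people", "age", numeric=True)
result = conn.execute(query).fetchone()
print(query)
print(result)
conn.close()
```

Only a single row of aggregates leaves the database, which is the point of in-database processing: the cost of the audit scales with the database engine rather than with the client's memory.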

A sample output.

Plugin Information

Version: 0.0.3
Author: Dataiku
Released: 2015/11/13
Last updated: 2015/11/13
License: Apache Software License
Source code: GitHub
Reporting issues: GitHub

More information about the plugin is available in the GitHub repository.

Image by Alan Cleaver – CC BY 2.0
