Sampled vs. Complete Data

Applies to DSS 4.0 and above | February 23, 2017

Dataiku DSS lets you move swiftly from quick data discovery, where you interact with a sample of the data, to robust analysis on the complete data.

Dataset exploration

Immediate feedback on the dataset

When you explore a dataset, the tabular view lets you scroll through a sample only. By default, the sample is the first 10,000 records. All the quick statistics shown in the exploration screen are based on that sample.

You can change the sampling size and method easily, but remember that your sample has to fit into memory to get live feedback:

Sample settings controls
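To illustrate why the sampling method matters, here is a minimal sketch in plain Python. It does not use the DSS API; the function names and the sorted toy dataset are assumptions made for illustration only. It compares a "first records" sample with a uniform random sample taken in a single pass (reservoir sampling).

```python
import random
from itertools import islice
from statistics import mean


def head_sample(records, n):
    """Return the first n records, like a 'first records' sample."""
    return list(islice(records, n))


def reservoir_sample(records, n, seed=0):
    """Return n records drawn uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # bounds are inclusive
            if j < n:
                sample[j] = rec
    return sample


# Hypothetical dataset: 100,000 values stored in ascending order.
data = range(100_000)

head = head_sample(data, 10_000)
rand = reservoir_sample(data, 10_000)

full_mean = mean(data)   # statistic on the complete data
head_mean = mean(head)   # biased here, because the data is sorted
rand_mean = mean(rand)   # close to the full mean
```

On sorted data such as this, the head sample only sees the smallest values, so its mean is far from the mean of the complete data, while the random sample stays close to it. Both samples hold the same number of records in memory.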

Statistics on a column

Click the header of a column and select “Analysis” in the menu. This opens a modal with instant statistics computed on the sample. You can also request statistics on the whole data, but this is computationally costlier:

Setting the sample for computing statistics on a column
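The difference between statistics on the sample and statistics on the whole data can be sketched as follows. This is illustrative Python only, not how DSS computes its statistics; `column_stats` and the toy `column` generator are hypothetical names introduced for this example.

```python
from itertools import islice


def column_stats(values):
    """Compute count, missing count, distinct count, min and max in one pass."""
    count = missing = 0
    distinct = set()
    lo = hi = None
    for v in values:
        count += 1
        if v is None:
            missing += 1
            continue
        distinct.add(v)
        lo = v if lo is None or v < lo else lo
        hi = v if hi is None or v > hi else hi
    return {
        "count": count,
        "missing": missing,
        "distinct": len(distinct),
        "min": lo,
        "max": hi,
    }


def column():
    """Hypothetical column of 50,000 values with an outlier late in the data."""
    for i in range(50_000):
        if i % 1_000 == 999:
            yield None        # a missing value every 1,000 records
        elif i == 42_000:
            yield 10_000      # outlier located outside the first 10,000 records
        else:
            yield i % 100


# Fast: only the first 10,000 records are read.
sample_stats = column_stats(islice(column(), 10_000))

# Costlier: every record is read, but the outlier is found.
full_stats = column_stats(column())
```

In this toy case the sample reports a maximum of 99, while the pass over the whole data finds the outlier at 10,000, at the price of reading five times as many records.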

Additional information on a dataset

Note that you can also find and define additional statistics on a dataset in the Status tab:

Dataset status tab, displaying metrics

These statistics are called metrics. Standard ones are built in, and you can add your own.

Data preparation

While you design your preparation script in a Recipe or in a Lab Visual Analysis, you get real-time visual feedback on your operations. This feedback is computed on a sample of your dataset rather than on the complete data, because processing the complete data at every change would take too much processing power on large datasets. By default, the sample is the first 10,000 records of your dataset, but you can change this sampling easily:

Sample settings controls in a visual analysis

To obtain the results on the full dataset, you have to run the preparation. Within an analysis, this means clicking on “Deploy script” and then running the resulting prepare recipe in your flow.
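The "design on a sample, then run on the full data" workflow can be pictured with a small sketch. This is plain Python under stated assumptions, not the DSS preparation engine: the step functions, `SCRIPT`, `apply_script`, and `source_rows` are all hypothetical names chosen for illustration.

```python
from itertools import islice


def strip_name(row):
    """Preparation step: trim whitespace around the name."""
    return {**row, "name": row["name"].strip()}


def add_total(row):
    """Preparation step: add a computed column."""
    return {**row, "total": row["price"] * row["quantity"]}


# The "script" is an ordered list of steps.
SCRIPT = [strip_name, add_total]


def apply_script(rows, script):
    """Lazily apply every step of the script to each row."""
    for row in rows:
        for step in script:
            row = step(row)
        yield row


def source_rows(n=25_000):
    """Hypothetical input dataset of n rows."""
    for i in range(n):
        yield {"name": f"  item-{i} ", "price": 2, "quantity": i}


# Design time: quick preview on the first 10,000 rows only.
preview = list(apply_script(islice(source_rows(), 10_000), SCRIPT))

# Run time: the same script applied to the complete data.
full_output = list(apply_script(source_rows(), SCRIPT))
```

The script itself is identical in both cases; only the amount of data it is applied to changes, which is why the preview is fast and the full run is the step that produces the complete output.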

For more information about deploying an analysis as a recipe, refer to the “From the lab to the flow” page of the documentation.

Data visualization

Charts can be created in many places in Dataiku DSS.

  • Within a Lab Visual Analysis: to immediately visualize the sample of data and process it on the fly.
  • On top of datasets: to create charts on either samples or the full data, the latter by using an in-database engine.
  • As a dashboard insight: to share charts with dashboard readers. These charts are based on either samples or, if an in-database engine is available, the full data.

Charts on a dataset using an in-database engine

In the “Sampling and Engine” section on the left of the screen, you can set the chart’s underlying data (a sample or the full data) and the engine. An in-database engine is available when the underlying dataset is stored in a database with an efficient query engine (SQL database, MPP database, Impala).

Sample settings controls on a chart built in-database
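The idea behind an in-database engine can be sketched with Python's standard `sqlite3` module. This is only an illustration of pushing a chart's aggregation down to the database, not a description of how DSS generates its queries; the `orders` table, its columns, and the helper function are assumptions made for this example.

```python
import sqlite3

# Hypothetical table standing in for a dataset stored in a SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
rows = [("FR" if i % 3 == 0 else "US", float(i % 10)) for i in range(30_000)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)


def aggregate_in_memory(records):
    """Count and sum amounts per country in Python, on records held in memory."""
    totals = {}
    for country, amount in records:
        count, total = totals.get(country, (0, 0.0))
        totals[country] = (count + 1, total + amount)
    return totals


# Sample-based chart: fetch the first 10,000 rows, then aggregate in memory.
sample = conn.execute(
    "SELECT country, amount FROM orders ORDER BY rowid LIMIT 10000"
).fetchall()
sample_totals = aggregate_in_memory(sample)

# In-database chart: the database aggregates the full data and returns
# only one row per bar of the chart.
in_db_totals = {
    country: (count, total)
    for country, count, total in conn.execute(
        "SELECT country, COUNT(*), SUM(amount) FROM orders GROUP BY country"
    )
}
```

With the in-database approach, only the aggregated rows travel back to the client, so the chart can reflect the full data without loading it into memory.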