Dataiku DSS lets you move swiftly from quick data discovery, where you interact with a sample of the data, to robust analysis on the complete data.
When you explore a dataset, the tabular view lets you scroll through a sample only. By default, this sample is the first 10,000 records. All the quick statistics shown in the exploration screen are based on that sample.
You can easily change the sampling size and method, but remember that your sample has to fit into memory for you to get live feedback.
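To make the difference between sampling methods concrete, here is a minimal, self-contained sketch in plain Python. It does not use the DSS API and is not how DSS implements sampling; the function names `head_sample` and `reservoir_sample` are hypothetical. It contrasts taking the first records with drawing a fixed-size random sample in a single pass, both of which keep only the sample in memory.

```python
import random
from itertools import islice


def head_sample(records, n=10_000):
    """Return the first n records of an iterable."""
    return list(islice(records, n))


def reservoir_sample(records, n=10_000, seed=0):
    """Return n records drawn uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            sample.append(record)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < n:
                sample[j] = record
    return sample


# Hypothetical dataset: one million records, here simply their row numbers.
records = range(1_000_000)
head = head_sample(records)
rand = reservoir_sample(records)
print(len(head), len(rand))
```

Both samples hold the same number of records, but the head sample only ever sees the beginning of the dataset, while the random sample covers all of it at the cost of reading every record once.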
Click the header of a column and select “Analysis” in the menu. This opens a modal with instant statistics on the sample. You can also request statistics on the whole dataset, but this is computationally more expensive.
Note that you can also find and define additional statistics on a dataset in its Status tab.
These statistics are called metrics. Standard ones are built in, and you can add your own!
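The following sketch illustrates why statistics on the sample and on the whole data can differ. It is plain Python on a hypothetical, sorted dataset, not a DSS call; `records` and `streaming_mean` are names invented for the illustration. The mean is computed once on the first 10,000 values and once on every value.

```python
from itertools import islice


def records():
    """Hypothetical dataset of 1,000,000 values, sorted in ascending order."""
    for i in range(1_000_000):
        yield i // 1000


def streaming_mean(values):
    """Compute the mean in one pass without loading the values into memory."""
    count, total = 0, 0
    for value in values:
        count += 1
        total += value
    return total / count


# Fast: reads only the first 10,000 values.
sample_mean = streaming_mean(islice(records(), 10_000))
# Slower: reads the whole dataset.
full_mean = streaming_mean(records())
print(sample_mean, full_mean)
```

Because the data is sorted, the head sample only contains the smallest values and its mean is far from the mean of the full data. This is the kind of case where requesting statistics on the whole data, or changing the sampling method, is worth the extra computation.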
While you design your preparation script in a Recipe or in a Lab Visual Analysis, you get real-time visual feedback on your operations. This feedback is computed on a sample of your dataset rather than on the complete data, because processing the complete data would take too much processing power on big data. By default, the sample is the first 10,000 records of your dataset, but you can easily change this sampling.
To obtain the results on the full dataset you will have to run the preparation. Within an analysis, this implies clicking on “Deploy script” and then running the resulting prepare recipe in your flow.
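The workflow of designing on a sample and then running on the full dataset can be sketched in plain Python as follows. This only illustrates the idea; the `prepare` step and the `read_records` source are hypothetical and do not reflect how DSS executes a prepare recipe.

```python
from itertools import islice


def read_records(n):
    """Hypothetical source yielding n raw records."""
    for i in range(n):
        yield {"country": " fr " if i % 2 else "us", "amount": str(i)}


def prepare(record):
    """Hypothetical preparation step: normalize the country code and parse the amount."""
    return {
        "country": record["country"].strip().upper(),
        "amount": float(record["amount"]),
    }


# Design time: apply the step to the first 10,000 records for quick feedback.
preview = [prepare(r) for r in islice(read_records(100_000), 10_000)]

# Run time: apply the same step to every record, streaming through the data.
full_total = sum(prepare(r)["amount"] for r in read_records(100_000))
print(preview[:2], full_total)
```

The same `prepare` function is used in both places, so what you validate on the sample is exactly what is applied to the complete data.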
For more information about deploying an analysis as a recipe, refer to the “From the lab to the flow” page of the documentation.
Charts can be created in many places in Dataiku DSS.
In the “Sampling and Engine” section on the left of the screen, you can set the chart’s underlying data and, when an in-database engine is available, the engine. An in-database engine is available when the underlying dataset is stored in a database with an efficient engine (SQL, MPP database, Impala).