Partitioning data with Vertica

August 24, 2015

HP Vertica is a high performance analytical database that can be used in conjunction with DSS to create complex data science workflows.

In this tutorial, we’ll be using Vertica to process partitioned data. Our documentation contains more information about partitioned data.


You need to already have the DSS and Vertica integration set up.

Please refer to our Howto on that topic.

Some familiarity with DSS and SQL is assumed.

Get some (big) data

In this example, we are going to use GitHub data. Specifically, we’ll use data from the GitHub Archive, a project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. This data is particularly interesting because it provides very detailed and granular events, letting us look for interesting insights into open source software projects.

We have access here to six months of GitHub activity history (January to June 2015), stored on our server as JSON files split by hour (one hour == one JSON file).
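To make the layout concrete, here is a minimal sketch of how hourly archive files map to the daily partitions we will use below. The file names follow the GitHub Archive convention of one gzipped JSON file per hour; the mapping logic itself is purely illustrative.

```python
from datetime import datetime

# Hypothetical hourly file names, in the GitHub Archive naming convention
# (one gzipped JSON file per hour, e.g. "2015-01-15-9.json.gz").
files = ["2015-01-15-9.json.gz", "2015-01-15-10.json.gz", "2015-01-16-0.json.gz"]

def daily_partition(filename):
    """Map an hourly archive file to its daily partition id (YYYY-MM-DD)."""
    date_part = "-".join(filename.split("-")[:3])  # drop the hour suffix
    # Round-trip through datetime to validate the date string
    return datetime.strptime(date_part, "%Y-%m-%d").strftime("%Y-%m-%d")

partitions = sorted({daily_partition(f) for f in files})
# Two distinct days -> two daily partitions
```

In DSS, this grouping is expressed declaratively through the partitioning pattern on the dataset rather than in code, but the effect is the same: 24 hourly files roll up into one daily partition.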

As usual, the first step is to create a DSS Dataset on top of these files:

Dataset preview of GitHub activity history

Please note that the selected format is “One record per line”, since we do not want to flatten the JSON just yet given its very rich, nested structure.

Also, we define a partitioning scheme, allowing us to efficiently work on daily subsets of data:

Defining the partitioning scheme for the dataset

You can now save the input dataset (named “github” here):

Saving the input dataset

Note that you see here the raw JSON content, with one event per line, stored in a fairly large 30 GB dataset (compressed) holding 90 million records for this six-month period. Also note that the column has automatically been named “line”.

Load your data in Vertica

Loading data into Vertica is pretty simple. If you were simply copying an input dataset to a Vertica dataset (i.e. table), you could use a “Sync” recipe. In our case, however, the input data is JSON with a complex nested structure that does not fit properly into a flat table.

We can transform the base data using a visual preparation script, available after creating a new “Analysis” on the dataset. This script will mostly use JSON flattening processors to extract the information we are looking for:

  • the type and time of each event
  • who performed the event
  • which GitHub repo the event occurred on
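The flattening step above can be sketched in a few lines of plain Python. This is an illustrative stand-in for the DSS processors, assuming the public GitHub event structure (`type`, `created_at`, `actor`, `repo`); the exact set of output columns in the real script may differ.

```python
import json

# A sample raw line, shaped like a public GitHub timeline event.
raw_line = json.dumps({
    "type": "PushEvent",
    "created_at": "2015-03-01T12:00:00Z",
    "actor": {"id": 42, "login": "octocat"},
    "repo": {"id": 7, "name": "octocat/hello-world"},
})

def flatten_event(line):
    """Flatten one raw JSON event into a flat dict of columns."""
    event = json.loads(line)
    return {
        "type": event["type"],                    # type of event
        "created_at": event["created_at"],        # time of event
        "actor_id": event["actor"]["id"],         # who did it
        "actor_login": event["actor"]["login"],
        "repo_id": event["repo"]["id"],           # on which repo
        "repo_name": event["repo"]["name"],
    }

row = flatten_event(raw_line)
```

Nested objects like `actor` and `repo` become prefixed scalar columns, which is exactly the tabular shape Vertica expects.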

Prepare recipe script to transform the data

We end up with a dataset of 9 columns, ready to be pushed to Vertica. Click on “Deploy script” (the yellow button) and create a new dataset (called “github_t” here) that will be stored in your Vertica connection (i.e. in a Vertica table):

Deploying the Prepare recipe

You will be taken to a summary screen showing the input dataset, the output dataset, and an option to actually build this output dataset (the “Run” button). Note that the partitioning scheme has been automatically copied to the output dataset: the resulting Vertica table will be partitioned.
We’ll load three months’ worth of data for now, so specify the dates you want to build using “Edit run options” next to the “Run” button and the calendar that appears. Once done, click on “Run”:
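Choosing a date range here amounts to enumerating the daily partitions to build. A quick sketch of that enumeration (the dates are an assumption, matching the first three months of our six-month window):

```python
from datetime import date, timedelta

def daily_partitions(start, end):
    """List daily partition ids (YYYY-MM-DD) between two dates, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]

parts = daily_partitions(date(2015, 1, 1), date(2015, 3, 31))
# 31 + 28 + 31 = 90 daily partitions for January through March 2015
```

So the single “Run” click here will actually schedule 90 partition builds, one per day of data.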

Choosing the start and end dates for data to load

The job starts. Click on “view details” in the blue bar to be taken to the Jobs menu, where you can follow the progress of your task as well as any errors:

Job details for the Vertica dataset build

After a few minutes (depending on your server infrastructure), the job completes and the data is loaded into Vertica.

Analyze your data

One of the very first things you can do is simply browse your dataset. Once your job has completed, click on “Explore output dataset” at the top right of the screen and you will be taken to the dataset view.

If you click on the “Status” tab, you can check the number of records in your dataset, both overall and per (daily) partition:
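What that per-partition count amounts to can be sketched as a simple tally over each event’s daily partition id. This is illustrative only (DSS computes it on the Vertica side); the sample events are hypothetical.

```python
from collections import Counter

# A few hypothetical flattened events, each carrying its timestamp.
events = [
    {"created_at": "2015-03-01T12:00:00Z"},
    {"created_at": "2015-03-01T13:30:00Z"},
    {"created_at": "2015-03-02T09:00:00Z"},
]

# The daily partition id is just the date prefix of the timestamp.
counts = Counter(e["created_at"][:10] for e in events)
```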

Examining the partitions in the dataset

We are done. Vertica is a very good match for DSS, allowing users to delegate heavy computations to an analytical backend (without moving data around too much).

Partitioning makes it easy to process data incrementally. Please refer to our reference documentation for more information about partitioning.
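The incremental pattern boils down to building only the partitions that don’t exist yet. A minimal sketch of that idea, assuming you can list the partitions available in the source and those already built in the target (both lists are hypothetical here):

```python
def partitions_to_build(available, already_built):
    """Return the source partitions not yet present in the target, in order."""
    return sorted(set(available) - set(already_built))

# E.g. a new day of data has arrived since the last build:
todo = partitions_to_build(
    ["2015-01-01", "2015-01-02", "2015-01-03"],  # in the source
    ["2015-01-01", "2015-01-02"],                # already in Vertica
)
```

DSS applies this logic for you when you request a range of partitions to build, so each scheduled run only touches the new days.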