HP Vertica is a high-performance analytical database that can be used in conjunction with DSS to create advanced data science workflows.
In this tutorial, we’ll help you get started with integrating Vertica with DSS:
You need to have DSS installed. We assume that you have already completed the following tutorials:
You need to have access to a Vertica instance, with a database where you can read and write. If you do not have a running Vertica instance, you may want to look at the Community Edition, available here once you have created an account.
Note: you need to have shell access to the machine running DSS for this step. Please ask your administrator if needed.
The Vertica JDBC driver can be downloaded from My Vertica. Download the JDBC jar file (“All Operating Systems”).
Once you have downloaded the JAR file, follow the instructions from our reference documentation to enable the driver.
Note: you must be a DSS administrator for this step. Please ask your administrator if needed.
In the Administration menu of DSS, under the Connections tab, click on “New connection” and select “HP Vertica”:
Fill in your credentials and database settings as prompted:
Please note that you’ll need to know a few pieces of information, including your database host, login, and password.
If you already have some data in your Vertica instance, you can create a new “external” dataset in DSS that references your Vertica table:
For the rest of this tutorial, we are going to use one of our favorite datasets: an open dataset of crimes investigated by the San Francisco Police Department.
At this point, when you explore the uploaded dataset, you’ll notice that even though some columns have numerical meanings, the storage type of all columns is “string”. This is because, when the source is a CSV file (which has no built-in schema), DSS does not try to guess a storage type and leaves the “safe” string type. For more information about meanings versus storage types, please see our documentation about meanings and storage types.
Let’s fix that:
DSS automatically selects the most probable storage type given the content of the column. You’ll notice that the types of the numerical columns are now properly set.
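These storage types matter because they will determine the column types of the table created in Vertica. As a rough sketch (the table name and column types below are assumptions based on the SFPD dataset, not the exact DDL that DSS generates):

```sql
-- Sketch of the kind of table the Sync recipe (next step) creates in Vertica.
-- Table name and column types are assumptions based on the SFPD dataset.
CREATE TABLE sfpd_incidents (
    "IncidntNum" INT,
    "Category"   VARCHAR(100),
    "DayOfWeek"  VARCHAR(20),
    "X"          FLOAT,  -- longitude, now stored as a number
    "Y"          FLOAT   -- latitude, now stored as a number
);
```

With proper numeric types, Vertica can aggregate these columns natively instead of treating everything as text.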
Save your dataset and head back to the Flow. Create a Sync recipe from your dataset, choosing your Vertica connection for the output.
Click on the “Run” button to actually execute the Sync. When it’s done, click on “Explore output dataset”.
The data you are seeing now resides in Vertica! Let’s start doing something with it.
Let’s group our data and compute aggregated statistics.
From the Actions menu in the dataset, click on Group. We are going to group our incidents by Category and compute, for each category, the count of incidents and the “average location”.
In the grouping recipe screen, enable the count per group and a few aggregations, then click on Run.
Accept the proposed schema change. The recipe now runs in your database: DSS sends SQL commands to Vertica, and the database processes the aggregation itself. You can view your grouped dataset by exploring it.
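To give you an idea of what happens behind the scenes, the query that DSS pushes to Vertica for this grouping has roughly the following shape (a simplified sketch, not the exact generated SQL; the table and column names sfpd_incidents, Category, X, and Y are assumptions based on this dataset):

```sql
-- Simplified sketch of the aggregation the grouping recipe delegates to Vertica.
-- Table and column names are assumptions based on the SFPD dataset.
SELECT
    "Category",
    COUNT(*) AS incident_count,  -- count of incidents per category
    AVG("X") AS avg_longitude,   -- “average location”: mean longitude...
    AVG("Y") AS avg_latitude     -- ...and mean latitude
FROM sfpd_incidents
GROUP BY "Category";
```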
The grouping recipe (and the other visual recipes) can do much more! Head over to the visual recipes documentation for more information.
Note: this section assumes at least basic familiarity with the SQL language.
The SQL notebook is an interactive environment in which you can run SQL queries and see the results immediately. Use it to work iteratively on your data, to get a grasp of what’s in there, or to check the results after building a table in the Flow.
Go to the Notebooks section of your project, and click New Notebook.
Select SQL, and your Vertica connection.
You are now on the notebook screen. Click “+ query”.
You can start typing your query and press Run to execute it. The query runs and the results appear.
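For example, assuming the table created by your Sync recipe is named sfpd_incidents (adjust this to your actual table name), you could count incidents per day of the week:

```sql
-- Count incidents per day of week; the table name is an assumption.
SELECT "DayOfWeek", COUNT(*) AS incidents
FROM sfpd_incidents
GROUP BY "DayOfWeek"
ORDER BY incidents DESC;
```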
Here are some things that you can do in the SQL notebook:
We’ve seen:
* how you can create new datasets without coding, using our visual recipes
* how you can write SQL to get the results of your queries.
You can also create new datasets using SQL code: use SQL recipes!
Create a new SQL recipe from your Vertica dataset. Click on “Set” to create the output dataset, give it a name, and click “Create new dataset”, then “Create” to actually create your recipe.
In this recipe, you need to enter a SELECT query. DSS will automatically take care of transforming your SELECT into a proper INSERT into the target dataset.
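For example, the following query (table and column names are assumptions based on the SFPD dataset) would build an output dataset of incident counts per police district:

```sql
-- DSS wraps this SELECT into an INSERT into the output table.
-- Table and column names are assumptions based on the SFPD dataset.
SELECT
    "PdDistrict",
    COUNT(*) AS incident_count
FROM sfpd_incidents
GROUP BY "PdDistrict";
```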
When you’re done, click on “Validate”. DSS will automatically compute the schema of the output dataset from your query, and ask you to update the output dataset’s schema.
Accept the schema update, and click Run. Voila, you now have a new dataset (backed by a new table on your Vertica database).
Note: you can also create a SQL recipe that writes its output not to the database, but to another storage engine! Almost all DSS recipes can go from any kind of dataset to any other kind.
We have only scratched the surface of what you can do in DSS with Vertica.
For example, because Vertica is built for interactive analysis, you can draw Charts using the “Live engine”:
This way, all the Charts can be created on the entire dataset, delegating the heavyweight calculations to Vertica.
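For instance, a bar chart of records per category translates into an aggregate query that Vertica executes directly, along these lines (a sketch; the names are assumptions):

```sql
-- Sketch of the aggregation a Live-engine chart delegates to Vertica.
SELECT "Category", COUNT(*) AS record_count
FROM sfpd_incidents
GROUP BY "Category"
ORDER BY record_count DESC
LIMIT 20;  -- charts typically cap the number of displayed categories
```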
Vertica is a very good match for DSS, allowing users to delegate heavy computations to an analytical backend without moving data around too much. It is also a very good way to distribute data to other applications (Tableau, QlikView, …), with DSS writing the results of complex cleaning, enrichment, and modeling into Vertica datasets. You can, for instance, read the BlaBlaCar use case available here.
Applies to: DSS 3.0 and above