HP Vertica is a high performance analytical database that can be used in conjunction with DSS to create advanced data science workflows.
In this tutorial, we'll help you get started with integrating Vertica in DSS:
- Setting up the connection to Vertica from DSS
- Loading some data into Vertica
- Using DSS visual recipes to process data fully in-Vertica
- Using DSS code recipes
You need to have DSS installed. We also assume that you have already followed the introductory DSS tutorials.
You need to have access to a Vertica instance, with a database where you can read and write. If you do not have a running Vertica instance, you may want to look at the Community Edition, available here once you have created an account.
Install the JDBC driver
Note: you need to have shell access to the machine running DSS for this step. Please ask your administrator if needed.
The Vertica JDBC driver can be downloaded from My Vertica. Download the JDBC jar file ("All Operating Systems").
Once you have downloaded the JAR file, follow the instructions from our reference documentation to enable the driver.
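Enabling the driver usually boils down to placing the jar in the `lib/jdbc` folder of the DSS data directory and restarting DSS; check the reference documentation for the exact procedure for your installation. A minimal sketch of the file placement, using a temporary directory as a stand-in for the real DSS data directory and an empty file standing in for the downloaded jar:

```python
import pathlib
import shutil
import tempfile

# Stand-ins: replace with your real DSS data directory and downloaded jar.
datadir = pathlib.Path(tempfile.mkdtemp())           # e.g. /home/dataiku/dss_data
jar = pathlib.Path(tempfile.gettempdir()) / "vertica-jdbc.jar"
jar.write_bytes(b"")                                 # dummy file standing in for the real driver

# Drivers go into <DATADIR>/lib/jdbc; DSS picks them up after a restart.
jdbc_dir = datadir / "lib" / "jdbc"
jdbc_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(jar, jdbc_dir / jar.name)

print(sorted(p.name for p in jdbc_dir.iterdir()))
```

A restart of DSS is then required for the driver to be picked up.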
Define a connection to Vertica
Note: you must be a DSS administrator for this step. Please ask your administrator if needed.
In the Administration menu of DSS, under the Connections tab, click on "New connection" and select "HP Vertica":
Fill in your credentials and database settings as prompted:
Please note that you'll need a few pieces of information, including your database host, login, and password.
Create a dataset from existing data
If you already have some data in your Vertica instance, you can create a new "external" dataset in DSS that references your Vertica table:
- Go to the Flow
- Click on New dataset and select SQL Databases > Vertica
- Click on "Get tables list" to list all tables in your Vertica database
- Select the table you want to use
- Click on "Test table"
- Verify the data that you obtain
- Click on "Create" to actually create your dataset
- You can now explore and start playing with your data.
Load some data into Vertica
For the rest of this tutorial, we are going to use one of our favorite datasets: an open data dataset of crimes investigated by the San Francisco Police Department.
- Download the SFPD data file:
- Upload it to DSS. Please refer to our 101 tutorial if needed.
An important thing to note: at this point, when you explore the uploaded dataset, you'll notice that even though some columns have numerical meanings, the storage type of all columns is "string". That is because, when the source is a CSV file (which does not really have a built-in schema), DSS does not automatically try to guess a storage type and leaves the "safe" string type. For more information about meanings versus storage types, please see our documentation about meanings and storage types.
Let's fix that:
- Go to the settings of the uploaded dataset
- Go to the "Schema" tab
- Click on the "Infer from data" button
DSS automatically selects the most probable storage type given the content of the column. You'll notice that the types of the numerical columns are now properly set.
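Conceptually, this inference tries progressively stricter types on the values of each column and keeps the narrowest one that fits, falling back to string. A simplified sketch of the idea (not DSS's actual algorithm):

```python
def infer_storage_type(values):
    """Pick the narrowest storage type that fits every value (simplified sketch)."""
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"

# A CSV reader yields everything as strings; inference recovers the types.
print(infer_storage_type(["1", "42", "7"]))        # int
print(infer_storage_type(["37.77", "-122.42"]))    # double
print(infer_storage_type(["LARCENY/THEFT"]))       # string
```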
Save your dataset and head back to the Flow.
- Click on the dataset
- In the right bar, click on Sync. The Sync recipe copies one dataset to another.
- We are now going to create a new output dataset, hosted on Vertica.
- Select your Vertica connection in the right section of the modal, and validate.
Click on the "Run" button to actually execute the Sync. When it's done, click on "Explore output dataset".
The data you are seeing now resides in Vertica! Let's start doing something with it.
Using visual recipes
Let's group our data and compute aggregated statistics.
From the Actions menu in the dataset, click on Group. We are going to group our incidents by Category and compute, for each category, the count of incidents and the "average location".
In the grouping recipe screen, enable the count per group and a few aggregations, then click on Run.
Accept the proposed schema change. The recipe now runs in your database: DSS has actually sent commands and the database processes the aggregation itself. You can view your grouped dataset by exploring it.
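Under the hood, the grouping recipe generates a GROUP BY statement and has the database execute it. Roughly the kind of SQL involved, sketched here with SQLite as an in-memory stand-in for Vertica and hypothetical column names (Category, with X and Y as the location coordinates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfpd (Category TEXT, X REAL, Y REAL)")
conn.executemany(
    "INSERT INTO sfpd VALUES (?, ?, ?)",
    [
        ("LARCENY/THEFT", -122.41, 37.77),
        ("LARCENY/THEFT", -122.43, 37.75),
        ("ASSAULT", -122.40, 37.78),
    ],
)

# Roughly what the grouping recipe sends to the database: one row per
# Category, with the incident count and the average location.
rows = conn.execute(
    """
    SELECT Category, COUNT(*) AS count, AVG(X) AS avg_X, AVG(Y) AS avg_Y
    FROM sfpd
    GROUP BY Category
    ORDER BY count DESC
    """
).fetchall()
for row in rows:
    print(row)
```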
The grouping recipe (and the other visual recipes) can do much more! Head over to the visual recipes documentation for more information.
Interactive querying with the SQL notebook
Note: this section assumes at least basic familiarity with the SQL language.
The SQL notebook is an interactive environment in which you can run SQL queries and see the results immediately. Use it to work iteratively on your data, to get a grasp of what's in there, or to check the results after building a table in the Flow.
Go to the Notebooks section of your project, and click New Notebook.
Select SQL, and your Vertica connection.
You are now on the notebook screen. Click "+ Query".
You can start typing your query and press Run to execute it. The query runs and the results appear.
Here are some things that you can do in the SQL notebook:
- Click "Download" to download or export the results
- Click "+Query" to make other queries while keeping the one you already have
- Hit Ctrl+Space for autocompletion while typing the query
- Hit Ctrl+Enter to run the query
- Use the "Tables" tab at the left to see all tables in your project or in your database. Click on a table name or field name to insert it in the query.
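For instance, a first exploratory query you might type in the notebook, shown here running against an in-memory SQLite stand-in for Vertica (the DayOfWeek column name is an assumption about the SFPD file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfpd (Category TEXT, DayOfWeek TEXT)")
conn.executemany(
    "INSERT INTO sfpd VALUES (?, ?)",
    [
        ("ASSAULT", "Friday"),
        ("ASSAULT", "Saturday"),
        ("VANDALISM", "Friday"),
    ],
)

# The kind of query you would type in the notebook, then run with Ctrl+Enter:
rows = conn.execute(
    """
    SELECT DayOfWeek, COUNT(*) AS incidents
    FROM sfpd
    GROUP BY DayOfWeek
    ORDER BY incidents DESC
    LIMIT 10
    """
).fetchall()
print(rows)
```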
We've seen:
- how you can create new datasets without coding, using our visual recipes
- how you can write SQL to get the results of your queries
You can also create new datasets using SQL code: use SQL recipes!
- Go back to the Flow and select the Vertica "copy" dataset.
- From the right column, click on "SQL"
- Select "SQL query"
Click on "Set" to create the output dataset: give it a name, click "Create new dataset", then "Create" to actually create your recipe.
In this recipe, you need to enter a SELECT query. DSS will automatically take care of transforming your SELECT into a proper INSERT into the target dataset.
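In essence, DSS creates the output table with the computed schema and wraps your SELECT in an INSERT into it. A sketch of that transformation, with SQLite standing in for Vertica and hypothetical table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfpd (Category TEXT)")
conn.executemany(
    "INSERT INTO sfpd VALUES (?)",
    [("ASSAULT",), ("ASSAULT",), ("VANDALISM",)],
)

# The SELECT you write in the recipe:
user_select = "SELECT Category, COUNT(*) AS count FROM sfpd GROUP BY Category"

# Roughly what DSS executes: create the output table from the computed
# schema, then insert the SELECT's results into it (simplified here).
conn.execute("CREATE TABLE sfpd_by_category (Category TEXT, count INTEGER)")
conn.execute(f"INSERT INTO sfpd_by_category (Category, count) {user_select}")

result = conn.execute(
    "SELECT * FROM sfpd_by_category ORDER BY count DESC"
).fetchall()
print(result)
```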
When you're done, click on "Validate". DSS will automatically compute the schema for the output dataset, from your query, and ask you to update the schema of the output dataset.
Accept the schema update, and click Run. Voila, you now have a new dataset (backed by a new table on your Vertica database).
Note: you can also create a SQL query that does not output its data in the database, but in another storage engine! Almost all DSS recipes can go from any kind of dataset to any other kind.
We have only scratched the surface of what you can do in DSS with Vertica.
For example, because Vertica is built for interactive analysis, you can draw Charts using the "Live engine":
This way, all the Charts can be created on the entire dataset, delegating the heavyweight calculations to Vertica:
Vertica is a very good match for DSS, allowing users to delegate heavy computations to an analytical backend (without moving the data around too much). It is also a very good way to distribute data to other applications (Tableau, QlikView, ...), with DSS writing the results of complex cleaning, enrichment, and modeling into Vertica datasets. You can, for instance, read Blablacar's use case, available here.