Amazon Redshift is one of the analytical database DSS can easily work with. This article shows how to integrate these 2 technologies.
We’ll assume here that you have access to a Redshift instance (otherwise see the docs on how to create one from your Amazon AWS console), and that you have access to a S3 bucket with the proper “write” privileges.
Even if this article is quite long, mostly because of the complex nature of the dataset used, loading data into Redshift with DSS and working with it is as simple as:
That’s it. DSS will take care of the plumbing for you and let you focus on analyzing your data.
Please read the steps below for a more complete description.
Go to the Administration interface in DSS, and create a new “Redshift” connection:
Fill in your credentials as prompted:
Please note that you’ll need to know a few information, including a database password.
From the Administration interface again, and create a new “Amazon AWS (S3)” connection:
Fill in your credentials as prompted:
Please note that you’ll need to know a few information, including your Amazon credentials, and a bucket where you can read and write data.
In this example, we are going to use Github data. Specifically, we’ll use data from the GitHub Archive, which is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. These data are very interesting since they provide very detailed and granular events, letting us looking for interesting insights from open source software projects.
We have access here to 6 months of Github activity history (from January to June 2015), stored on our server as JSON files split by hour (one hour == one JSON file).
As usual, the first step is to create a DSS Dataset on top of these files:
Please note that the selected format is “One record per line”, since we do not want to flatten the JSON file yet due to its very rich nature.
Also, we define a partitioning scheme, allowing us to efficiently work on daily subsets of data:
You can now save the input dataset (named “github” here):
Note that you see here the raw JSON content, with one event per line, stored in a fairly large 30Gb dataset (compressed) with 90 millions records for this 6 months period. Also, the column has automatically be named “line”.
Loading data effectively into Redshift requires the files to be available on S3, along with a few other constraints well documented here.
For the sake of simplicity, we will not use all the information available in the initial dataset. To create a new dataset with a suitable (tabular) format for Redshift, create a new Analyze script on the initial dataset.
As we are only interested, for now, in looking at the global activity and popularity of the Github repos, the visual data preparation script we build does the following:
This script can now be deployed, and the resulting dataset written on Amazon S3. Click on the “Deploy Script” button, and store the newly created “github_s3” dataset in your S3 connection.
A new recipe has been created, with the local “github” dataset as input, and the “github_s3” dataset as output. Note that the daily partitionning scheme of the input dataset has automatically been copied to the output.
Before actually building your dataset (i.e putting your data on S3), and to comply with the Redshift constraints, click on the link to the new “github_s3” dataset, and go to the “Settings” tab. Change the quoting style to “Escaping only”, and save your changes.
Go back to the Flow screen (click on the button in the nav bar or hit “g + f” on your keyboard), and click on the “github_s3” dataset icon. In the right panel, click on “Build”, and load a few days to begin, the month of January 2015 for instance:
Once everything set, hit the “Build” button. A new job is launched, that will:
With our servers, it takes approximatively 2 minutes to load a month worth of data:
That’s it, your data sits now in S3 and is ready to be loaded in Redshift:
Loading your dataset into Redshift will now be very easy. Just go back to the Flow screen again, and create a new “Sync” recipe from the “github_s3” dataset:
This will create a new “github_t” dataset, stored in your Redshift connection.
The summary screen showing up will let you see again the input / output of your Sync recipe (note that again the partitionning scheme is copied), and build (i.e copy your S3 data to Redshift) your dataset directly from there:
Hit the “Run” button and wait for the job to complete:
A little bit less than 2 minutes to wait, and your data is in Redshift! And note that the resulting table is partitionned, and we didn’t even have to care about its schema.
Where to go from here?
You can for instance easily create Charts on the Redshift dataset. Because Redshift can be used interactively thanks to its performance, Charts can be switched to “Live processing” (i.e Redshift will do most of the aggregation work):
Because live processing is enabled, Charts will be based on the entire dataset:
You may also want to analyse interactively the dataset with your own SQL queries, using a SQL notebook:
And finally, you could also write a SQL recipe in a larger workflow, for instance if you were to use Redshift to aggregate and reduce the size the base data, before passing it to R or Python for deeper analysis, or create a Model from it:
We’re done. A few steps will take your data into Redshift, and you’ll be good for large scale analysis in a matter of minutes.