
Using SparkSQL in DSS

October 01, 2015

SparkSQL is a Spark module for interacting directly in SQL with structured data stored as DataFrames. DSS offers many ways to write or generate SparkSQL, as we are going to see in this article.

Prerequisites

You need access to DSS version 2.1 or later, with Spark enabled, and a working Spark installation, version 1.4 or later.

We'll be using the Diamonds dataset, well known to users of R's ggplot2. The CSV is available at this URL, so your first task is to create a (potentially remote) DSS dataset from this file and parse it into a suitable format.
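If you want to double-check the parsing, you can take a quick look from a Python notebook. The snippet below is a minimal sketch using the dataiku Python API; it assumes the dataset was named diamonds (adjust to whatever name you chose):

import dataiku

# Load the parsed dataset as a pandas DataFrame and inspect it
diamonds = dataiku.Dataset("diamonds")
df = diamonds.get_dataframe()

print(df.dtypes)
print(df.head())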

Writing SparkSQL queries in recipes

The most natural way to use SparkSQL is to create a SparkSQL recipe. From the Flow screen, click on the Diamonds dataset, and select SparkSQL in the right pane:

Create a new output dataset, and write a simple SQL query in the recipe code editor:

SELECT
  carat,
  COUNT(*) AS records,
  AVG(depth) AS depth_avg,
  COUNT(DISTINCT clarity) AS clarity_unique
FROM
  diamonds
GROUP BY
  carat

Hit the green Run button, and wait for the query to finish. That's it! As you may have noticed, DSS takes care of creating the Spark SQL context in the background, as well as loading the dataset into a DataFrame. The DSS user just needs to focus on their business logic, not on the technical details!

Generating SparkSQL from visual recipes

Another handy way to generate SparkSQL is to use a visual recipe, for instance a Group recipe:

Start by defining your grouping key and aggregates using the visual interface:

Click on Output, and hit the VIEW QUERY button. A modal window containing the SQL will show up:

Confirm that you want to convert to a SparkSQL recipe and that you want to update the schema of the output dataset, and you are taken to the code editor of a regular SparkSQL recipe, pre-filled with the SQL code:

You can press the Run button again and wait for your Spark job to complete.

Writing SparkSQL from PySpark or SparkR

Finally, another way to express your logic using SQL on Spark DataFrames is to go through a PySpark or SparkR script (which also has the great advantage of letting you write the query programmatically).

Create a new Jupyter (Python) notebook, and select a template with PySpark starter code. Type the following code in your notebook:

import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext

# Load PySpark
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

# Create the Spark dataframe
diamonds = dataiku.Dataset("diamonds")
df = dkuspark.get_dataframe(sqlContext, diamonds)

# Register the SQL table
df.registerTempTable("diamonds_t")

# Run the query and fetch the results
sqldf = sqlContext.sql("SELECT depth, COUNT(*) FROM diamonds_t GROUP BY depth")
sqldf.show()
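Since the query is just a Python string, you can also build it programmatically, which is one of the advantages mentioned above. The following sketch assumes the notebook code above has already been run (so diamonds_t is registered); the metric column names are taken from the Diamonds dataset:

# Build the aggregation list programmatically
metrics = ["depth", "table", "price"]
aggregates = ", ".join("AVG({0}) AS {0}_avg".format(m) for m in metrics)

# Assemble and run the query
query = "SELECT cut, COUNT(*) AS records, " + aggregates + " FROM diamonds_t GROUP BY cut"
sqlContext.sql(query).show()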

Note that this also works if you want to use SparkR:

library(SparkR)
library(dataiku)
library(dataiku.spark)

# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Read the dataset into a DataFrame
df <- dkuSparkReadDataset(sqlContext, "diamonds")

# Register the SQL table
registerTempTable(df, "diamonds_t")

# Run the query and fetch the results
head(sql(sqlContext, "SELECT depth, COUNT(*) FROM diamonds_t GROUP BY depth"), 20)
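Back on the PySpark side, you are not limited to displaying the results. Assuming you have created an output dataset to write into (diamonds_by_depth is a hypothetical name here), for instance in a PySpark recipe, a sketch along these lines persists the aggregated DataFrame back into DSS:

import dataiku
import dataiku.spark as dkuspark

# Hypothetical output dataset, to be created in DSS beforehand
output = dataiku.Dataset("diamonds_by_depth")

# Write the Spark DataFrame and its schema back to the DSS dataset
dkuspark.write_with_schema(output, sqldf)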

Wrap-up

DSS offers several ways to use SparkSQL. This is really interesting, as SQL is well known to many data analysts, data engineers, and data scientists. It gives users the ability to leverage Spark through a well-adopted interface.