SparkSQL is a Spark module for working with structured data, letting you query DataFrames directly in SQL. DSS offers many ways to write or generate SparkSQL, as we are going to see in this article.
This article assumes you have access to DSS version 2.1 or later, with Spark enabled, and a working Spark installation, version 1.4 or later.
We’ll be using the Diamonds dataset, well known to users of R’s ggplot2 package. The CSV is available at this url, so your first task is to create a (potentially remote) DSS dataset from this file, and parse it into a suitable format:
The most natural way to use SparkSQL in DSS is through a dedicated code recipe. From the Flow screen, click on the Diamonds dataset, and select SparkSQL in the right pane:
Create a new output dataset, and write a simple SQL query in the recipe code editor:
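For example, a query like the following sketch would work here. The column names (`cut`, `price`) are assumed from the standard diamonds dataset; adjust them to match your parsed schema, and note that the table name in the query must match your input dataset name:

```sql
-- Average price and number of diamonds per cut quality
-- (column names assumed from the standard diamonds dataset)
SELECT cut,
       COUNT(*)   AS nb_diamonds,
       AVG(price) AS avg_price
FROM diamonds
GROUP BY cut
```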
Hit the green Run button, and wait for the query to finish. That’s it! As you may have noticed, DSS takes care of creating the Spark SQL context in the background, as well as loading the dataset into a DataFrame. DSS users just need to focus on their business logic, not on the technical details!
Another handy way to generate SparkSQL is through a visual recipe, for instance a Group recipe:
Start by defining your grouping key and aggregates using the visual interface:
Click on Output, and hit the VIEW QUERY button. A modal window containing the SQL will show up:
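The generated SQL will look something like the following sketch. This is an assumed example (grouping on `cut` with an average of `price`); the exact query DSS generates depends on the keys and aggregates you configured:

```sql
-- Example of SQL generated by a Group recipe,
-- assuming a grouping key of `cut` and an AVG aggregate on `price`
SELECT `cut`,
       COUNT(*)     AS `count`,
       AVG(`price`) AS `price_avg`
FROM `diamonds`
GROUP BY `cut`
```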
Confirm that you want to convert to a SparkSQL recipe and to update the schema of the output dataset, and you are taken to the code editor of a regular SparkSQL recipe, pre-filled with the generated SQL code:
You can press the Run button again and wait for your Spark job to complete.
Finally, another way to express your logic using SQL on Spark DataFrames is to go through a PySpark or SparkR script (which has the added advantage of letting you build queries programmatically).
Create a new Jupyter (Python) notebook, and select a template with PySpark starter code. Type the following code in your notebook:
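A minimal sketch of what such a notebook could contain is shown below. It assumes your DSS dataset is named `diamonds` and uses the Spark 1.x API (`SQLContext`, `registerTempTable`); the temporary table name and the query itself are illustrative:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the Spark and SQL contexts (done for you in the starter template)
sc = SparkContext()
sqlContext = SQLContext(sc)

# Load the DSS dataset as a Spark DataFrame
# (dataset name "diamonds" is an assumption -- use your own)
dataset = dataiku.Dataset("diamonds")
df = dkuspark.get_dataframe(sqlContext, dataset)

# Register the DataFrame as a temporary table so it can be queried in SQL
df.registerTempTable("diamonds")

# Run a SQL query against it and display the result
result = sqlContext.sql(
    "SELECT cut, AVG(price) AS avg_price FROM diamonds GROUP BY cut"
)
result.show()
```

Because the query is an ordinary Python string, you can build it dynamically, for example by interpolating a list of columns to aggregate.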
Note that the same approach also works in a SparkR notebook, using SparkR’s own functions to load the DataFrame and run the SQL query.
DSS offers several ways to use SparkSQL. This is really interesting, as SQL is known to many data analysts, data engineers, and data scientists. It gives users the ability to leverage Spark through a well-adopted interface.