Along with PySpark and SparkSQL, DSS 2.1 brings support for SparkR, a Spark module built to interact with DataFrames through R. This short article shows you how to use it.
To follow along, you need access to DSS version 2.1 or later, with Spark enabled, and a working installation of Spark, version 1.4 or later (a very recent version of Spark is required to take full advantage of SparkR).
We’ll be using the Titanic dataset (here from a Kaggle contest), so make sure to first create a new DSS dataset and parse it into a suitable format for analysis.
The best way to discover both your dataset and the SparkR API interactively is to use a Jupyter Notebook. From the top navigation bar of DSS, click on Notebook, select R, and choose the pre-filled “Template:Starter code for processing with SparkR”.
A Notebook opens. Using the template code, you can quickly load your DSS dataset into a SparkR DataFrame.
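As a rough idea of what the starter code looks like, here is a minimal sketch. It assumes the SparkR 1.x initialization functions (`sparkR.init`, `sparkRSQL.init`) and the `dkuSparkReadDataset` helper from the `dataiku.spark` R package. The dataset name `titanic` is a placeholder, and the template shipped with your DSS version may differ.

```r
library(SparkR)
library(dataiku)
library(dataiku.spark)

# Create the Spark context and the SQL context (SparkR 1.x style)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Read a DSS dataset into a SparkR DataFrame.
# "titanic" is an assumed name: replace it with your own dataset's name.
df <- dkuSparkReadDataset(sqlContext, "titanic")
```

The result, `df`, is a distributed SparkR DataFrame rather than a local R `data.frame`, so nothing is pulled into the R session until you explicitly ask for rows.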
Now that your DataFrame is loaded, you can start using the SparkR API to explore it. As with the PySpark API, SparkR provides a number of useful functions for inspecting a DataFrame.
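The following sketch illustrates a few of these exploration functions. It assumes that `df` is the SparkR DataFrame read from the Titanic dataset, and that the dataset kept the original Kaggle column names (`Name`, `Sex`, `Age`, and so on) with `Age` parsed as a number. Adjust the names if your dataset differs.

```r
library(SparkR)

# `df` is assumed to be a SparkR DataFrame already loaded from DSS.

# Column names and types
printSchema(df)

# Number of rows
count(df)

# First rows, returned as a local R data.frame
head(df)

# Keep only a few columns
head(select(df, "Name", "Sex", "Age"))

# Keep only the rows matching a condition
head(filter(df, df$Age > 60))
```

Wrapping `select` and `filter` in `head` keeps the output small: only the first few rows are brought back to the R session.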
SparkR also has functions to compute aggregates.
Be sure to check the official documentation regularly to stay current with the latest improvements to the SparkR API.
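As an illustration, an aggregate over the Titanic data could be sketched as follows. It again assumes that `df` is the SparkR DataFrame loaded earlier and that the Kaggle columns `PassengerId`, `Pclass` and `Survived` are present, with `Survived` stored as a 0/1 number. The name `by_class` is only illustrative.

```r
library(SparkR)

# `df` is assumed to be a SparkR DataFrame already loaded from DSS.

# Group passengers by class, then count them and average the 0/1
# Survived flag, which gives the survival rate per class.
by_class <- summarize(
  groupBy(df, df$Pclass),
  passengers = n(df$PassengerId),
  survival_rate = avg(df$Survived)
)

# Show the result ordered by class
head(arrange(by_class, by_class$Pclass))
```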
Assuming you are ready to deploy your SparkR script, let’s switch to the Flow screen and create a new SparkR recipe:
Specify the recipe inputs and outputs, then paste your R code into the code editor.
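A complete recipe could look roughly like the sketch below. It assumes an input dataset named `titanic` and an output dataset named `titanic_by_class` (both hypothetical names that must match the inputs and outputs declared for the recipe), and it relies on the `dkuSparkReadDataset` and `dkuSparkWriteDataset` helpers from the `dataiku.spark` package.

```r
library(SparkR)
library(dataiku)
library(dataiku.spark)

# Create the Spark context and the SQL context (SparkR 1.x style)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Read the recipe input ("titanic" is an assumed dataset name)
df <- dkuSparkReadDataset(sqlContext, "titanic")

# Compute the survival rate per passenger class
by_class <- summarize(
  groupBy(df, df$Pclass),
  survival_rate = avg(df$Survived)
)

# Write the result to the recipe output
# ("titanic_by_class" is an assumed dataset name)
dkuSparkWriteDataset(by_class, "titanic_by_class")
```

Compared with the notebook, the only real addition is the final write step, which stores the resulting DataFrame in the output dataset.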
Your recipe is now ready. Just click the Run button and wait for your job to complete:
That’s it for this short intro! With SparkR now part of DSS, you can develop and manage entirely Spark-based workflows using the language of your choice.