
Using SparkR in DSS

September 29, 2015

Along with PySpark and SparkSQL, DSS 2.1 brings support for SparkR, a Spark module built to interact with DataFrames through R. This short article shows you how to use it in DSS.

Prerequisites

You have access to DSS 2.1 or later, with Spark enabled, and a working installation of Spark 1.4 or later (a fairly recent version of Spark is required to take full advantage of SparkR).

We'll be using the Titanic dataset (here from a Kaggle contest), so make sure to first create a new DSS dataset and parse it into a suitable format for analysis.

Using SparkR interactively in Jupyter Notebooks

The best way to discover both your dataset and the SparkR API interactively is to use a Jupyter Notebook. From the top navigation bar of DSS, click on Notebook, and create a new R notebook pre-filled with the "Starter code for processing with SparkR" template:

A notebook shows up. Leveraging the template code, you can quickly load your DSS dataset into a SparkR DataFrame:

library(SparkR)
library(dataiku)
library(dataiku.spark)

# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Load the DSS dataset into a Spark DataFrame
titanic <- dkuSparkReadDataset(sqlContext, "titanic")

Now that your DataFrame is loaded, you can start exploring it with the SparkR API. As with the PySpark API, SparkR provides some useful functions:

# How many records in the dataframe?
nrow(titanic)

# What's the exact schema of the dataframe?
schema(titanic)

# What's the content of the dataframe?
head(titanic)
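
For instance, here is a minimal sketch of filtering and selecting columns. The column names used here (Name, Sex, Age) follow the Kaggle Titanic dataset, and we assume Survived was parsed as a numeric column:

# Keep only the surviving passengers
survivors <- filter(titanic, titanic$Survived == 1)

# Look at a few columns of interest
head(select(survivors, survivors$Name, survivors$Sex, survivors$Age, survivors$Fare))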

Also, SparkR has functions to create aggregates:

head(
  summarize(
    groupBy(titanic, titanic$Survived), 
    counts = n(titanic$PassengerId),
    fares = avg(titanic$Fare)
  )
)
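
If you want to work on a small aggregated result with standard R packages, for example to plot it locally, you can bring it back to your R session with collect(). A minimal sketch, reusing the aggregation above:

# Compute the same aggregate as above
agg <- summarize(
         groupBy(titanic, titanic$Survived),
         counts = n(titanic$PassengerId),
         fares = avg(titanic$Fare)
       )

# Collect the (small) result into a regular, local R data.frame
agg_local <- collect(agg)
print(agg_local)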

Of course, make sure to check the official SparkR documentation regularly to stay current with the latest improvements of the API.

Integrating SparkR recipes in your workflow

Once you are ready to deploy your SparkR script, switch to the Flow screen and create a new SparkR recipe:

Specify the recipe inputs and outputs, then, in the code editor, copy and paste your R code:

library(SparkR)
library(dataiku)
library(dataiku.spark)

# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Read input datasets
titanic <- dkuSparkReadDataset(sqlContext, "titanic")

# Aggregation
agg <- summarize(
         groupBy(titanic, titanic$Survived), 
         counts = n(titanic$PassengerId),
         fares = avg(titanic$Fare)
       )

# Output datasets
dkuSparkWriteDataset(agg, "titanicr")


Your recipe is now ready. Just click the Run button and wait for your job to complete:

That's it for this short intro! With SparkR now part of DSS, you can develop and manage fully Spark-based workflows using the language of your choice.