
Using R's dplyr with DSS

July 03, 2015


If you're like me, you love R and everything in the hadleyverse. It's the perfect language for manipulating, plotting and modeling data. Alas, not all of our colleagues are fluent in R, and we have to be able to share and deploy our R scripts. This is where Data Science Studio comes to the rescue.

I'm going to walk through how to include an R recipe in a DSS project. Don't worry: you can keep using the dplyr workflow that you love while making your R scripts accessible to everyone on your team.

Data

I'm using the nycflights13 dataset, which contains the delays for all flights that departed New York City in 2013. Getting the data into DSS is super easy; check out our documentation if you're unsure how to do this.
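If you want to follow along, a minimal sketch of the export step (assuming the nycflights13 package is installed; the file name is my choice) dumps the flights table to a CSV you can upload to DSS:

```r
# Export the nycflights13 flights table to a CSV for upload into DSS.
# Assumes install.packages("nycflights13") has been run.
library(nycflights13)

write.csv(flights, "flights.csv", row.names = FALSE)
```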

Once the dataset is loaded into DSS, you should see something like this: one lonely dataset and no recipes.

flow1

Baby steps: your first R recipe

For the first R recipe, I'm going to do an aggregation to find the average delay for each airline.

To create a new R recipe, click on the dataset that you want to work with (in our case the flights dataset) and then the R symbol on the right side of the screen.

selecting_R1

Then select the R recipe option ("Create new datasets using R code").

selecting_R2

Next, specify the input and output datasets of the R recipe. DSS has already filled in the input of the recipe. I'm going to create a new dataset for the output, which I'm calling airline_delay.

creating_new_dataset

Once we click "Create", DSS pulls up a text editor where we can write our R script. DSS has provided starter code that reads in our input dataset and writes out the output dataset. We just have to write the code that connects those two datasets.

library(dataiku)

# Input datasets
flights <- read.dataset("flights")

# Output datasets
write.dataset_with_schema(airline_delay,"airline_delay")

I'm going to aggregate the dataset to find the average delay for each carrier, and arrange the data so that the carriers with the worst delays are on top. The code for the recipe goes between the calls to read.dataset() and write.dataset_with_schema(). If this code is confusing, check out Hadley Wickham's amazing dplyr tutorials.

library(dataiku)
library(dplyr)

# Input datasets
flights <- read.dataset("flights")

airline_delay <- flights %>%
    group_by(carrier) %>%
    summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
    arrange(desc(avg_delay))

# Output datasets
write.dataset_with_schema(airline_delay,"airline_delay")
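If the pipe chain feels opaque, here is a base-R sketch of the same aggregation on a toy stand-in table (made-up carriers and delays, not the real flights data), so you can see exactly what group_by(), summarize(), and arrange() are doing:

```r
# Toy stand-in for the flights table (fabricated rows for illustration)
flights_toy <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  arr_delay = c(10, NA, 30, 50, 5)
)

# group_by(carrier) + summarize(mean(arr_delay)) in base R
airline_delay <- aggregate(
  arr_delay ~ carrier, data = flights_toy,
  FUN = mean, na.rm = TRUE, na.action = na.pass
)
names(airline_delay)[2] <- "avg_delay"

# arrange(desc(avg_delay)): worst average delays on top
airline_delay <- airline_delay[order(-airline_delay$avg_delay), ]
```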

In the browser, the final script looks like this:

final_script

If we click "Run" at the bottom of the page, DSS will create an R process and execute the code. Now, when we go back to our flow, we see two datasets connected by our R script.

flow2

Nice! Now we're getting somewhere. We're starting to develop a flow of datasets and the scripts that connect them. You can double-click the airline_delay dataset to see which airlines have the best and worst delays on flights out of NYC.

A more complicated recipe

Let's keep going! For a lot of real-world applications, you're going to have multiple inputs and multiple outputs. Let's enrich our airline data with geolocation information.

I uploaded airports.csv from ourairports.com into DSS, so now we have geolocation information for each destination airport.

flow3

I'm going to create a new R recipe. This time, I'm going to use two inputs (flights and airports) and I'm going to create three outputs (avg_delay_country, avg_delay_region and avg_delay_municipality).

creating_new_dataset2

I'm going to aggregate arrival delays by country, region and municipality. Again, DSS creates some starter code for us:

library(dataiku)

# Input datasets
flights <- read.dataset("flights")
airports <- read.dataset("airports")

# Output datasets
write.dataset_with_schema(avg_delay_country,"avg_delay_country")
write.dataset_with_schema(avg_delay_region,"avg_delay_region")
write.dataset_with_schema(avg_delay_municipality,"avg_delay_municipality")

We just have to write the code that transforms the inputs into the outputs.

library(dataiku)
library(dplyr)

# Input datasets
flights <- read.dataset("flights")
airports <- read.dataset("airports")

tmp <- flights %>%
    mutate(ident = dest) %>%
    inner_join(airports, by = "ident")

# calculate delay by country
avg_delay_country <- tmp %>%
    group_by(iso_country) %>%
    summarize(delay = mean(arr_delay, na.rm = TRUE))

# calculate delay by region
avg_delay_region <- tmp %>%
    group_by(iso_region) %>%
    summarize(delay = mean(arr_delay, na.rm = TRUE))

# calculate delay by municipality
avg_delay_municipality <- tmp %>%
    group_by(municipality) %>%
    summarize(delay = mean(arr_delay, na.rm = TRUE))

# Output datasets
write.dataset_with_schema(avg_delay_country,"avg_delay_country")
write.dataset_with_schema(avg_delay_region,"avg_delay_region")
write.dataset_with_schema(avg_delay_municipality,"avg_delay_municipality")

In the browser, it looks like this.

final_script2
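The join is the only new ingredient here. A toy base-R sketch (fabricated rows; merge() standing in for inner_join()) shows the mutate-then-join-then-aggregate pattern the recipe uses:

```r
# Made-up miniature versions of the two input datasets
flights_toy  <- data.frame(dest = c("JFK", "LAX", "JFK"),
                           arr_delay = c(12, 40, 8))
airports_toy <- data.frame(ident = c("JFK", "LAX"),
                           iso_country = c("US", "US"),
                           iso_region  = c("US-NY", "US-CA"))

# mutate(ident = dest): create the join key the airports table expects
flights_toy$ident <- flights_toy$dest

# inner_join(airports): merge() keeps only matching rows by default
tmp <- merge(flights_toy, airports_toy, by = "ident")

# group_by(iso_region) + summarize(mean(arr_delay)), as in the recipe
avg_delay_region <- aggregate(arr_delay ~ iso_region, data = tmp, FUN = mean)
```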

Now, when we return to the flow, we see our new R recipe. The recipe takes in two inputs and produces three outputs.

flow4

Awesome! Now, we have a network of datasets and R recipes connecting them. Your colleagues, who may not be R aficionados, can now access your work and conduct their own analyses in R, Python, SQL or through the DSS graphical user interface.

If this tutorial was helpful, you can check out our other tutorials on data science 101, machine learning with DSS, or building a webapp with DSS.