
RMarkdown Reports

Applies to DSS 4.1 and above | December 18, 2017

R Markdown is an R package for creating print-quality documents that embed code together with the output it produces. The resulting reports can be shared on dashboards or delivered in a variety of formats for offline reading.

In this post, we’ll create a simple R Markdown report in Dataiku DSS.

Prerequisites

You will need:

  • The Orders_by_customer dataset. This can be found in the project DSS Tutorials > Automation > Deployment or you can download the data and import it into a new project.
  • A Dataiku instance with R integration set up
  • An R code environment with the ggplot2 and magrittr packages, in addition to the required dplyr and dataiku packages.
  • An installation of pandoc, in order to download reports as PDFs, together with a LaTeX distribution that provides the adjustbox, collectbox, ucs, collection-fontsrecommended, and titling packages.

Creating a New R Markdown Report

Create a new empty R Markdown report:

  1. In the top navigation bar, select Lab - Notebooks > R Markdown Reports
  2. Click + New Report
  3. Choose Empty document and type a name for the report

Labs - Notebooks > R Markdown Reports navigation

You will be redirected to the R Markdown editor.

The R Markdown Editor

The R Markdown editor is divided into two panes.

R Markdown editor; empty

The left pane allows you to see and edit the markdown (including code) underlying the report.

The right pane gives you several views on the report.

  • The Preview tab allows you to write and test your markdown in the left pane while having immediate visual feedback in the right pane. At any time you can save or reload your current markdown by clicking on the Save button.
  • The Log is useful for troubleshooting problems.
  • Settings allows you to set the output format of the preview. You can also set the code environment for this report, if you want it to be different from the project default.

Writing an R Markdown Report

Let’s build the markdown and code behind the report.

Defining the Document Metadata

In the left pane, insert the following code to define document properties, including the title, author name, date the report was generated, and how to handle certain types of output.

---
title: "Haiku T-Shirt Analytics"
author: "Dataiku Learn"
date: "`r format(Sys.Date())`"
output:
    pdf_document:
        toc: true
---

Note that:

  • The three dashes, ---, demarcate the beginning and end of the document metadata
  • The report date specification uses R code to insert the current system date
  • The toc: true setting adds a table of contents when PDF output is generated for this report
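The inline date expression can be tried on its own in any R session. The sketch below uses only base R; the fixed date and the day/month/year format string are illustrative choices, not part of the original report.

```r
# Sys.Date() returns a Date object; format() converts it to a string.
# With no format string the result is ISO 8601 (YYYY-MM-DD).
report_date <- format(Sys.Date())

# A fixed date (illustrative) makes the effect of a format string easy to see
fixed <- as.Date("2017-12-18")
format(fixed)              # "2017-12-18"
format(fixed, "%d/%m/%Y")  # "18/12/2017"
```

If a different date style is wanted in the report header, the same format string could be passed as the second argument of the inline format() call.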

Importing the Necessary Packages

Now, in the left pane, insert the following code to import the R packages that will be used to generate report output.

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Pull the necessary libraries
library(dataiku)
library(magrittr)
library(ggplot2)
library(dplyr)
```
  • The three backticks demarcate the beginning and end of a code block. The {r} marker indicates that the block contains R code. Setting echo=FALSE keeps the code itself out of the output, and setting warning=FALSE and message=FALSE keeps those types of output out of the report.
  • We are using the ggplot2, dplyr, magrittr, and dataiku R libraries
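Because every chunk in this report repeats the same three options, they could instead be set once as defaults. The following is a minimal sketch, assuming the knitr package (which R Markdown rendering relies on) is available in the code environment; it is an alternative to the per-chunk options used in this tutorial, not a step the tutorial requires.

```r
library(knitr)

# Apply the same options to every chunk that follows in the document
opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

# Confirm the default that later chunks will inherit
opts_chunk$get("echo")
```

This code would go in its own chunk near the top of the report; chunks that need different behavior can still override an option in their own header.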

Report Introduction and Data Import

In the left pane, insert the following code.

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Read the Dataiku dataset we want to use
df <- dkuReadDataset("Orders_by_customer", samplingMethod="head", nbRows=1000000)
```

This report is prepared for the executives of the Haiku T-Shirt company to apprise them of the current state of customer analytics.
  • We use the dkuReadDataset() function to read the Orders_by_customer dataset in the same way we would in an R code recipe
  • Outside of the code blocks, text forms the body of the report
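Before building charts, it can help to confirm that the data frame contains the columns the later chunks use. The sketch below runs outside DSS: the small df is a hypothetical stand-in for the result of dkuReadDataset(), with invented values, and the check itself is a suggestion rather than part of the original report.

```r
# Hypothetical stand-in for the data frame returned by dkuReadDataset()
df <- data.frame(
  ip_address_country = c("United States", "China", NA),
  campaign = c(TRUE, FALSE, TRUE),
  gender = c("F", "M", "F"),
  total_sum = c(120.5, 40, 310),
  stringsAsFactors = FALSE
)

# Columns that the later chunks of this report rely on
required_cols <- c("ip_address_country", "campaign", "gender", "total_sum")
missing_cols <- setdiff(required_cols, names(df))

# Stop early with a readable message if a column is absent
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Number of rows with no country, which the chart later labels "Unknown"
n_unknown <- sum(is.na(df$ip_address_country))
```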

Basic Reporting on Customer Location

In the left pane, insert the following code.

# Customers by Country

The following bar chart shows that:

- the United States is our largest market
- the agglomeration of all other countries where we have fewer than 100 customers accounts for more business than any other single market
- China is the next largest market

```{r echo=FALSE, warning=FALSE, message=FALSE}
df %>%
    count(ip_address_country) %>%
    filter(n>=100) -> country_count

df %>%
    count(ip_address_country) %>%
    filter(n<100) %>%
    summarize(ip_address_country="Others",n=sum(n))%>%
    bind_rows(country_count) -> country_count

country_count$ip_address_country[is.na(country_count$ip_address_country)] <- "Unknown"
country_count$ip_address_country <- factor(country_count$ip_address_country, levels=country_count$ip_address_country[order(country_count$n)])

country_count %>%
    ggplot(aes(ip_address_country,n,fill=n)) + geom_bar(stat="identity") + coord_flip()

```
  • The hash sign, #, is the markdown syntax for a heading
  • The text that explains the chart uses the markdown - prefix to create a bulleted list
  • The R code produces the chart in several steps:
df %>%
    count(ip_address_country) %>%
    filter(n>=100) -> country_count

– Processes the raw data frame to count the number of customers in each country, filtering out all countries with fewer than 100 customers, and saving to a country_count data frame

df %>%
    count(ip_address_country) %>%
    filter(n<100) %>%
    summarize(ip_address_country="Others",n=sum(n))%>%
    bind_rows(country_count) -> country_count

– Counts the total number of customers across countries with fewer than 100 customers each, and adds them as an extra row in the country_count data frame

country_count$ip_address_country[is.na(country_count$ip_address_country)] <- "Unknown"
country_count$ip_address_country <- factor(country_count$ip_address_country, levels=country_count$ip_address_country[order(country_count$n)])

– Recodes the NA values for customers whose country is unknown to the string “Unknown”

– Reorders the factor levels of the column ip_address_country by ascending customer count, so that after the axes are flipped the country with the most customers appears at the top of the chart and the one with the least at the bottom

country_count %>%
    ggplot(aes(ip_address_country,n,fill=n)) + geom_bar(stat="identity") + coord_flip()

– Creates the bar chart of number of customers per country, with the coordinate axis flipped so that the bars are horizontal rather than vertical
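The aggregation steps can be traced on a toy data frame outside DSS. This is a sketch under stated assumptions: the toy rows are invented, and the min_customers variable (hypothetical, set to 2 here) replaces the threshold of 100 used in the report so that the small example still produces an "Others" row.

```r
library(dplyr)

# Invented stand-in for the country column of Orders_by_customer
toy <- data.frame(
  ip_address_country = c(rep("United States", 5), rep("China", 4),
                         "France", "Spain", "Germany", NA, NA),
  stringsAsFactors = FALSE
)
min_customers <- 2  # the report uses 100; lowered for the toy data

# Countries at or above the threshold keep their own row
big <- toy %>%
  count(ip_address_country) %>%
  filter(n >= min_customers)

# Countries below the threshold are collapsed into a single "Others" row
country_count <- toy %>%
  count(ip_address_country) %>%
  filter(n < min_customers) %>%
  summarize(ip_address_country = "Others", n = sum(n)) %>%
  bind_rows(big)

# Label the missing country, then order factor levels by ascending count
country_count$ip_address_country[is.na(country_count$ip_address_country)] <- "Unknown"
country_count$ip_address_country <- factor(
  country_count$ip_address_country,
  levels = country_count$ip_address_country[order(country_count$n)]
)

levels(country_count$ip_address_country)
# "Unknown" "Others" "China" "United States"
```

The levels run from the fewest customers to the most; since coord_flip() draws the first level at the bottom, the largest market ends up as the top bar.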

Reporting on Customer Lifetime Spending

In the left pane, insert the following markdown and code.

# Customer Lifetime Spending

A quick look at the amount spent by customers shows that those targeted by the company's marketing campaign tend to spend much more than those who aren't.  There does not appear to be a significant difference between genders.

```{r echo=FALSE, warning=FALSE, message=FALSE}
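# Average lifetime spending per campaign group, split by gender.
# Note: fun.y is deprecated as of ggplot2 3.3.0; on newer versions use fun="mean" instead.
df %>%
    ggplot(aes(campaign, total_sum,fill=gender)) + geom_bar(stat="summary",fun.y="mean",position="dodge") +
    scale_y_continuous(name="Customer lifetime spending")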
```

The R code in this section produces another bar chart, showing the average lifetime amount spent by customers, broken down by gender and whether they are part of the company’s marketing campaign.
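The heights of the bars are group means of total_sum, which can be checked numerically. The sketch below computes the same means with dplyr on a toy data frame; the rows are invented, and treating campaign as a logical column is an assumption made for illustration.

```r
library(dplyr)

# Invented stand-in for the columns used by the spending chart
toy <- data.frame(
  campaign = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  gender = c("F", "M", "F", "M", "F", "M"),
  total_sum = c(200, 180, 50, 70, 220, 30),
  stringsAsFactors = FALSE
)

# One row per campaign/gender pair, holding the mean that the bar would show
spend_summary <- toy %>%
  group_by(campaign, gender) %>%
  summarize(mean_spend = mean(total_sum)) %>%
  ungroup()

spend_summary
```

A table like this could also be printed in the report next to the chart for readers who want the exact figures.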

R Markdown report

Publishing an R Markdown Report

When you are done editing, there are a number of options for distributing your report.

  • Publish on a dashboard from the Actions dropdown at the top-right corner of the screen
  • Download to your local filesystem in one of a variety of formats, again from the Actions dropdown
  • Email as part of an automation scenario

What’s Next

Using Dataiku DSS, you have created an R Markdown report.

You can examine a completed version of this report on the Dataiku gallery.

For further inspiration on what is possible in R Markdown reports, see the R Markdown gallery (external).