Out of the numerous ways to interact with Spark, the DataFrames API, introduced back in Spark 1.3, offers a very convenient way to do data science on Spark with Python (thanks to the PySpark module), as it emulates several functions of the widely used pandas package. Let's see how to use it in DSS in the short article below.
This article assumes you have access to DSS version 2.1 or later with Spark enabled, and a working Spark installation, version 1.4 or later.
We'll use the MovieLens 1M dataset, made up of three parts: ratings, movies and users. Start by downloading and creating these datasets in DSS, then parse them using a Visual Data Preparation script to make them suitable for analysis:
You should end up with 3 datasets:
As with regular Python, you can use Jupyter, directly embedded in DSS, to interactively analyze your datasets.
Go to the Notebook section from the DSS top navbar, click New Notebook, and choose Python. In the modal window that appears, select Template: Starter code for processing with PySpark:
You are taken to a new Jupyter notebook, conveniently filled with starter code:
Let's start by loading our DSS datasets into an interactive PySpark session, and store them in DataFrames.
DSS does the heavy lifting in terms of "plumbing", so loading the Datasets into DataFrames is as easy as typing the following lines of code in your Jupyter notebook:
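A sketch of what this looks like, based on the DSS PySpark API. It assumes your three datasets are named users, movies and ratings, and it only runs inside the DSS runtime, where the notebook's starter code already creates the Spark context for you:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()          # already done for you by the starter code
sqlContext = SQLContext(sc)

# Load each DSS dataset into a Spark DataFrame
users = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("users"))
movies = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("movies"))
ratings = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("ratings"))
```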
The hardest part is done. You can now work with your DataFrames through the regular Spark API.
A Spark DataFrame has several interesting methods for exploring its content. For instance, let's have a look at the number of records in each dataset:
```
DataFrame users has 6040 records
DataFrame movies has 3883 records
DataFrame ratings has 1000209 records
```
You may also want to look at the actual content of your dataset using the .show() method:
To check only the column names in your DataFrame, use the columns attribute:
The printSchema() method gives more details about the DataFrame's schema and structure:
Note that the DataFrame schema directly inherits from the DSS Dataset schema, which comes in very handy when you need to centrally manage your datasets and their associated metadata!
Let's now have a look at more advanced functions, starting with joining the datasets together to build a consolidated view.
Let's assume you need to rescale the users ratings by removing their average rating value:
How do the rescaled ratings differ by occupation code? Don't forget that you are in a regular Jupyter / Python session, meaning that you can use non-Spark functionality to analyze your data.
Finally, once your interactive session is over and you are happy with the results, you may want to automate your workflow.
First, download your Jupyter notebook as a regular Python file to your local computer, using File => Download as... in the notebook menu.
Go back to the Flow screen, left click on the ratings dataset, and in the right pane, choose PySpark:
Select the 3 MovieLens datasets as inputs, and create a new output dataset called agregates on the machine's filesystem:
In the recipe code editor, copy/paste the content of the downloaded Python file, and add the output dataset:
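The only addition the recipe needs is writing the result out. A sketch based on the DSS PySpark API: it assumes the consolidated DataFrame computed by the pasted notebook code is called final_df (a hypothetical name), and it only runs inside the DSS runtime:

```python
import dataiku
import dataiku.spark as dkuspark

# final_df is the DataFrame produced by the notebook code pasted above
# (hypothetical name). Write it to the agregates output dataset:
agregates = dataiku.Dataset("agregates")
dkuspark.write_with_schema(agregates, final_df)
```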
Hit the green Run button. The Spark job launches and completes (check your job's logs to make sure everything went fine).
Your flow is now complete:
Using PySpark and Spark's DataFrame API in DSS is really easy. This opens up great opportunities for doing data science in Spark and for creating large-scale, complex analytical workflows. You can also check this exhaustive post from Databricks, or drop us an email if you want to do more with your data using Spark and DSS.