SQL is the most pervasive way to make data analysis queries. However, doing advanced logic, like loops, conditions, ... is often difficult in SQL. There are some options like stored queries, but they require learning new languages.
DSS lets you run SQL queries directly from a Python recipe. This lets you:
In this tutorial, we are going to use this ability to analyse a dataset from the San Francisco airport.
This dataset contains data about the cargo aircraft landing at SFO. You can find more information about it here: https://data.sfgov.org/Transportation/SFO-Passenger-Cargo-Aircraft-Landing-Datasets/zax7-gzc9
Note: this dataset is distributed under the CC-0 license.
You can either start this tutorial from the DSS interface or create a new project to follow this dataset.
If you use the tutorial within DSS, the tutorial project comes with the SFO airport dataset. Else, you'll need to download the dataset from the previous URL.
You'll need a SQL database configured in DSS. We are not going to use very advanced SQL, so any supported SQL database will do.
The dataset records the number of cargo landings and total landed cargo weight at the SFO airport. The dataset contains one record for each month, airline, type and details of aircraft. The line includes the number of aircrafts of this type of this company that landed this month, and the number of records.
The dataset contains data from 2005 to 2015.
What we would like to do is write a recipe to obtain, for each month and airline, a breakdown in several columns of the total landing weight by aircraft manufacturer. In essence, that's a kind of crosstab / pivot table.
To make things a bit more complex, we are not interested in the small aircraft manufacturers (there are more than 15 in the original data). We only want the top 5 manufacturers. And by top, we mean, "with the highest count of landings".
With these constraints, doing that in SQL only would be fairly complex. Let's do it with a little bit of Python calling SQL.
|Month||Airline||Aircraft type||Landings||Total weight|
|201501||United||Total Boeing weight||Total Airbus Weight||Total Mc. Donnel weight|
Your DSS project has the source data in a Filesystem (or Uploaded) dataset. As this is an import from a CSV file, DSS has not automatically typed the input columns: if you go to the settings of the dataset, you'll see the columns declared as string.
We could set the types manually in the dataset, but we could also let the Preparation recipe of DSS do it. Since anyway we need to copy our dataset to a SQL database, using a Prepare recipe instead of a Sync recipe will give us the typing inference for free.
The dataset is actually fairly clean, so we won't need any actual preparation step.
If you look at your preparation recipe data table, you'll see that DSS has colored some cells in red. That's because DSS thinks that the meaning of the IATA columns is a Country (since many values are valid countries). DSS is wrong on this, since not all IATA codes are valid countries.
We could click on the column header and click Change meaning to tell DSS that this is only text. However, note that DSS has already selected string as the storage type (since a Country name can only be stored as string anyway). Fixing the meaning makes for a cleaner recipe but is not strictly mandatory
Run our preparation recipe (Select "Update Schema" when prompted).
Everything is going well, so we now how our properly typed dataset in our SQL database. Let's create a Python recipe, and let's create an output dataset
sfo_pivot in the same SQL connection.
We want to do our processing in two steps:
In order to create the column per manufacturer, we're going to use a small trick: "CASE WHEN".
This SQL construct allows you to create conditional operations. To make a column with the total landing weights of Airbus planes only, here is how we would use it:
For each row, if it is Airbus, we sum the weight, else we sum 0.
So we now have a dataframe withe manufacturers, let's use a small Python loop to generate these pesky case when
To finish, we only need to build the final query, execute it, get a dataframe, and store the result in the output
We can now run and have a look at our output dataset!
The output is exactly what we wanted.
Let's not resist making a chart. Let's do a
Stacked columns chart like this:
Add in "Tooltip" the
airline_count (SUM) column, click on
Operating Airline, and elect to sort by descending airline_count.
We obtain the following chart.
Fairly unsurprisingly, Boeing is very dominant, and most airlines are mono-carrier. However, United Airlines has a non-neglictible Airbus cargo fleet.
In this first version, we executed both queries using "querytodf", meaning that the Python code actually received the whole data in memory, and sent it back to the database for storing the output dataset.
It would be better (and critical in the case of big datasets) that the "final" query be performed fully in-database.
Fortunately, DSS makes that easy. Executing SQL queries in-database and handling the work of dropping/creating table is what the SQL query does. The
SQLExecutor2 class lets you run a query "as-if" it was a SQL Query recipe.
Let's just replace the final 3 lines of the code by:
And re-run. Everything works the same, but now, the data has not been streamed to Python. Everything stayed in the database.