The Data Science Studio makes it possible to create an end-to-end analytical workflow using APIs and services from different providers.
Think about it. How many times did you have to:
As Dataiku Data Science Studio (DSS) users encounter this kind of scenario very frequently, I'm going to take you through a complete example of how to create an automated workflow in which you'll end up mixing external and internal data.
We are going to show, on a single map, incident and crime data in San Francisco, as well as graffiti and business locations (because, hey, who knows, there might be some correlations!).
In DSS, I created a new project called "Openness" which will hold all the work. The complete data processing workflow for this project, once completed, will look like this:
This example is interesting because it is very similar to the various use-cases described above:
It sounds like we have it all, so let's dig into the details!
There are basically two ways to get external data into DSS.
The first one is to use HTTP Datasets, which will let you download and work with any CSV or even simple JSON files, as long as they are reachable via a URL.
One example is the "SFPD Incidents" dataset, visible on the left of the diagram above. The dataset's "Settings" panel shows what to do: enter the URL of the file, then hit Download. Once the download completes, DSS will sniff its format and let you use it like any other dataset:
The second way is for when you need access to more complex datasets, or to write custom logic. In this case, you can use a Python recipe with no input dataset. Let's look, for instance, at the "get graffittis" recipe, still on the left of the diagram:
This recipe is a "no input" recipe that calls external APIs, which makes it a bit special but very powerful when it comes to using external data. Basically, as long as you can write your code in Python, you will be able to access virtually any kind of data.
What we do here:
At this stage, we have fetched our raw datasets, and we are now ready to move further.
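To make the "no input" recipe idea concrete, here is a minimal sketch of what such a recipe could look like. The endpoint URL and field names are assumptions for illustration, not the actual ones used in the project; in DSS, the final step would write the rows to the recipe's output dataset.

```python
# Sketch of a "no input" Python recipe: fetch records from an open-data
# API and shape them into rows for a DSS dataset. The URL and the record
# fields below are hypothetical placeholders.
import json
from urllib.request import urlopen

API_URL = "https://data.sfgov.org/resource/graffiti.json"  # hypothetical endpoint

def fetch_records(url=API_URL):
    """Download the raw JSON records from the API."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def records_to_rows(records):
    """Keep only records with coordinates, as (category, lat, lon) rows."""
    rows = []
    for rec in records:
        point = rec.get("point") or {}
        if "latitude" in point and "longitude" in point:
            rows.append((rec.get("category", "unknown"),
                         float(point["latitude"]),
                         float(point["longitude"])))
    return rows

# In DSS, the recipe would then write `rows` to its output dataset
# (e.g. via the dataiku package's Dataset writer).
```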
We now need to stack all the datasets together. The objective is to produce a new dataset with the structure required to push to CartoDB.
Without going into too much detail, building a map often requires the dataset to have "real" geographic attributes. CartoDB, for instance, works well with coordinates encoded as a Point, as per the PostGIS definition. To make sure everything works smoothly, we'll use a PostGIS-enabled PostgreSQL database to create our dataset.
Moving datasets around can be done in several ways in DSS, but here we are going to use "Preparation Scripts". These scripts will:
For instance, if we look at the "compute incidents t" recipe, we start from a local text file, create a preparation script interactively, apply it to the whole dataset, and store the output in PostgreSQL:
That's it: a few clicks to enrich and clean your datasets, and one click to push them to a PostgreSQL database. Nice, right?
Now that the datasets are stored in PostgreSQL, the last step is to create the proper data structure for CartoDB. This is done via the last SQL recipe:
This simply stacks together the 3 sources and, more importantly, creates a new Point column, stored as a proper geographic attribute (a "geom") using a well-defined geographic projection. This step produces the "map data" dataset, which we'll use to create our map.
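To illustrate the shape of that final SQL recipe, here is a sketch that builds the kind of query described above, using the standard PostGIS functions for creating a point geometry. Table and column names are assumptions for illustration, not the project's actual schema.

```python
# Sketch of the stacking + geometry step: generate the SQL that UNIONs the
# source tables and adds a PostGIS Point column to each. Table and column
# names are hypothetical placeholders.

def point_expression(lon_col, lat_col, srid=4326):
    """PostGIS expression building a geometry point in the given projection."""
    return f"ST_SetSRID(ST_MakePoint({lon_col}, {lat_col}), {srid}) AS geom"

def stacking_query(tables, lon_col="longitude", lat_col="latitude"):
    """UNION ALL the source tables, adding the geom column to each SELECT."""
    selects = [
        f"SELECT source, category, {point_expression(lon_col, lat_col)} FROM {t}"
        for t in tables
    ]
    return "\nUNION ALL\n".join(selects)

print(stacking_query(["incidents_t", "crimes_t", "graffitis_t"]))
```

SRID 4326 is the usual WGS 84 latitude/longitude projection, which is what CartoDB expects for imported geometries.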
Our dataset is ready to be published! So how do we do it? Once again, a Python recipe will do the job:
This one is a bit hackish, but who doesn't like a tool that can be hacked a little bit? :) Schematically, what we do:
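As a rough sketch of what such a publishing recipe can look like, the snippet below builds an INSERT statement and the corresponding CartoDB SQL API call. The account name, table name and API key are placeholders, and the request is only constructed here, not sent.

```python
# Sketch of pushing rows to CartoDB through its SQL API.
# CARTODB_ACCOUNT, the table name and API_KEY are placeholders.
from urllib.parse import urlencode

CARTODB_ACCOUNT = "my-account"   # placeholder
API_KEY = "secret-api-key"       # placeholder

def insert_statement(table, rows):
    """Build one INSERT pushing (source, category, lon, lat) rows."""
    values = ", ".join(
        f"('{src}', '{cat}', ST_SetSRID(ST_MakePoint({lon}, {lat}), 4326))"
        for src, cat, lon, lat in rows
    )
    return f"INSERT INTO {table} (source, category, the_geom) VALUES {values}"

def sql_api_url(query):
    """URL of the CartoDB SQL API call for this query."""
    base = f"https://{CARTODB_ACCOUNT}.cartodb.com/api/v2/sql"
    return base + "?" + urlencode({"q": query, "api_key": API_KEY})

# urllib.request.urlopen(sql_api_url(...)) would actually run the query.
```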
The dataset is now ready to be used:
You can now play around and create several visualizations. Interestingly, once a map is created in CartoDB, it can be embedded elsewhere using an iframe. This is great since maps can then be displayed directly within DSS!
The Studio offers the ability to create your own web apps in the "Insights" section. So to display our maps, we just need to create new web apps and copy/paste the CartoDB iframe link inside the body of the app:
Lastly, these Insights can be published to the Pinboard (using the Publish button), which will let your users play with the viz you just created!
That's a wrap! You should now have a better understanding of how to use external services from within DSS.
Beyond being able to produce the visual outputs, using DSS to perform such a project offers several advantages:
If you want to play with the code yourself, get the free Community Edition of DSS, download a copy of the project, create a new project in your DSS and import the archive. Note that this won't work out of the box unless you have a PostGIS database available.
Can you think of a comparable use-case in your organization? Feel free to get in touch with us to discuss how the Data Science Studio can help!