In this series of guides, we pass on experience gained over several projects done by Dataiku. This guide shares tips and tools to ease collaboration.
After learning how to use the Studio, users start solving their own problems. The tips below are meant to help those projects succeed.
Properly naming your datasets and your recipes is arguably the most important element for collaboration. Good naming helps you recover your previous work, share your work with others, and understand quickly what your colleagues are working on.
The two main objectives are readable and self-explanatory names. Keep your names as short as possible, and think of what each element is doing in your flow. Default names are created by appending the name of the operation to the input’s name. This naming scheme has the benefit of being simple, but after a few chained operations the names become long and unreadable. Replace the default name with something more self-explanatory.
A good method is to focus on what the created dataset will be used for, and to find differentiating names, e.g. foo_raw and foo_clean: the input is raw data, the output is clean.
The following rules maintain names compatible with all storage connections (SQL dialects, HDFS, Python dataframe columns, etc.):
Optionally, you can adopt prefixes and suffixes for your datasets (e.g. foo_t for a dataset in a SQL database, foo_hdfs for an HDFS dataset).
Keep the same tips in mind when naming dataset columns, notebooks, and projects.
Adopting a good naming scheme avoids many long descriptions and comments. However, comments are still very useful when collaborating with others (note that you, six months from now, count as someone else). There are many places where a few words can be very useful:
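The naming advice above can be checked automatically. The following is a minimal sketch, assuming a conservative convention of lowercase ASCII letters, digits, and underscores, starting with a letter and capped at 40 characters. These specific rules, the function names `is_portable_name` and `suggest_name`, and the `d_` fallback prefix are assumptions chosen for illustration, not an official rule set.

```python
import re

# Assumed convention (illustrative only, not an official rule set):
# lowercase ASCII letters, digits and underscores, starting with a letter,
# and at most MAX_LENGTH characters.
MAX_LENGTH = 40
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_portable_name(name: str) -> bool:
    """Return True if `name` follows the assumed convention."""
    return len(name) <= MAX_LENGTH and NAME_PATTERN.fullmatch(name) is not None


def suggest_name(name: str) -> str:
    """Propose a name that follows the assumed convention."""
    # Lowercase, then turn every run of other characters into one underscore.
    candidate = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    if not candidate or not candidate[0].isalpha():
        # Hypothetical prefix for names that would not start with a letter.
        candidate = "d_" + candidate
    return candidate[:MAX_LENGTH].rstrip("_")


for raw in ["foo_clean", "Foo Raw (2019)", "2019 sales"]:
    print(raw, is_portable_name(raw), suggest_name(raw))
```

A check like this could be run over a list of dataset or column names before sharing a project, so that naming problems surface early rather than when a dataset is moved to another connection.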
A description on the project homepage. You can add links to datasets, recipes, or any element of the project.
A description in the “summary” tab of a dataset. Note that this description appears on the flow: click the dataset, then “details” in the right column.
When publishing insights, you can add the link of the corresponding dataset in the description text.
Edit column details to add a short comment.
Using tags extensively in the flow helps identify at a glance the role of each part of the flow.
You can also tag elements with the name of the person responsible for them.
Tag colors can be changed (for instance, use red tags for important or urgent elements).
Suggestions for good tags:
See “Dashboard & Insights” on this page. On the insights page, one can see all created graphs and webapps, and publish them on a dashboard. Dashboards can also contain webapps, notebooks (esp. the images they generate), datasets and more.
Dashboards are a good way to share findings among the team, and can be used to show a report to a read-only user (e.g. a manager).
Most code input boxes, for instance in Python recipes or in custom Python code for a model, have a “code samples” button in their top right corner. Start by exploring the provided code samples. They are meant to help you get started when facing a blank page.
If you find yourself repeatedly writing similar portions of code, consider writing a plugin (the biggest investment, but the easiest to use, even for non-coders), a library, or a code sample (the lightest investment). A code sample can then easily be inserted in other code boxes and is available to all team members, which saves time for everyone.
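As an illustration of the kind of snippet worth saving, here is a minimal sketch of a small, documented helper that could be kept as a code sample or placed in a shared library. The function name `normalize_label` and its behavior are hypothetical, chosen only to show the shape of a reusable snippet.

```python
def normalize_label(value):
    """Trim, collapse inner whitespace, and lowercase a text label.

    Non-string values (e.g. None or numbers) are returned unchanged,
    so the helper can be applied to a column with missing values.
    """
    if not isinstance(value, str):
        return value
    # str.split() with no argument splits on any run of whitespace.
    return " ".join(value.split()).lower()


labels = ["  New   York ", "PARIS", None]
print([normalize_label(v) for v in labels])
```

Keeping such helpers short, with a docstring that states what happens to unexpected input, makes them easier for teammates to reuse without reading the implementation.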