FAQ: how do I duplicate a project? Can I apply an entire workflow to new data?
Duplicate a project between two distinct DSS instances
This is the use case we had in mind when designing the export+import feature: we want moving projects from one DSS instance to another to be easy.
To do so:
- in the existing project, go to the project home page → settings (or “project administration” in older DSS versions), then actions → export the project. You get a .zip file, downloaded by your browser.
- on the second DSS instance, go to the DSS home page (the list of projects), click “Import project” (or “new project → import” in older versions) and upload this zip file.
If bandwidth between your local machine and the DSS server is a concern, there are at least two options:
- uncheck the export of managed datasets and the like. Once the project is imported on the second DSS instance, rebuild those datasets.
- better, use the command-line export and import scripts available on the DSS server:

```
DATA_DIR/bin/dsscli project-export project_key foo.zip
DATA_DIR/bin/dsscli project-import foo.zip
```

Run dsscli project-export -h for a short help.
If some datasets are accessed through a connection (for instance to a SQL database), similar connections must exist on the new DSS instance. During import, the new DSS instance asks you which connections to use.
The idea is to first configure the connection on the second instance with a different HDFS root path or SQL schema (or pointing to a different database altogether). Do not create a connection to the same location (e.g. the same DB) on the second DSS instance: both projects would then write to the same tables (in SQL, Hive, etc.), and the last dataset to be computed would overwrite the one stored in the same table.
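If you script the import with the public Python API (dataikuapi), the import settings can remap connections at that point. Below is a minimal sketch: the helper building the settings dict is hypothetical, the host URL, API key and connection names are placeholders, and the shape of the `remapping` settings should be checked against the dataikuapi documentation for your DSS version.

```python
# Sketch: build the import settings used to remap connections when importing
# a project archive via the public API. All names are placeholders.

def connection_remapping(mapping, target_project_key=None):
    """Build import settings remapping each source connection to a target one."""
    settings = {
        "remapping": {
            "connections": [
                {"source": src, "target": dst} for src, dst in mapping.items()
            ]
        }
    }
    if target_project_key:
        settings["targetProjectKey"] = target_project_key
    return settings

# Actual import (requires a running DSS instance; verify the calls against
# your DSS version):
#
# import dataikuapi
# client = dataikuapi.DSSClient("http://dss-two:11200", "YOUR_API_KEY")
# with open("foo.zip", "rb") as f:
#     handle = client.prepare_project_import(f)
# handle.execute(settings=connection_remapping({"sql_old": "sql_new"}))
```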
Duplicate a project on the same DSS instance
To duplicate a project, the recommended way is also to export the project then import it (and choose a new project key during import).
If you have SSH access to the server hosting DSS, you can also do this from the command line with the dsscli commands shown above. This also avoids the round-trip data transfer between DSS and your browser.
Changing the project key while importing updates the relevant settings/paths (almost) everywhere, but custom code must be updated manually, as usual.
But there is a subtlety to take care of: if you reimport an export as-is on the same instance, both projects will use the same connections (a connection is global to a DSS instance, i.e. shared among projects) and target the same location to store their data. Thus, for datasets stored through a connection (SQL, HDFS, S3, a NoSQL DB, etc.), you need to define a new location to store the data:
- either change the table corresponding to each dataset
- or define a new connection, to a new location, and use it for all affected datasets of the new project. For instance, each connection could use a different DB schema.
There is no need to edit filesystem-managed datasets, as filesystem-managed datasets from different projects are stored in distinct folders.
To change the connection of a dataset, you can either:
- explore the dataset, click settings → connection
- use the public API. This is the recommended way to edit a large number of datasets through scripting.
- if for some reason you cannot use the API, an unofficial way is to edit the JSON files defining the datasets.
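The scripted route can be sketched as follows. The pure function below edits the dataset definition dict the way the settings screen does, assuming SQL/HDFS-style datasets expose their connection under `params.connection`; the commented API calls, host, key and connection names are placeholders to verify against the dataikuapi documentation for your DSS version.

```python
# Sketch: retarget a dataset definition from one connection to another.
# The "params.connection" field is an assumption about the definition shape.

def retarget_connection(definition, old_conn, new_conn):
    """Point the dataset at new_conn; return True if it was on old_conn."""
    params = definition.get("params", {})
    if params.get("connection") == old_conn:
        params["connection"] = new_conn
        return True
    return False

# With the public API (requires a running DSS instance):
#
# import dataikuapi
# client = dataikuapi.DSSClient("http://dss:11200", "YOUR_API_KEY")
# project = client.get_project("MYPROJECT_COPY")
# for ds in project.list_datasets():
#     dataset = project.get_dataset(ds["name"])
#     definition = dataset.get_definition()
#     if retarget_connection(definition, "sql_old", "sql_new"):
#         dataset.set_definition(definition)
```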
After changing the connection of the datasets, you need to rebuild them: defining a new storage location does not move the data, so the new target location starts empty.
When you import the new project, no recipes are run, so no datasets of the old project are overwritten at that point. Of course, be careful to first edit the datasets to choose the right connection, and only then launch the build of a dataset in the new project.
We are currently thinking about ways to improve that, but we don't have definitive plans yet.