FAQ: how do I duplicate a project? Can I apply an entire workflow to new data?
This is the use case we had in mind when designing the export+import feature: we want moving projects from one DSS instance to another to be easy.
To do so:
If bandwidth between your local client and the DSS server is a problem, there are at least two options:
DATA_DIR/bin/dsscli project-export project_key foo.zip
DATA_DIR/bin/dsscli project-import foo.zip
Run dsscli project-export -h for short help.
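As a rough illustration, the two dsscli commands above can be wrapped in small shell helpers. This is only a sketch: the function names (run, export_project, import_project) and the DRY_RUN variable are hypothetical conveniences, not part of DSS. Only the project-export and project-import invocations come from the commands shown above. By default the helpers print the command instead of running it, so you can check it first.

```shell
#!/bin/sh
# Sketch: wrap the dsscli export/import commands shown above.
# Hypothetical helpers: run, export_project, import_project, DRY_RUN.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: export_project DATA_DIR PROJECT_KEY ARCHIVE
export_project() {
    run "$1/bin/dsscli" project-export "$2" "$3"
}

# usage: import_project DATA_DIR ARCHIVE
import_project() {
    run "$1/bin/dsscli" project-import "$2"
}
```

On the source server you would call export_project with that instance's data directory, copy the archive to the target server by whatever means you prefer, then call import_project there. Set DRY_RUN=0 once the printed commands look right.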
If there are datasets accessed through a connection (for instance to a SQL database), similar connections must exist on the new DSS instance. The new DSS instance will ask you which connections to use.
The idea is that you first configure the connection in the second instance, with a different HDFS root path or SQL schema (or to a whole different DB). Do not create a connection to the same location (e.g. the same DB) on the second DSS instance: both projects would then write to the same tables (in SQL, Hive, etc.). The last dataset to be computed would overwrite the other dataset stored in the same table.
To duplicate a project, the recommended way is also to export the project then import it (and choose a new project key during import).
If you have SSH access to the server hosting DSS, you can also do this from the command line with the dsscli commands explained above. This also avoids the round-trip data transfer between DSS and your browser.
Changing the project key while importing should update the relevant settings and paths (almost) everywhere. Custom code, however, must be updated manually as usual.
But there is a subtlety to take care of: if you reimport an export as is on the same DSS instance, both projects will use the same connections (since a connection is global to a DSS instance, i.e. it is shared among projects) and target the same location to store their data. Thus, for datasets stored through a connection (SQL, HDFS, S3, a NoSQL DB, etc.), you need to define a new location to store the data:
There is no need to edit the filesystem managed datasets, as filesystem managed datasets from different projects are stored in distinct folders.
To change the connection of a dataset, you can either:
After changing the connection of the datasets, you need to rebuild them (defining a new storage location does not move the data, so after using a new connection, the target location is empty).
When you import the new project, no recipes are run, so no datasets of the old project are overwritten at that point. Of course, be careful to first edit the datasets to choose the right connection, and only then launch the build of a dataset in the new project.
We are currently thinking about ways to improve that, but we don’t have definitive plans yet.