Data on Clusterone

MK
Last updated 13 days ago

Clusterone provides several types of data storage. Regardless of the type, data storage is represented by a dataset object in Clusterone. The following dataset types are currently supported:

  • Datasets managed using Git (GitHub or GitLab)

  • Datasets stored in Amazon S3

  • EFS datasets or other NFS mounts (Clusterone Enterprise only)

Regardless of the data source, a dataset has to be created on Clusterone before its data can be used in a job. Think of a Clusterone dataset as an endpoint that a job uses to access the underlying data.

Comparison

| Dataset | Availability | Storage | Upload | Runtime | Versioning |
| --- | --- | --- | --- | --- | --- |
| Git | SaaS / Enterprise | GitHub or GitLab | `git push` command | Data copied to local storage before job starts | Yes |
| S3 | SaaS / Enterprise | AWS S3 | Via third-party client | Data copied to local storage before job starts | No |
| EFS or other NFS mounts | Enterprise | AWS EFS or any NFS storage | Via third-party client | Dataset mounted on pods | No |

Creating and managing datasets

The process for creating a dataset is similar for all data source types, but some details differ depending on the source.

Check the pages dedicated to each data source for detailed instructions on how to work with that type of dataset.

Use a dataset for running jobs

When creating a job, you can specify any number of datasets to associate with that job. At runtime, the selected datasets will be mounted at /data/<username>/<dataset-name>.

For example, if user jimihendrix has a dataset called guitars, at runtime this dataset will be accessible through the path /data/jimihendrix/guitars.

Depending on the storage type, the data will either be mounted, or transferred to the worker's storage before job startup.
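As a sketch, the runtime mount path for a dataset can be built from the username and dataset name described above (the values below are illustrative, taken from the example user):

```python
import os

def dataset_mount_path(username: str, dataset_name: str) -> str:
    """Build the path where Clusterone mounts a dataset at runtime:
    /data/<username>/<dataset-name>."""
    return os.path.join("/data", username, dataset_name)

# Illustrative values; substitute your own username and dataset name.
print(dataset_mount_path("jimihendrix", "guitars"))  # /data/jimihendrix/guitars
```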

Running the same code locally and on Clusterone: get_data_path()

To define the path to your data, use the get_data_path() function from the clusterone Python package. It enables switching from local to a remote environment without changing your code.

For more information on get_data_path(), see the documentation page for the Clusterone Python package.

Sharing datasets

Datasets can be shared with other users, so they can use them in their own jobs.

To share a dataset, click on it in the Matrix and select the share button.

In the dialog, enter the username or email address of the user you want to share the dataset with. You can also add a custom message. Click the Send invites button at the bottom of the form.

The dataset will now automatically appear in the invited user's list of datasets. The user will also receive a notification about the shared dataset.

Clicking the Show detailed user access link shows which users have access to the dataset. From that view you can also change each user's access level or revoke their access to the dataset.