You can use Amazon S3 to store datasets and natively import them when running jobs on Clusterone. To use an S3 bucket as a dataset, enter your AWS access keys in the access keys section of the Matrix. Click here for more details on how to connect your AWS account to Clusterone.

S3 datasets on Clusterone correspond to S3 buckets on AWS. Each user can create any number of datasets.

This page provides an overview of how to create a dataset on S3. See here for an overview of other available options for storing datasets.

Create an S3 dataset using an existing S3 bucket

Log into the Matrix and navigate to the Datasets page either by clicking the datasets icon on the left or the blue Datasets field on the dashboard.

Click the Add New Dataset button and select Use existing S3 bucket.

If you don't see the dataset creation wizard, your AWS account isn't linked correctly. Make sure you follow the instructions to connect your AWS account to Clusterone.

On the next page, type in the name of the S3 bucket you want to link as a dataset. Keep in mind that the AWS account you've connected must have access to this bucket.

Once you're ready, click the Add dataset button to create the dataset.

Create an S3 dataset and a new S3 bucket

From the Matrix

Log into the Matrix and navigate to the Datasets page either by clicking the datasets icon on the left or the blue Datasets field on the dashboard.

Click the Add New Dataset button and select Create S3 bucket.

If you don't see the dataset creation wizard, your AWS account isn't linked correctly. Make sure you follow the instructions to connect your AWS account to Clusterone.

On the next page, type in the name of the S3 bucket you want to create.

Please note that the name you choose for your bucket must be unique across all of AWS and must follow the S3 naming conventions. Take a look here for the allowed characters and other rules set by AWS. The bucket name:

  • is 3-63 characters long

  • contains only lowercase letters, numbers, periods, and dashes

  • starts with a lowercase letter or number

  • cannot contain underscores, end with a dash, have consecutive periods, or use dashes adjacent to periods
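The rules above can be checked before you try to create a bucket. The following is a minimal sketch in Python that encodes only the rules listed on this page; the function name `bucket_name_problems` is hypothetical and not part of Clusterone or AWS tooling. It does not check global uniqueness or any additional rules AWS may enforce.

```python
import re
from typing import List

# Only lowercase letters, digits, periods, and dashes are allowed.
_ALLOWED_CHARS = re.compile(r"[a-z0-9.-]+")


def bucket_name_problems(name: str) -> List[str]:
    """Return the list of naming rules (as stated on this page) that `name` breaks."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("must be 3-63 characters long")
    if not _ALLOWED_CHARS.fullmatch(name):
        # This also rejects underscores and uppercase letters.
        problems.append("may contain only lowercase letters, numbers, periods, and dashes")
    if not re.match(r"[a-z0-9]", name):
        problems.append("must start with a lowercase letter or number")
    if name.endswith("-"):
        problems.append("must not end with a dash")
    if ".." in name:
        problems.append("must not contain consecutive periods")
    if ".-" in name or "-." in name:
        problems.append("must not use dashes adjacent to periods")
    return problems


if __name__ == "__main__":
    for candidate in ["my-dataset-2024", "My_Bucket", "data-"]:
        print(candidate, bucket_name_problems(candidate))
```

An empty list means the name passes these checks; AWS may still reject it, for example if another account already owns a bucket with that name.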

As a general rule, choose a name that is distinctive and reflects the data your dataset contains.

When you're ready, click the Add dataset button to create the dataset and S3 bucket.

From the CLI

To create a new S3 dataset, run:

just create dataset s3 <dataset-name>

Remember that the S3 bucket naming conventions also apply in this case.

Add data to an S3 dataset

You can add data to your S3 dataset and modify it directly through Amazon Web Services.

Please note that the listing of available buckets is disabled for security reasons. When connecting with a client, you will have to specify the bucket you want to access.

Using the AWS CLI

We recommend using the AWS CLI for uploading data to an S3 dataset. Other clients, such as 3Hub, often suffer from much lower uploading speeds.

To upload data to an S3 bucket with the AWS CLI, follow these steps:

  • Install the AWS CLI

  • Run aws configure and input your access key and secret key. See here to learn where to find these keys in your Clusterone account.

  • Run aws s3 cp <source-file> s3://<bucket-name> for file upload or aws s3 cp <source-folder> s3://<bucket-name> --recursive for folder upload.
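If you upload from a script, the last step can be wrapped in a small helper. This is a sketch only, assuming the AWS CLI is installed and already configured as described above; the function names `build_upload_command` and `upload` are hypothetical. It builds the same `aws s3 cp` commands shown in the steps and adds `--recursive` when the source is a folder.

```python
import subprocess
from pathlib import Path
from typing import List


def build_upload_command(source: str, bucket_name: str) -> List[str]:
    """Build the `aws s3 cp` argument list for a file or a folder."""
    command = ["aws", "s3", "cp", str(source), f"s3://{bucket_name}"]
    if Path(source).is_dir():
        # Folders need --recursive, as in the step above.
        command.append("--recursive")
    return command


def upload(source: str, bucket_name: str) -> None:
    """Run the upload; raises CalledProcessError if the AWS CLI reports failure."""
    subprocess.run(build_upload_command(source, bucket_name), check=True)
```

For example, `upload("./images", "<bucket-name>")` would run the folder variant of the command, with `<bucket-name>` replaced by the bucket you want to access.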

At Runtime

When running a job, data will be copied from S3 to Clusterone's distributed storage to enable maximum speed when running the job. The job will start once the download is complete. Billing will start after the data is copied and the job has started.

The content of an S3 dataset named <username>/<dataset-name> will be available at runtime in /data/<username>/<dataset-name>.
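Inside job code, the dataset location can be derived from the two names rather than hard-coded. A minimal sketch, assuming the path layout described above; the helper name `runtime_dataset_path` and the example values `alice` and `mnist` are hypothetical.

```python
from pathlib import PurePosixPath


def runtime_dataset_path(username: str, dataset_name: str) -> PurePosixPath:
    """Return the runtime location of the dataset <username>/<dataset-name>."""
    return PurePosixPath("/data") / username / dataset_name


if __name__ == "__main__":
    # Hypothetical user and dataset names, for illustration only.
    print(runtime_dataset_path("alice", "mnist"))
```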

After the job terminates, data does not persist on the pods, so any changes made to the data during the job are not saved.

Performance

Downloading from S3 can be slow. In particular, transferring a large collection of small files reduces the overall transfer speed. For large S3 datasets, we recommend storing a few large files rather than many small ones.
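One way to follow this recommendation is to bundle a folder of small files into a single archive before uploading it. The sketch below uses Python's standard `tarfile` module; the function name `bundle_directory` is hypothetical, and it assumes your job code unpacks the archive itself at runtime, since this page does not describe any automatic extraction.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir: str, archive_path: str) -> Path:
    """Pack `source_dir` and everything in it into one gzip-compressed tar file."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps paths in the archive relative to the folder name.
        archive.add(str(source), arcname=source.name)
    return Path(archive_path)
```

The resulting archive can then be uploaded as a single file instead of uploading the folder recursively.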

Clusterone Enterprise offers a variety of storage options. Get in touch via email or chat with us on our Slack, and our solutions team will help you find the option that is best for you.