Get Started

This guide will get you started with the Clusterone SaaS platform at clusterone.com. If you're working with an enterprise installation, see this page.

In the coming minutes, we'll walk you through setting up your account, linking your code and data, and training your model.

To keep things simple, we'll show you how to use Clusterone using a ready-to-run demo of a self-driving car simulation.

What is Clusterone?

Clusterone is a deep learning platform that allows you to train your models on distributed GPUs and CPUs without setup or maintenance. Think of it as the operating system for deep learning. Clusterone runs in the cloud, in on-premise installations, or in a combination of the two. We offer a SaaS platform as well as dedicated enterprise installations.

Set up

Before we begin, make sure you have your gear ready:

  • A Clusterone account. Join the waitlist if you don't have one yet.

  • A GitHub account. You can register here.

  • Python 2.7 or 3.5+

  • The Clusterone Python package. Install it with pip install clusterone.

The Clusterone command line interface, called just, is installed automatically with the Clusterone Python package. Clusterone also provides a graphical web interface, the Matrix.
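If you're not sure whether your interpreter meets the Python requirement listed above, a quick check like the following can help. This is only a minimal sketch: the helper name python_version_supported is our own and is not part of the Clusterone package.

```python
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True for Python 2.7 or for Python 3.5 and newer."""
    major, minor = version_info[0], version_info[1]
    if major == 2:
        # On the 2.x line, only 2.7 is listed as supported.
        return minor == 7
    # Tuple comparison covers 3.5, 3.6, ... and any later major version.
    return (major, minor) >= (3, 5)


print("Supported Python version: %s" % python_version_supported())
```

Run it with the same interpreter you plan to use for pip install clusterone.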

Linking your GitHub account allows you to access GitHub repositories from within Clusterone. To do this, you need to create a GitHub access token and add it to your Clusterone account.

On GitHub

Log into your GitHub account and navigate to the Personal Access Tokens page in the developer settings. Generate a new token and grant it the repo and admin:repo_hook permissions.

Copy the token when it's created.
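If you'd like to double-check that the token carries both permissions before adding it to Clusterone, you can ask GitHub directly. The sketch below relies on GitHub's REST API, which reports the scopes of a classic personal access token in the X-OAuth-Scopes response header. It is not part of Clusterone, the function names are our own, and it assumes Python 3.

```python
import urllib.request

# The two permissions this guide asks you to grant.
REQUIRED_SCOPES = {"repo", "admin:repo_hook"}


def missing_scopes(scopes_header):
    """Return the required scopes that are absent from an X-OAuth-Scopes value."""
    granted = {s.strip() for s in (scopes_header or "").split(",") if s.strip()}
    return REQUIRED_SCOPES - granted


def fetch_granted_scopes(token):
    """Ask the GitHub API which scopes the given token has (needs network access)."""
    request = urllib.request.Request(
        "https://api.github.com/user",
        headers={"Authorization": "token " + token},
    )
    with urllib.request.urlopen(request) as response:
        return response.headers.get("X-OAuth-Scopes", "")


# Example (paste your own token; avoid committing it to a repository):
# print(missing_scopes(fetch_granted_scopes("<your token>")))
```

An empty set means both permissions are present. Anything else lists what still needs to be granted.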

On Clusterone

Log into your Clusterone account and open the Matrix. On the Account page, select the Keys tab. Click the Add GitHub OAuth Token button and paste the access token you created above. Click Save to store the token.

Perfect, you have successfully linked your GitHub account to Clusterone.

For more information on linking GitHub to Clusterone, see here.

Create a Project and Run Code

After adding your GitHub token, you may need to refresh the page before you can create a GitHub project. We are aware of this issue and are working on a fix!

Create a project

Log into the Matrix and toggle the switch on the left to show your projects. Click the Add Project button and select Link GitHub Repository in the wizard.

On the next screen, type clusterone/self-driving-demo to find the repository. Click the button at the bottom right to create the project.

To learn more about other ways to create a project, see here.

Create a dataset

For the self-driving car example, you don't have to worry about creating a dataset. We've already uploaded the data for you.

To learn more about how to use data with Clusterone, see here.

Create a job and run it

Open a command line and log into your Clusterone account:

just login

If this command fails or just isn't recognized by your command line, make sure the Clusterone Python package is installed and has been added to your PATH.
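To see whether your shell can find the command at all, you can look it up on your PATH. A minimal sketch, assuming Python 3 (shutil.which is not available in Python 2.7). The helper name find_cli is our own.

```python
import shutil


def find_cli(name="just"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


path = find_cli()
if path is None:
    print("'just' was not found on PATH. Check that 'pip install clusterone' "
          "succeeded and that pip's script directory is on your PATH.")
else:
    print("Found just at " + path)
```

Note that this only confirms that some executable named just is on your PATH. It cannot tell whether that executable is the one installed by the Clusterone package.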

Next, create a job:

just create job distributed --project self-driving-demo --module main_tf \
--datasets tensorbot/self-driving-demo-data --ps-type c4.2xlarge \
--worker-type c4.2xlarge --name first-job

The just create job command may fail if you run it shortly after creating the project, because it can take up to a few minutes to retrieve all data from GitHub. If that happens, grab a coffee or tea and try again.
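If you'd rather not retry by hand, a small wrapper can do the waiting for you. This is a sketch under assumptions: it treats a non-zero exit code as failure, which we have not verified for the just command, and the helper name, attempt count, and delay are arbitrary choices of ours.

```python
import subprocess
import time


def run_with_retry(command, attempts=5, delay_seconds=60,
                   runner=subprocess.call, sleep=time.sleep):
    """Run a command until it exits with code 0.

    Returns the number of the successful attempt (starting at 1),
    or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if runner(command) == 0:
            return attempt
        if attempt < attempts:
            # Give the platform time to finish fetching the repository.
            sleep(delay_seconds)
    return None


# Example usage with the command from this guide:
# run_with_retry([
#     "just", "create", "job", "distributed",
#     "--project", "self-driving-demo", "--module", "main_tf",
#     "--datasets", "tensorbot/self-driving-demo-data",
#     "--ps-type", "c4.2xlarge", "--worker-type", "c4.2xlarge",
#     "--name", "first-job",
# ])
```

The runner and sleep parameters exist only so the helper can be tried out without calling the real CLI or actually waiting.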

Let's go over the parameters:

  • Here we are creating a distributed job, meaning that we're using multiple machines in parallel. If you'd rather run your code on a single machine, use just create job single ... instead.

  • The --project parameter determines the project you want to run code from. You can only run code from one project per job.

  • The --datasets parameter can accept multiple datasets if your code uses them. You can also omit it if you don't want to use any dataset.

  • The --module parameter is used to define which Python file Clusterone should execute. If this parameter is not provided, Clusterone assumes the file is called main.py. In the self-driving car example, our module is called main_tf.py, so we have to set --module main_tf.

  • Clusterone offers a variety of different machines to run jobs on. The type of machine is defined by the --ps-type and --worker-type parameters for parameter servers and worker machines respectively. In case you want to run on a single machine, use the --instance-type parameter. See here for a list of available instance types.

  • The --name parameter is used to give the job a name. Use this name to refer to the job in the just start job command below.
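To make the relationship between these parameters concrete, here is a small sketch that assembles the command from its parts. It uses only the flags described in this guide. The function name build_create_job_command is our own, and whether the CLI accepts the flags in any other order or combination is not something this sketch verifies.

```python
def build_create_job_command(mode, project, name, module=None, datasets=None,
                             ps_type=None, worker_type=None, instance_type=None):
    """Assemble a 'just create job' command as a list of arguments."""
    if mode not in ("single", "distributed"):
        raise ValueError("mode must be 'single' or 'distributed'")
    command = ["just", "create", "job", mode, "--project", project]
    if module:
        # When omitted, Clusterone assumes the file is called main.py.
        command += ["--module", module]
    if datasets:
        command += ["--datasets", datasets]
    if mode == "distributed":
        # Distributed jobs size parameter servers and workers separately.
        if ps_type:
            command += ["--ps-type", ps_type]
        if worker_type:
            command += ["--worker-type", worker_type]
    elif instance_type:
        # Single-machine jobs use one instance type.
        command += ["--instance-type", instance_type]
    command += ["--name", name]
    return command


print(" ".join(build_create_job_command(
    "distributed", "self-driving-demo", "first-job",
    module="main_tf",
    datasets="tensorbot/self-driving-demo-data",
    ps_type="c4.2xlarge", worker_type="c4.2xlarge",
)))
```

The printed line matches the command shown earlier, just without the line breaks.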

Finally, all that's left to do is to start the job:

just start job -p self-driving-demo/first-job

The -p parameter determines which job to start.
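Judging from the example above, the value passed to -p combines the project name and the job name, separated by a slash. The sketch below builds the command on that assumption. The helper name build_start_job_command is our own.

```python
def build_start_job_command(project, job_name):
    """Assemble a 'just start job' command for a job inside a project."""
    if not project or not job_name:
        raise ValueError("both project and job_name are required")
    # Assumed format, inferred from the example in this guide: <project>/<job name>
    job_path = "{}/{}".format(project, job_name)
    return ["just", "start", "job", "-p", job_path]


print(" ".join(build_start_job_command("self-driving-demo", "first-job")))
```

The job name here is the one you chose with --name when creating the job.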

View your Job and Results

Once you start the job, it gathers the resources it needs and begins running as soon as all of them are available.

Follow Job Progress

You can follow the progress of your job on the Matrix. Click the "See Details" button under the name of your job to see how it's doing.

The "Events" tab provides a graphical representation of the startup progress of the job. Four circles allow you to see at a glance if your job has gathered all the resources it needs, or what is still missing.

  • The Creation Status tells you if the job has been created.

  • The Computational Requirements circle lists all required workers and parameter servers. It also shows whether the workers are running or whether your job is still waiting for workers to become available.

  • The Code Cloning circle tells you if the repository code has been successfully cloned onto the worker machines.

  • The Process Start-Up circle represents the overall status of the job. Once the job has started, it will say "Running".

The "Outputs" tab contains a list of all raw output files that are generated while running the job. Here you can find the log files for each worker, event logs, and more. Click on each file to open it, or follow the download link on the right to download the file.

Connect to TensorBoard

Clusterone provides direct access to TensorBoard, TensorFlow's suite of visualization tools.

To add your running job to TensorBoard, click the "Add to TensorBoard" button. Your job is now available on TensorBoard.

To access TensorBoard, click the TensorBoard button on the top bar. You can observe how well the model trains using the Training_Loss and Validation_Loss curves on the "Scalars" page of TensorBoard.

You can further examine a graph representation of the model on the "Graph" page.

Please note that TensorBoard only officially supports Chrome. If you have trouble displaying TensorBoard in Firefox, Safari, or another browser, try using Chrome instead.

Learn More

In this guide, you have learned how to set up your first project on Clusterone, how to run it, and how you can examine its results. What's next?

If you're looking for another use case example, you can follow our DCGAN tutorial. In this more complex example, we run a Deep Convolutional GAN and generate artificial celebrity faces based on the celebA dataset.

If you want to learn more about a specific part of Clusterone, check out our Documentation Homepage with articles on all the details of running state-of-the-art distributed machine learning models on Clusterone.

Or jump right in and run your own project. If you have any comments, questions, or concerns, please don't hesitate to contact us. We'd love to hear from you!

Join our Slack to get support and tips from the community.