PyTorch

From the Matrix

Select the PyTorch Docker image at the Environment step of the job creation wizard. Only the latest version of PyTorch is supported.

From the command line

Use just create job with the --docker-image pytorch-0.4.0-gpu-py36-cuda9.2 option:

just create job single --project <project> \
--datasets <dataset> --docker-image pytorch-0.4.0-gpu-py36-cuda9.2

If you are using PyTorch as your framework, do not list pytorch or torchvision in your requirements file, as doing so will cause an error. In the CLI, pass --docker-image pytorch-0.4.0-gpu-py36-cuda9.2; in the Matrix, select PyTorch as the framework.
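
For example, a minimal requirements.txt for a PyTorch job might look like the following (numpy is only illustrative); pytorch and torchvision are deliberately omitted because they already ship with the Docker image:

numpy
tensorboardX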

Distributed training is not currently supported for PyTorch.

PyTorch and TensorBoard

You can take advantage of TensorBoard even with PyTorch. The tensorboardX repository provides a TensorBoard API for PyTorch (or any other framework).

When using pip, include the following in the requirements.txt file:

tensorboardX

For Anaconda, add this to the requirements.yml file:

name: clusterone
dependencies:
  - tensorboardX

Then use the following snippet in your code:

# ... your requirements
import os

from tensorboardX import SummaryWriter
from clusterone import get_logs_path

# Local path to outputs; when running locally, TensorBoard summaries are saved here
ROOT_PATH_TO_LOCAL_LOGS = os.path.expanduser("~/Documents/pytorch-projects/examples/logs")

# ... model definition

if __name__ == "__main__":
    writer = SummaryWriter(log_dir=get_logs_path(ROOT_PATH_TO_LOCAL_LOGS))
    for batch_index in range(nb_batches):
        ...  # training operation
        loss = ...  # compute your loss
        # Only write the loss to TensorBoard every n batches so logging does not slow down training
        if batch_index % 100 == 0:
            writer.add_scalar('loss', loss, batch_index)
    writer.close()
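
When running locally, you can then point TensorBoard at the same directory to view the summaries (this assumes TensorBoard is installed on your machine):

tensorboard --logdir ~/Documents/pytorch-projects/examples/logs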

See the full tensorboard-pytorch API documentation for more information.
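
Beyond scalars, the SummaryWriter also supports other summary types such as histograms and images. Here is a minimal, self-contained sketch; the tensors below are stand-ins for real model weights and input images:

import torch
from tensorboardX import SummaryWriter

writer = SummaryWriter(log_dir="logs")  # illustrative local log directory
step = 0
weights = torch.randn(100)     # stand-in for a layer's weights
image = torch.rand(3, 64, 64)  # stand-in for one input image (channels, height, width)

writer.add_histogram('weights', weights, step)  # log the distribution of values
writer.add_image('input', image, step)          # log the tensor as an image
writer.close()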