GPU setup

GPU-backed runners host Gazebo Harmonic worlds, GPU-rendered Webots scenes, and any test that exercises CUDA. The platform learns about GPUs from the runner’s declared capabilities, and the runner only declares what the host can actually serve. This page covers the host-side work.

Supported hardware

Class	Examples	Notes
Data-centre	T4, A10, L4, L40, A100, H100	Recommended for production GPU pools
Workstation	RTX 3090, 4090, 5090, A6000	Fine for dev pools; check driver compatibility
Embedded	Jetson AGX Orin, Orin NX	Supported for ARM64 self-hosted runners; use the aarch64 binary

AMD ROCm and Intel GPUs are not in scope for v2.x. File a request if you need them.

Prerequisites on the host

Install the NVIDIA driver

Use driver 535 or newer for CUDA 12 workloads. On Ubuntu 22.04:

sudo apt install -y nvidia-driver-535-server
sudo reboot
nvidia-smi  # should show your GPU(s)
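To script the "535 or newer" check rather than reading the nvidia-smi table by eye, a minimal sketch follows. The helper name driver_at_least is hypothetical (it is not part of rbtk-runner), and it compares only the major component of the driver version.

```shell
# Hypothetical helper: succeed when the driver's major version is at least
# the required major. Usage: driver_at_least <version> <min_major>
driver_at_least() {
  version=$1
  min_major=$2
  major=${version%%.*}   # "535.183.01" -> "535"
  [ "$major" -ge "$min_major" ]
}

# On a GPU host, feed it the version that nvidia-smi reports:
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   driver_at_least "$ver" 535 && echo "driver OK" || echo "driver too old"
```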

Install nvidia-container-toolkit

This is what lets Docker pass /dev/nvidia* and CUDA libraries into the test container.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Smoke-test container GPU access

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

You should see the same GPU table as on the host.

Declare GPU capability

In runner.yaml:

capabilities:
  ros_distros: [humble]
  sim_engines: [gazebo-harmonic, webots]
  gpu:
    enabled: true
    count: 1            # number of GPUs the runner will offer
    model: "Tesla T4"   # matched against the gpu_model pattern in job requirements
    cuda: "12.4"

Reload the runner (rbtk-runner reload, or restart the service). The dashboard shows the GPU capability after the next heartbeat.
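When filling in count and model, it can help to see what the host itself reports. The sketch below is one way to summarise that; summarise_gpus is a hypothetical helper, not an rbtk-runner command, and it only reformats the GPU names it receives on stdin.

```shell
# Hypothetical helper: collapse a list of GPU names (one per line on stdin)
# into "<count> x <name>" lines.
summarise_gpus() {
  sort | uniq -c | sed -E 's/^ *([0-9]+) +(.*)$/\1 x \2/'
}

# On a GPU host:
#   nvidia-smi --query-gpu=name --format=csv,noheader | summarise_gpus
```

The count and name printed for each line are candidates for the count and model fields in the gpu block.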

Verify with doctor

rbtk-runner doctor --gpu

✓ Docker 24.0.7 reachable
✓ NVIDIA driver 535.183.01
✓ nvidia-container-toolkit 1.16.2
✓ GPU 0: Tesla T4 (15109 MiB free), CUDA 12.4
✓ Test container can reach GPU: PASS
✓ Pool capacity declared: 1 × Tesla T4
✓ Routing eligible for: requires_gpu=true, gazebo-harmonic, webots

A red ✗ marks a failed check. Fix it before relying on the runner for GPU jobs.

Multi-GPU pools

Two patterns work.

One runner, multiple GPUs

If you have a 4-GPU box and want the runner to schedule one GPU per job:

capabilities:
  gpu:
    enabled: true
    count: 4
    model: "L40"
    cuda: "12.4"

resources:
  max_concurrent_jobs: 4

The runner sets NVIDIA_VISIBLE_DEVICES per job so each container sees only the GPUs assigned to it: one by default, or two for a job that declares gpu_count: 2.
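The per-job assignment can be pictured as taking the first free GPU indices and handing them to the container as a comma-separated NVIDIA_VISIBLE_DEVICES value. The sketch below illustrates that idea only; pick_devices is a hypothetical helper, and the runner's real allocation logic is not documented here.

```shell
# Hypothetical helper: pick the first <gpu_count> indices from the free list
# and print them comma-separated. Fails if too few GPUs are free.
# Usage: pick_devices <gpu_count> <free index>...
pick_devices() {
  want=$1
  shift
  [ "$#" -ge "$want" ] || return 1
  picked=""
  while [ "$want" -gt 0 ]; do
    picked="${picked:+$picked,}$1"
    shift
    want=$((want - 1))
  done
  printf '%s\n' "$picked"
}

pick_devices 2 0 1 2 3   # prints "0,1"

# To see what a container pinned this way observes (needs a GPU host):
#   docker run --rm --runtime=nvidia \
#     -e NVIDIA_VISIBLE_DEVICES="$(pick_devices 1 2 3)" \
#     nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi -L
```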

One runner per GPU

If you want hard isolation (one process per GPU), run multiple rbtk-runner instances on the same host with disjoint GPU sets:

# Runner A — GPU 0 only
CUDA_VISIBLE_DEVICES=0 rbtk-runner start --config /etc/roboticks/runner-gpu0.yaml

# Runner B — GPU 1 only
CUDA_VISIBLE_DEVICES=1 rbtk-runner start --config /etc/roboticks/runner-gpu1.yaml

Each runner appears as a distinct row in the pool. Heavier on operational overhead, cleaner blast radius.
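On hosts with more than two GPUs, the start commands follow the same pattern for every index. A small sketch that prints them for review, assuming one config file per GPU named runner-gpuN.yaml as in the example above; runner_commands is a hypothetical helper and it only prints, it does not start anything.

```shell
# Hypothetical helper: print one start command per GPU index given as argument.
runner_commands() {
  for idx in "$@"; do
    printf 'CUDA_VISIBLE_DEVICES=%s rbtk-runner start --config /etc/roboticks/runner-gpu%s.yaml\n' \
      "$idx" "$idx"
  done
}

runner_commands 0 1 2 3

# On a GPU host the indices can come from the driver:
#   runner_commands $(nvidia-smi --query-gpu=index --format=csv,noheader)
```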

Routing for GPU jobs

A test config requests GPU like this:

# .roboticks/test.yaml
requires:
  gpu: true
  gpu_model: "T4|L4"   # regex match against capability.model
  cuda: ">=12.0"

The job_router filters to runners whose declared gpu block satisfies all three. If multiple match, it picks the least-loaded one. If none match, the job queues until one comes online (or times out per project policy).
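The filter-then-pick behaviour can be sketched in a few lines of shell. This is an illustration under assumptions, not the job_router's code: the row format (name,model,cuda,load), the helper names, and the unanchored regex search are all hypothetical, and the cuda constraint is reduced to a minimum version.

```shell
# version_ge A B: succeed when dotted version A >= B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# route_job <model_regex> <min_cuda>: read "name,model,cuda,load" rows on
# stdin and print the least-loaded runner that satisfies both requirements.
# Returns non-zero when nothing matches (the job would stay queued).
route_job() {
  pattern=$1
  min_cuda=$2
  best=""
  best_load=""
  while IFS=, read -r name model cuda load; do
    printf '%s\n' "$model" | grep -Eq -- "$pattern" || continue   # model filter
    version_ge "$cuda" "$min_cuda" || continue                    # CUDA filter
    if [ -z "$best" ] || [ "$load" -lt "$best_load" ]; then
      best=$name
      best_load=$load
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

runners='runner-a,Tesla T4,12.4,3
runner-b,NVIDIA L4,12.4,1
runner-c,NVIDIA L40,12.4,0
runner-d,Tesla T4,11.8,0'

printf '%s\n' "$runners" | route_job 'T4|L4' 12.0    # prints runner-c
printf '%s\n' "$runners" | route_job 'T4|L4$' 12.0   # prints runner-b
```

Under an unanchored search, the pattern T4|L4 also matches "NVIDIA L40", which is why the first call selects runner-c. If the platform matches the same way, anchoring the pattern (for example L4$) keeps L40 hosts out of a pool meant for L4 jobs.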

Common pitfalls

Container sees no GPU even though host does

The Docker daemon needs the NVIDIA runtime registered. After nvidia-ctk runtime configure you must restart Docker (sudo systemctl restart docker). Verify with docker info | grep -i runtime.

CUDA mismatch errors at runtime

The host driver must support the container’s CUDA major version. Driver 535 covers CUDA 12.x; for CUDA 13.x you need driver 580+.

Gazebo Harmonic black screen / shader errors

Gazebo needs OpenGL via EGL. Pass --gpus all -e __GLX_VENDOR_LIBRARY_NAME=nvidia -e __NV_PRIME_RENDER_OFFLOAD=1 to the container. The runner does this automatically when sim: gazebo-harmonic is declared, but if you override the image, set the two environment variables in your Dockerfile.

Multiple processes contending for one GPU

Set resources.max_concurrent_jobs to the same value as capabilities.gpu.count so the runner does not over-subscribe. Per-GPU memory caps via NVIDIA MIG are out of scope for v2.x.

Next steps

Pool management

Troubleshooting