GPU setup

GPU-backed runners host Gazebo Harmonic worlds, GPU-rendered Webots scenes, and any test that exercises CUDA. The platform discovers GPU capability from the runner’s declared capabilities — but the runner only declares what the host can actually serve. This page covers the host-side work.

Supported hardware

| Class | Examples | Notes |
| --- | --- | --- |
| Data-centre | T4, A10, L4, L40, A100, H100 | Recommended for production GPU pools |
| Workstation | RTX 3090, 4090, 5090, A6000 | Fine for dev pools; check driver compatibility |
| Embedded | Jetson Orin AGX, Orin NX | Supported for ARM64 self-hosted; aarch64 binary |

AMD ROCm and Intel GPUs are not in scope for v2.x. File a request if you need them.

Prerequisites on the host

1. Install the NVIDIA driver

Use driver 535 or newer for CUDA 12 workloads. On Ubuntu 22.04:
sudo apt install -y nvidia-driver-535-server
sudo reboot
nvidia-smi  # should show your GPU(s)
2. Install nvidia-container-toolkit

The toolkit is what lets Docker pass the /dev/nvidia* devices and the CUDA libraries into the test container.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
3. Smoke-test container GPU access

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
You should see the same GPU table as on the host.
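
The host checks from the steps above can be scripted before a machine joins a pool. The following is a minimal sketch under stated assumptions, not part of rbtk-runner: `driver_at_least` and `has_nvidia_runtime` are hypothetical helper names, the default of 535 mirrors the driver recommendation in step 1, and the runtime check assumes that `docker info --format '{{json .Runtimes}}'` lists registered runtimes as JSON keys.

```shell
# Hypothetical preflight helpers; not part of rbtk-runner.

# Succeeds when a driver version string such as "535.183.01" has a major
# version >= the minimum (default 535, the CUDA 12 recommendation above).
driver_at_least() {
  version="$1"
  min_major="${2:-535}"
  major="${version%%.*}"
  case "$major" in
    '' | *[!0-9]*) return 1 ;;   # empty or non-numeric: treat as failure
  esac
  [ "$major" -ge "$min_major" ]
}

# Reads runtime info on stdin and succeeds when an "nvidia" runtime is listed.
has_nvidia_runtime() {
  grep -q '"nvidia"'
}

# Usage on a GPU host:
#   v="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   driver_at_least "$v" 535 || echo "driver $v is older than 535"
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#     || echo "NVIDIA runtime not registered; rerun nvidia-ctk and restart Docker"
```

Both helpers only inspect strings, so they can be exercised on a machine without a GPU; the commented usage lines are where the real host state comes in.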

Declare GPU capability

In runner.yaml:
capabilities:
  ros_distros: [humble]
  sim_engines: [gazebo-harmonic, webots]
  gpu:
    enabled: true
    count: 1            # number of GPUs the runner will offer
    model: "Tesla T4"   # used for substring match against job requirements
    cuda: "12.4"
Reload the runner with rbtk-runner reload, or restart the service. The dashboard shows the GPU capability after the next heartbeat.
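
To keep the declared values in line with what the host actually reports, the gpu block can be derived from nvidia-smi output rather than typed by hand. This is a rough sketch, assuming one GPU name per line on stdin (the shape produced by `nvidia-smi --query-gpu=name --format=csv,noheader`); `gpu_block` is a hypothetical helper, and the CUDA version is passed in by hand because this sketch does not query it.

```shell
# Hypothetical helper; prints a gpu capability block for runner.yaml.
# stdin: one GPU name per line. $1: CUDA version to declare, e.g. "12.4".
gpu_block() {
  cuda="$1"
  names="$(cat)"
  [ -n "$names" ] || return 1                        # no GPUs reported
  count="$(printf '%s\n' "$names" | grep -c .)"      # non-empty lines = GPU count
  model="$(printf '%s\n' "$names" | head -n 1)"      # assumes a homogeneous host
  cat <<EOF
capabilities:
  gpu:
    enabled: true
    count: $count
    model: "$model"
    cuda: "$cuda"
EOF
}

# Usage on a GPU host:
#   nvidia-smi --query-gpu=name --format=csv,noheader | gpu_block 12.4
```

The output is only the gpu fragment; merge it into the existing capabilities section rather than replacing ros_distros and sim_engines. On a host with mixed GPU models the first name wins, which is unlikely to be what you want.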

Verify with doctor

rbtk-runner doctor --gpu

 Docker 24.0.7 reachable
 NVIDIA driver 535.183.01
 nvidia-container-toolkit 1.16.2
 GPU 0: Tesla T4 (15109 MiB free), CUDA 12.4
 Test container can reach GPU: PASS
 Pool capacity declared: 1 × Tesla T4
 Routing eligible for: requires_gpu=true, gazebo-harmonic, webots
A red mark flags a failing host-side prerequisite; fix it before relying on the runner for GPU jobs.

Multi-GPU pools

Two patterns work.

One runner, multiple GPUs

If you have a 4-GPU box and want the runner to schedule one GPU per job:
capabilities:
  gpu:
    enabled: true
    count: 4
    model: "L40"
    cuda: "12.4"

resources:
  max_concurrent_jobs: 4
The runner sets NVIDIA_VISIBLE_DEVICES per job, so each container sees exactly one GPU by default. A job that declares gpu_count: 2 gets two.
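
The per-job assignment can be pictured as picking device indices from the set that is currently free. The sketch below illustrates that idea only; `pick_gpus` is a hypothetical function, and the runner's actual allocator may work differently.

```shell
# Hypothetical sketch of per-job GPU assignment; not the runner's allocator.
# pick_gpus "FREE_INDICES" WANT: prints the first WANT free indices joined by
# commas (the NVIDIA_VISIBLE_DEVICES format); fails if too few are free.
pick_gpus() {
  free="$1"
  want="$2"
  picked=""
  n=0
  for idx in $free; do
    [ "$n" -ge "$want" ] && break
    picked="${picked:+$picked,}$idx"
    n=$((n + 1))
  done
  [ "$n" -eq "$want" ] || return 1
  printf '%s\n' "$picked"
}

# Usage: see what a container pinned to one free GPU can reach.
#   docker run --rm --runtime=nvidia \
#     -e NVIDIA_VISIBLE_DEVICES="$(pick_gpus "2 3" 1)" \
#     nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi -L
```

With GPUs 0 to 3 free, a gpu_count: 2 job would get `0,1` in this model, and a request for more GPUs than are free fails instead of returning a partial set.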

One runner per GPU

If you want hard isolation (one process per GPU), run multiple rbtk-runner instances on the same host with disjoint GPU sets:
# Runner A — GPU 0 only
CUDA_VISIBLE_DEVICES=0 rbtk-runner start --config /etc/roboticks/runner-gpu0.yaml

# Runner B — GPU 1 only
CUDA_VISIBLE_DEVICES=1 rbtk-runner start --config /etc/roboticks/runner-gpu1.yaml
Each runner appears as a distinct row in the pool. This pattern costs more operational overhead but keeps the blast radius of a failure to a single GPU.

Routing for GPU jobs

A test config requests GPU like this:
# .roboticks/test.yaml
requires:
  gpu: true
  gpu_model: "T4|L4"   # regex match against capability.model
  cuda: ">=12.0"
The job_router filters to runners whose declared gpu block satisfies all three. If multiple match, it picks the least-loaded one. If none match, the job queues until one comes online (or times out per project policy).
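
The three-way filter can be approximated as follows. This is an illustrative sketch, not the job_router implementation: `version_ge` and `runner_matches` are hypothetical names, the model pattern is treated as an unanchored extended regex, and only the `>=X.Y` form of the cuda requirement is handled.

```shell
# Hypothetical sketch of the GPU routing filter; not the job_router source.

# version_ge A B: succeeds when dotted version A >= B (major.minor compared).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
    exit !(x[2] + 0 >= y[2] + 0)
  }'
}

# runner_matches ENABLED MODEL CUDA MODEL_REGEX CUDA_SPEC
# Succeeds only when all three requirements hold for the declared gpu block.
runner_matches() {
  [ "$1" = "true" ] || return 1                        # requires.gpu: true
  printf '%s\n' "$2" | grep -Eq -- "$4" || return 1    # gpu_model regex
  min_cuda="$(printf '%s' "$5" | sed 's/^>=//')"       # strip the ">=" prefix
  version_ge "$3" "$min_cuda"                          # cuda: ">=X.Y"
}

# Usage:
#   runner_matches true "Tesla T4" 12.4 "T4|L4" ">=12.0" && echo eligible
```

Under the unanchored-match assumption, the pattern `T4|L4` also matches a runner that declares `L40`, because `L4` is a substring of it. If that is not intended, a tighter pattern such as `T4|L4$` would exclude it in this sketch; check how the platform anchors the match before relying on either behaviour.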

Common pitfalls

The Docker daemon needs the NVIDIA runtime registered. After nvidia-ctk runtime configure you must restart Docker (sudo systemctl restart docker). Verify with docker info | grep -i runtime.
The host driver must support the container’s CUDA major version. Driver 535 covers CUDA 12.x; for CUDA 13.x you need driver 580+.
Gazebo needs OpenGL via EGL. Add --gpus all -e __GLX_VENDOR_LIBRARY_NAME=nvidia -e __NV_PRIME_RENDER_OFFLOAD=1 — the runner does this automatically when sim: gazebo-harmonic is declared, but if you override the image, copy these envs into your Dockerfile.
Set resources.max_concurrent_jobs to the same value as capabilities.gpu.count so the runner does not over-subscribe its GPUs. Per-GPU memory caps via NVIDIA MIG are out of scope for v2.x.

Next steps

Pool management

Per-pool stats, tagging, draining.

Troubleshooting

Capability mismatch, MCAP upload, version skew.