Troubleshooting

Most runner issues fall into one of the buckets below. Start with rbtk-runner doctor, then walk the list.

Runner never picks up a job

A runner that heartbeats ONLINE but never receives jobs is almost always a capability mismatch.

Diagnose

# Get the job's required capabilities
rbtk test inspect <test-run-id>
# requires: { ros: humble, sim: gazebo-harmonic, gpu: true }

# Get the runner's declared capabilities
rbtk pool runners --pool prod-gpu-farm
# gpu-host-04: ros:humble,iron · sim:gazebo · gpu:T4   ← OK
# cpu-host-02: ros:humble        · sim:none  · gpu:no  ← skipped (no sim, no gpu)

Common mismatches

| Job needs | Runner declares | Fix |
| --- | --- | --- |
| ros: humble | ros_distros: [iron] only | Install Humble; add humble to runner.yaml |
| sim: gazebo-harmonic | sim_engines: [gazebo-classic] | Install Harmonic; switch the declaration |
| gpu: true | gpu.enabled: false | See GPU setup |
| label ldra-licensed | label not present | Add the label to the runner’s capabilities.labels |

Reload runner.yaml or restart the service after changes.
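
To compare the two outputs mechanically, a small shell helper can list which required capability tokens a runner does not declare. This is a minimal sketch: missing_caps is a hypothetical helper, it compares exact tokens only, and the platform's real matching rules may be looser or stricter than that.

```shell
#!/usr/bin/env bash
# missing_caps REQUIRED DECLARED
# Prints each space-separated token in REQUIRED that is absent from DECLARED.
# Hypothetical helper: the tokens are whatever you copy out of
# "rbtk test inspect" and "rbtk pool runners" by hand.
missing_caps() {
  local required="$1" declared=" $2 " cap
  for cap in $required; do
    case "$declared" in
      *" $cap "*) ;;                  # declared, nothing to report
      *) printf '%s\n' "$cap" ;;      # required but not declared
    esac
  done
}
```

For example, missing_caps "humble gazebo-harmonic" "humble iron gazebo-classic" prints gazebo-harmonic, which points at the second row of the table.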

MCAP upload fails

The runner uploads MCAP files via S3 presigned URLs that the platform mints just-in-time.

Symptom

[error] mcap upload failed: 403 SignatureDoesNotMatch
[error] mcap upload failed: 403 The request signature we calculated does not match

Causes and fixes

| Symptom | Cause | Fix |
| --- | --- | --- |
| 403 SignatureDoesNotMatch immediately on first chunk | Host clock skew > 5 minutes | sudo timedatectl set-ntp true; verify with timedatectl status |
| 403 Request has expired mid-upload | MCAP > 5 GB and upload slower than the 1-hour presign TTL | Switch to multipart upload (enabled by default in v2.2+); upgrade if pinned older |
| connection refused to S3 host | Outbound firewall blocking S3 endpoint | Allow *.s3.amazonaws.com and *.s3.<region>.amazonaws.com. Air-gapped: allow your on-prem object store |
| 407 Proxy authentication required | Corporate proxy in the path | Set HTTPS_PROXY and NO_PROXY=api.roboticks.io,s3.internal in the systemd unit |
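
Clock skew can be estimated before touching NTP settings by comparing the local clock with the Date header of any HTTPS response. The sketch below assumes curl and GNU date are installed; skew_seconds and check_skew are hypothetical helper names, and the 300-second default mirrors the 5-minute threshold in the table.

```shell
#!/usr/bin/env bash
# skew_seconds LOCAL_EPOCH REMOTE_EPOCH
# Prints the absolute difference between two Unix timestamps.
skew_seconds() {
  local d=$(( $1 - $2 ))
  echo "${d#-}"                       # strip a leading minus sign
}

# check_skew URL [LIMIT_SECONDS]
# Compares the local clock with the Date header returned by URL.
# Returns 0 if within the limit, 1 if over it, 2 if the header is unusable.
check_skew() {
  local url="$1" limit="${2:-300}" remote_date remote_epoch skew
  remote_date=$(curl -sSI "$url" | tr -d '\r' |
    awk 'tolower($1) == "date:" { sub(/^[^ ]+ /, ""); print; exit }')
  [ -n "$remote_date" ] || return 2
  remote_epoch=$(date -d "$remote_date" +%s) || return 2   # GNU date syntax
  skew=$(skew_seconds "$(date +%s)" "$remote_epoch")
  echo "clock skew: ${skew}s (limit ${limit}s)"
  [ "$skew" -le "$limit" ]
}
```

Running check_skew https://api.roboticks.io/healthz on the runner host gives a rough figure; the server's own clock and network latency add a second or two of noise, which does not matter at a 5-minute threshold.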

Heartbeat lapse

The platform marks a runner OFFLINE after 60 s without a heartbeat.

Diagnose

# On the runner host
journalctl -u rbtk-runner --since "10 minutes ago" | grep -E '(heartbeat|error)'

# Network reachability
curl -fsS https://api.roboticks.io/healthz && echo OK

Common causes

| Cause | Symptom | Fix |
| --- | --- | --- |
| Runner token revoked | 401 unauthorized on heartbeat | Re-register with a fresh registration token |
| Platform unreachable | connection timed out | Check firewall, DNS, proxy |
| Process killed (OOM) | dmesg shows Out of memory: Killed process rbtk-runner | Lower max_concurrent_jobs or add memory |
| Clock skew | Heartbeat 401s with clock skew detected | Sync NTP |
| Docker daemon down | Logs show cannot connect to docker daemon | sudo systemctl start docker |
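
When the journal is noisy, the table above can be turned into a small classifier. This is a sketch only: classify_heartbeat_error is a hypothetical helper, and the match strings are the ones quoted in the table, so real log lines may be worded differently.

```shell
#!/usr/bin/env bash
# classify_heartbeat_error LINE
# Prints the suggested fix for a log line, or "unknown" (exit 1) if no
# pattern matches. Clock skew is checked before the generic 401 case
# because a skew-related line may contain both strings.
classify_heartbeat_error() {
  case "$1" in
    *"clock skew detected"*)             echo "sync NTP" ;;
    *"401 unauthorized"*)                echo "re-register with a fresh registration token" ;;
    *"connection timed out"*)            echo "check firewall, DNS, proxy" ;;
    *"cannot connect to docker daemon"*) echo "start docker" ;;
    *"Out of memory"*)                   echo "lower max_concurrent_jobs or add memory" ;;
    *)                                   echo "unknown"; return 1 ;;
  esac
}
```

Feeding it the lines from the journalctl command shown earlier, one line per call, and counting the results with sort | uniq -c usually makes the dominant cause obvious.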

Version skew with the platform

rbtk-runner v2.x is wire-compatible with the latest platform. When the platform advances to a new wire-contract minor version, older runners emit a deprecation warning; after 90 days the warning becomes a hard error.

Diagnose

rbtk-runner status

Version           v2.1.0
Latest stable     v2.4.1
Platform wire     v2.4.x (incompatible with runner v2.1.0; upgrade by 2026-06-30)

Fix

Upgrade via the install path you used originally:
# One-liner reinstall
curl -fsSL https://get.roboticks.io/runner | bash

# Homebrew
brew upgrade rbtk-runner

# Chocolatey
choco upgrade rbtk-runner
Then drain and replace — see Run as a service.
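
For a fleet, the same comparison can be scripted once the two version strings have been collected. The sketch below treats a runner as needing an upgrade when its major.minor is older than the platform's wire version, which is one reading of the policy described above rather than the platform's own check. minor_of and needs_upgrade are hypothetical helpers, and sort -V is assumed to be available (GNU coreutils).

```shell
#!/usr/bin/env bash
# minor_of VERSION
# Prints MAJOR.MINOR for strings like v2.4.1 or v2.4.x.
minor_of() {
  local v="${1#v}"                    # drop an optional leading "v"
  echo "${v%.*}"                      # drop the last dot-separated field
}

# needs_upgrade RUNNER_VERSION PLATFORM_WIRE
# Exits 0 when the runner's major.minor is older than the platform's.
needs_upgrade() {
  local runner platform oldest
  runner=$(minor_of "$1")
  platform=$(minor_of "$2")
  [ "$runner" != "$platform" ] || return 1
  # Version-aware sort, so that 2.9 orders before 2.10.
  oldest=$(printf '%s\n%s\n' "$runner" "$platform" | sort -V | head -n 1)
  [ "$oldest" = "$runner" ]
}
```

With the values from the status output, needs_upgrade v2.1.0 v2.4.x succeeds, while needs_upgrade v2.4.1 v2.4.x does not.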

Registration token issues

$ rbtk-runner register --token rbtk_pool_reg_xx...
Error: registration failed: token expired or already used
Registration tokens are single-use and live for 1 hour. Mint a new one:
rbtk pool register-runner --project warehouse --pool prod-gpu-farm
If you need to register many runners (e.g., autoscaling), pre-mint a batch with --ttl 24h and --uses 50 — these flags require Enterprise tier.

Job timeouts

A job killed at the job_timeout boundary surfaces as failed with reason runner_timeout.

Diagnose

rbtk test logs <test-run-id> --tail 100
# ... last visible test stdout ...
[runner] timeout after 30m, sending SIGTERM
[runner] container did not exit cleanly, sending SIGKILL

Fix

Increase resources.job_timeout in runner.yaml:
resources:
  job_timeout: 90m
Or override per-test with timeout: in .roboticks/test.yaml.
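
Before raising the limit, it can help to check locally whether the test command exits cleanly on SIGTERM at all. The sketch below is a rough local analogue of the SIGTERM-then-SIGKILL sequence in the log, built on GNU coreutils timeout; it is not the runner's implementation, run_with_deadline is a hypothetical helper, and the grace period the runner actually uses is not stated here.

```shell
#!/usr/bin/env bash
# run_with_deadline LIMIT GRACE CMD [ARGS...]
# Sends SIGTERM to CMD after LIMIT, then SIGKILL if it is still alive
# GRACE later. LIMIT and GRACE take timeout(1) durations such as 30m or 10s.
# Exit status 124: the limit was hit and CMD stopped after SIGTERM.
# Exit status 137: CMD ignored SIGTERM and had to be killed.
# Any other status is CMD's own exit status.
run_with_deadline() {
  local limit="$1" grace="$2"
  shift 2
  timeout --signal=TERM --kill-after="$grace" "$limit" "$@"
}
```

A status of 137 from a short trial run, for example run_with_deadline 5s 2s followed by your test command, suggests the test does not handle SIGTERM and will always end with the SIGKILL line in the runner log.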

Docker permission denied

[error] failed to start container: permission denied while trying to connect to the Docker daemon socket
The runner user must be in the docker group:
sudo usermod -aG docker rbtk
sudo systemctl restart rbtk-runner

Disk space exhaustion

The runner cleans up containers and the work-dir between jobs, but it does not prune Docker images. Periodically prune unused Docker data older than 7 days:
docker system prune -af --filter "until=168h"
Or wire it into a weekly systemd timer.
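
One way to set up the timer is a oneshot service paired with a weekly timer unit. The sketch below writes both files; the unit name rbtk-docker-prune and the /usr/bin/docker path are assumptions, so adjust them for your host.

```shell
#!/usr/bin/env bash
# write_prune_units [UNIT_DIR]
# Writes a oneshot service and a weekly timer that run the prune command.
# UNIT_DIR defaults to /etc/systemd/system (needs root to write there).
write_prune_units() {
  local unit_dir="${1:-/etc/systemd/system}"
  cat > "$unit_dir/rbtk-docker-prune.service" <<'EOF'
[Unit]
Description=Prune unused Docker data older than 7 days
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
ExecStart=/usr/bin/docker system prune -af --filter "until=168h"
EOF
  cat > "$unit_dir/rbtk-docker-prune.timer" <<'EOF'
[Unit]
Description=Weekly Docker prune

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After calling write_prune_units as root, run systemctl daemon-reload and then systemctl enable --now rbtk-docker-prune.timer. Persistent=true makes systemd run a missed prune at the next boot if the host was powered off at the scheduled time.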

Getting more signal

Crank logs to debug and re-run:
ROBOTICKS_LOG_LEVEL=debug rbtk-runner start
At the debug level, the log includes every HTTP request to the platform, the full docker run command for each job, and S3 multipart upload chunk timings.

Still stuck?

See Configuration for capabilities, resource limits, and log level.

See GPU setup for NVIDIA driver and container-toolkit issues.