
PagerDuty

PagerDuty is for critical events: the ones worth waking someone up for. Roboticks routes a deliberately small set of events here.

What gets paged

| Event | Default severity | Why it pages |
|---|---|---|
| evidence_pack.generation_failed (on a release tag) | critical | A release can’t ship audit-ready without it |
| runner_pool.offline for an air-gapped pool > 15 minutes | critical | Air-gapped customers have no fallback pool |
| runner_pool.offline for any pool > 60 minutes | error | Build pipeline is blocked |
| test_run.completed (failed) on the default branch after release | error | Regression on a released line |
| standard.amendment_published for a pinned safety standard | warning (off by default) | Compliance review required |
Everything else routes via Slack or Email — don’t page on PR-level test failures.
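
As a rough illustration of the two runner-pool thresholds in the table, the sketch below decides whether an offline pool should page and at what severity. The function name and constants are hypothetical and are not part of Roboticks; only the 15-minute and 60-minute thresholds come from the table.

```python
from typing import Optional

# Thresholds taken from the table above; the names are illustrative only.
AIRGAPPED_OFFLINE_MINUTES = 15
ANY_POOL_OFFLINE_MINUTES = 60


def pool_offline_severity(offline_minutes: float, airgapped: bool) -> Optional[str]:
    """Return the paging severity for an offline pool, or None if it should not page yet."""
    # Air-gapped pools page sooner and at a higher severity.
    if airgapped and offline_minutes > AIRGAPPED_OFFLINE_MINUTES:
        return "critical"
    # Any other pool pages only after the longer threshold.
    if offline_minutes > ANY_POOL_OFFLINE_MINUTES:
        return "error"
    return None
```

For example, an air-gapped pool that has been offline for 17 minutes would page as critical, while a regular pool offline for the same time would not page yet.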

Set up

1. Create a PagerDuty service

In PagerDuty: Services → New service. Pick (or create) an escalation policy. Name it Roboticks Production or similar.

2. Add an Events API v2 integration

On the new service: Integrations → Add integration → Events API v2. Copy the 32-character Integration Key.

3. Wire it into Roboticks

Settings → Integrations → PagerDuty → Add. Paste the Integration Key. Name the connector (e.g., pd-production).

4. Pick which events page

On the same screen, toggle the event types. The defaults match the table above.

5. Send a test event

Hit Send test event. A trigger fires in PagerDuty within a few seconds; resolve manually after verifying.

Severity mapping

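Roboticks event severities map one-to-one to PagerDuty severities, each with a default incident urgency:

| Roboticks severity | PagerDuty severity | Default urgency |
|---|---|---|
| critical | critical | High |
| error | error | High |
| warning | warning | Low |
| info | info | Low (suppressed by default) |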
Override per-service via PagerDuty’s Service → Notification rules.
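
The mapping can be pictured as a simple lookup. The following is a minimal sketch, not Roboticks code: the dictionary layout, the function name, and the `send_info` flag are assumptions made for illustration, while the severity and urgency values come from the table.

```python
from typing import Optional, Tuple

# Roboticks severity -> (PagerDuty severity, default urgency), per the table above.
SEVERITY_MAP = {
    "critical": ("critical", "high"),
    "error": ("error", "high"),
    "warning": ("warning", "low"),
    "info": ("info", "low"),
}

# Severities that are not forwarded unless explicitly enabled.
SUPPRESSED_BY_DEFAULT = {"info"}


def map_severity(roboticks_severity: str, send_info: bool = False) -> Optional[Tuple[str, str]]:
    """Return (pagerduty_severity, urgency), or None when the event is suppressed."""
    if roboticks_severity not in SEVERITY_MAP:
        raise ValueError(f"unknown severity: {roboticks_severity!r}")
    if roboticks_severity in SUPPRESSED_BY_DEFAULT and not send_info:
        return None
    return SEVERITY_MAP[roboticks_severity]
```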

Deduplication and auto-resolve

Roboticks uses a deterministic dedup_key so transient flaps don’t open a second incident:
| Event | dedup_key form |
|---|---|
| Runner pool offline | rbtk:pool-offline:{project_slug}:{pool_name} |
| Evidence pack failure | rbtk:evidence-fail:{project_slug}:{release_tag} |
| Post-release test failure | rbtk:postrelease-fail:{project_slug}:{branch}:{test_id} |
When the underlying condition clears (pool comes back online, evidence pack succeeds on retry, next post-release run passes), Roboticks sends an event_action: resolve and PagerDuty auto-closes the incident.
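
To make the pairing of trigger and resolve concrete, here is a small sketch that builds a deterministic key in the runner-pool form shown above and the matching resolve event. The helper names are hypothetical; the key format and the `event_action: resolve` field come from this page, and a resolve event in the Events API v2 needs only the routing key, the action, and the dedup key.

```python
def pool_offline_key(project_slug: str, pool_name: str) -> str:
    """Build the dedup_key for a runner-pool-offline event."""
    return f"rbtk:pool-offline:{project_slug}:{pool_name}"


def resolve_event(routing_key: str, dedup_key: str) -> dict:
    """Build the event that closes the incident opened under the same dedup_key."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }


key = pool_offline_key("warehouse", "onprem-airgapped")
event = resolve_event("<your-integration-key>", key)
```

Because the key depends only on the project and pool name, a repeated trigger for the same pool carries the same key as the first, and the later resolve targets that same incident.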

Multiple services

You’ll often want separate PagerDuty services per environment:
Roboticks Production   → on-call rota A
Roboticks Staging      → on-call rota B (low urgency only)
In Roboticks, add two PagerDuty connectors and route events by project or by event type under Settings → Integrations → PagerDuty → Routing.
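
Routing by project or event type can be thought of as a first-match rule list. The sketch below is purely illustrative and does not reflect how Roboticks stores routing rules: the rule structure, the `pd-staging` connector, and the `warehouse-staging` project slug are all assumed names; `pd-production` is the example connector name from the setup steps.

```python
# Hypothetical rules, evaluated top to bottom; the first match wins.
# A rule may constrain the project, the event type, or both.
RULES = [
    {"project": "warehouse-staging", "connector": "pd-staging"},
    {"event_type": "standard.amendment_published", "connector": "pd-staging"},
]
DEFAULT_CONNECTOR = "pd-production"


def pick_connector(project: str, event_type: str) -> str:
    """Return the connector name an event should be delivered to."""
    for rule in RULES:
        # A missing field in a rule means "match anything".
        project_ok = rule.get("project") in (None, project)
        type_ok = rule.get("event_type") in (None, event_type)
        if project_ok and type_ok:
            return rule["connector"]
    return DEFAULT_CONNECTOR
```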

Event payload

{
  "routing_key": "<your-integration-key>",
  "event_action": "trigger",
  "dedup_key": "rbtk:pool-offline:warehouse:onprem-airgapped",
  "payload": {
    "summary": "Air-gapped pool onprem-airgapped offline for 17 minutes",
    "severity": "critical",
    "source": "roboticks/warehouse",
    "component": "runner-pool",
    "group": "platform",
    "class": "runner_pool_offline",
    "custom_details": {
      "project_slug": "warehouse",
      "pool_name": "onprem-airgapped",
      "pool_type": "self-hosted",
      "airgapped": true,
      "last_heartbeat_at": "2026-05-24T11:42:01Z",
      "queued_jobs": 14
    }
  },
  "links": [{
    "href": "https://app.roboticks.io/r/warehouse/pools/onprem-airgapped",
    "text": "View pool in Roboticks"
  }],
  "client": "Roboticks",
  "client_url": "https://app.roboticks.io"
}
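
If you want to reproduce a delivery by hand, for example to check an Integration Key outside Roboticks, a payload of this shape can be posted to PagerDuty's Events API v2 endpoint. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the field values are whatever you supply. Calling `send_event` with a real key opens a real incident.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger(routing_key, dedup_key, summary, severity, source, custom_details=None):
    """Build a trigger event with the same top-level shape as the payload above."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "severity": severity,
            "source": source,
            "custom_details": custom_details or {},
        },
    }


def send_event(event, timeout=10.0):
    """POST the event as JSON and return PagerDuty's decoded JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


event = build_trigger(
    routing_key="<your-integration-key>",
    dedup_key="rbtk:pool-offline:warehouse:onprem-airgapped",
    summary="Air-gapped pool onprem-airgapped offline for 17 minutes",
    severity="critical",
    source="roboticks/warehouse",
    custom_details={"pool_name": "onprem-airgapped", "queued_jobs": 14},
)
# send_event(event)  # uncomment only with a real Integration Key
```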

Suppression windows

You can suppress paging during planned maintenance:
rbtk admin pd suppress --connector pd-production --until "2026-05-25T03:00Z" --reason "Maintenance"
Or via PagerDuty’s native Maintenance windows on the service.
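
Conceptually, a suppression window is a comparison against the `--until` timestamp. The sketch below parses the timestamp format used in the command above and checks whether a given moment falls inside the window. It is an illustration only: the function names are hypothetical, and treating the `--until` instant itself as already outside the window is an assumption.

```python
from datetime import datetime, timezone


def parse_until(value: str) -> datetime:
    """Parse a UTC timestamp such as "2026-05-25T03:00Z" into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)


def is_suppressed(now: datetime, until: datetime) -> bool:
    """Return True while paging should be held back (assumed exclusive of `until`)."""
    return now < until


until = parse_until("2026-05-25T03:00Z")
```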

Troubleshooting

If events aren't arriving in PagerDuty:
  • Verify the Integration Key is the Events API v2 key (32 chars), not a v1 or REST API key.
  • Check the delivery log in Roboticks at Settings → Integrations → PagerDuty → Delivery log.
If incidents aren't deduplicating as expected, the most common cause is a PagerDuty alert-grouping rule on your service overriding the dedup_key. Set the service’s incident behaviour to “Create alerts and incidents” with grouping “Off — use Roboticks deduplication”.
If incidents stay open after the condition clears, check the delivery log for resolve rows (Roboticks sends event_action: resolve when the condition clears). If resolves are being sent but PagerDuty isn’t acting on them, your service’s notification policy may be sticky; adjust the Auto-resolve setting.
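
A quick local sanity check on a pasted key can rule out the most obvious mistakes before you dig into the delivery log. The sketch below only checks the 32-character length mentioned above and, as an assumption, that the key is plain ASCII letters and digits. It cannot confirm that a key belongs to an Events API v2 integration; only the PagerDuty service page can tell you that.

```python
def looks_like_integration_key(key: str) -> bool:
    """Return True if the key has the expected length and character set.

    A length and character check only: it catches truncated or padded
    pastes, not keys copied from the wrong integration type.
    """
    key = key.strip()  # stray whitespace is a common copy-paste artifact
    return len(key) == 32 and key.isascii() and key.isalnum()
```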

Best practice

  • One service per environment. Don’t mix prod and staging on one rota.
  • Route only the five default event types. Resist the urge to page on PR failures.
  • Add a runbook URL in PagerDuty’s service config — it appears in every incident.
  • Test the rotation end-to-end at least quarterly.