AI for test debugging

Five AI surfaces sit on the test side of the product. Each one is a button you press; each one writes its output back into the run’s history and conversation thread so the next engineer doesn’t have to re-prompt.

Test failure triage

Surface: Run-detail page → AI Triage panel on any failed TestJob row.

What it ingests:

The failing TestJob’s metadata (name, error message, exit code).
Up to 200 recent log lines from the run.
The runner system log (capped at 50 KB) — the agent-side log that captures the launch failure, the docker pull error, etc.
The test’s @confirms and @tags so the prompt knows what requirements were at stake.

What it returns:

A short summary of the most likely root cause.
Three to five concrete recommendations (for example: rerun with a longer timeout, fix the topic name in assert_topic_published, or mark @deadline slack-friendly).
A severity (low | medium | high | critical) used by the dashboard sort.
A confidence score (0.0 – 1.0).
Three follow-up questions you can click to start a conversation.

Endpoint: POST /api/v1/organizations/{org}/projects/{project}/analyze/test-failure. Plan gate: any plan with non-zero ai_tokens_balance.
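As a rough illustration, a script could prepare the triage call and order the returned analyses as sketched below. The endpoint path and the four severity levels come from the description above. The base URL, the bearer-token header, the `test_job_id` body field, the JSON key names, and the confidence tie-break are assumptions made for this example, not documented API.

```python
import json
import urllib.request

# The four levels come from the docs; the numeric ranks are an assumption.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def triage_url(base_url: str, org: str, project: str) -> str:
    """Build the test-failure triage URL from the documented path."""
    return (
        f"{base_url.rstrip('/')}/api/v1/organizations/{org}"
        f"/projects/{project}/analyze/test-failure"
    )


def build_triage_request(base_url, org, project, test_job_id, token):
    """Prepare (but do not send) the POST request.

    The `test_job_id` field and the Authorization header are hypothetical.
    """
    body = json.dumps({"test_job_id": test_job_id}).encode("utf-8")
    return urllib.request.Request(
        triage_url(base_url, org, project),
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def sort_by_severity(analyses):
    """Most severe first; within a severity, highest confidence first."""
    return sorted(
        analyses,
        key=lambda a: (SEVERITY_RANK[a["severity"]], -a["confidence"]),
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without network access.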

Test-run analysis

Surface: Run-detail page → “Analyze run” button. Same plumbing as test-failure triage, but the prompt is scoped to the whole run (all failures and skipped tests, plus the run’s overall logs). Useful when more than one test broke and you want a single thread instead of N triage analyses. Endpoint: POST /analyze/test-run with {test_run_id, include_logs, log_limit}.
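A minimal sketch of the request body, assuming the three field names listed above. The field types, the default values, and the `pick_analysis` helper are illustrative assumptions; the helper only mirrors the guidance that a whole-run analysis suits runs with more than one failure.

```python
from dataclasses import dataclass, asdict


@dataclass
class RunAnalysisRequest:
    """Body for POST /analyze/test-run; field names are from the docs."""
    test_run_id: str
    include_logs: bool = True  # assumed default
    log_limit: int = 200       # assumed default

    def to_payload(self) -> dict:
        if self.log_limit < 0:
            raise ValueError("log_limit must be non-negative")
        return asdict(self)


def pick_analysis(failed_job_ids):
    """Return 'test-failure' for a single failure, 'test-run' for several."""
    if not failed_job_ids:
        raise ValueError("nothing failed, nothing to analyse")
    return "test-failure" if len(failed_job_ids) == 1 else "test-run"
```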

Conversations

Every analysis can be turned into a follow-up conversation. The platform keeps the original analysis as the system message, lets you and the AI exchange short messages, and tracks token usage per turn against your ai_tokens balance.

POST /conversations with {analysis_id, initial_question}
POST /conversations/{id}/messages to continue
POST /conversations/{id}/archive to close

Each message is a Sonnet call with the conversation history truncated to fit Sonnet’s 15 000-token input budget; older turns are summarised back into the system message on demand.
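The truncation step can be pictured with the sketch below. It is not the platform's implementation: the four-characters-per-token estimate, the function names, and the newest-first policy are assumptions. It keeps the most recent turns that fit next to the system message and returns the older ones as candidates for summarisation.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def fit_history(system_message, turns, budget=15_000):
    """Split turns into (overflow, kept) so system message + kept fit budget.

    Walks from the newest turn backwards. Everything older than the first
    turn that does not fit lands in overflow, ready to be summarised.
    """
    remaining = budget - estimate_tokens(system_message)
    cut = len(turns)
    for i in range(len(turns) - 1, -1, -1):
        cost = estimate_tokens(turns[i])
        if cost > remaining:
            break
        remaining -= cost
        cut = i
    return turns[:cut], turns[cut:]
```

A real tokenizer would replace `estimate_tokens`; the splitting logic stays the same.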

Test flakiness

Surface: Test case detail → “Flakiness analysis” dialog. Available once a test has at least three runs in the project’s history. What it does: takes the last N runs of the same test identity (matched by roboticks.nodeid), bundles their pass/fail outcomes + stderr snippets + duration deltas, and asks Sonnet to classify the flakiness into one of:

Environment / timing — passes in CI but fails locally, or vice versa; tied to a fault-injection rate or a slow sim seed.
State leak — depends on test ordering; cleanup not running.
Genuinely intermittent — race condition, non-deterministic upstream input.
Stable, you’re imagining it.

The output card surfaces the classification, a confidence, the run IDs that drove the verdict, and a recommended next action (mark @flaky, add a retry, fix the cleanup hook, etc.). Task type: TEST_FLAKINESS_ANALYSIS (Sonnet, 30 000-token input budget, 12 token-cost).
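The bundling that precedes the model call might look like the sketch below. The three-run minimum and the matching by node ID come from the text above; the record keys (`nodeid`, `started_at`, `passed`, `duration_s`), the default of ten runs, and the flip count are assumptions for illustration. Classification itself is left to the model.

```python
from collections import defaultdict

MIN_RUNS = 3  # from the docs: available once a test has at least three runs


def last_runs_by_nodeid(runs, n=10):
    """Group run records by node ID and keep the n most recent per test.

    Each record is a dict with hypothetical keys:
    nodeid, started_at (sortable), passed (bool), duration_s (float).
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["nodeid"]].append(run)
    return {
        nodeid: sorted(items, key=lambda r: r["started_at"])[-n:]
        for nodeid, items in grouped.items()
    }


def flakiness_bundle(history):
    """Summarise one test's history, or return None below MIN_RUNS."""
    if len(history) < MIN_RUNS:
        return None
    outcomes = [r["passed"] for r in history]
    durations = [r["duration_s"] for r in history]
    return {
        "outcomes": outcomes,
        # Number of pass/fail changes between consecutive runs.
        "flips": sum(a != b for a, b in zip(outcomes, outcomes[1:])),
        "duration_deltas": [
            round(b - a, 3) for a, b in zip(durations, durations[1:])
        ],
    }
```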

Sim vs real comparison

Surface: Sim run detail page → “Compare to real run” action. You pick a real-world run for the same project; the platform pairs the runs by roboticks.nodeid and asks Opus to call out divergences. The prompt builds a side-by-side table of:

Pass/fail outcome per test.
Duration delta.
Topic publish/receive counts from the MCAPs if present.
Fault-injection state if either side used roboticks.fault_injection.

The output card surfaces categories of divergence (e.g. “real run fails on the same obstacle the sim cleared because the sim model lacks /scan jitter”), grouped by likely cause. This is the densest AI surface in the product — Opus, 50 000-token input window, 30 token-cost per call. Task type: SIM_COMPARISON.
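The pairing step can be sketched as follows. Only the idea of matching by node ID and comparing outcome and duration comes from the text above; the input shape, the key names, and the choice to skip tests present on only one side are assumptions. Topic counts and fault-injection state are left out for brevity.

```python
def pair_runs(sim_results, real_results):
    """Pair per-test results from a sim run and a real run by node ID.

    Each input maps nodeid -> {"passed": bool, "duration_s": float}
    (hypothetical shape). Tests present on only one side are skipped.
    """
    rows = []
    for nodeid in sorted(sim_results.keys() & real_results.keys()):
        sim, real = sim_results[nodeid], real_results[nodeid]
        rows.append({
            "nodeid": nodeid,
            "sim_passed": sim["passed"],
            "real_passed": real["passed"],
            "diverged": sim["passed"] != real["passed"],
            "duration_delta_s": round(
                real["duration_s"] - sim["duration_s"], 3
            ),
        })
    return rows
```

Filtering the rows on `diverged` gives the subset most worth putting in front of the model.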

Inline log anomalies

Surface: Run-detail logs tab. Lines the platform flagged as anomalous are highlighted with a sidebar marker; click the marker to see a short Haiku-generated explanation. Detection is staged:

Heuristic pass (regex + simple statistical thresholds) marks candidate lines server-side.
Up to ~20 candidate lines per request are batched into a single Haiku call to summarise the anomaly: what’s unusual, what other lines correlate, whether the line is consistent with a known failure mode.

The Haiku call uses task type LOG_SUMMARIZATION (2 token-cost). Because candidate lines are batched, the cost is incurred per batched call rather than per line. Results are cached on the AIAnalysis row, so re-opening the logs tab is free.
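The two stages can be illustrated with the sketch below. The regular expression is a stand-in, since the platform's actual heuristics are not described here, and the statistical thresholds are omitted. The batch size of 20 follows the "up to ~20 candidate lines per request" figure above; the function names are hypothetical.

```python
import re

# Illustrative pattern only; not the platform's real heuristic.
CANDIDATE_PATTERN = re.compile(
    r"\b(error|timeout|traceback|segfault)\b", re.IGNORECASE
)
BATCH_SIZE = 20  # from the docs: up to ~20 candidate lines per request


def find_candidates(lines):
    """Stage 1: return (line_number, text) for lines matching the heuristic."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if CANDIDATE_PATTERN.search(line)
    ]


def batch(candidates, size=BATCH_SIZE):
    """Stage 2 input: chunk candidates so each call sees at most `size` lines."""
    return [candidates[i:i + size] for i in range(0, len(candidates), size)]
```

With 45 flagged lines this yields three batches, so three model calls instead of 45.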

Feedback loop

Every AI card has a thumbs-up / thumbs-down + optional comment. The feedback is stored on the AIAnalysis row (feedback_helpful, feedback_comment, feedback_at) and surfaced in the AI usage admin dashboard. Use it — the platform team retrains prompt templates against negative feedback signal.

What this is NOT

It is not a deterministic test oracle. A failed @confirms is still failed even if the AI is hopeful about it.
It does not modify any code. The “fix” is always a suggestion you implement yourself.
It does not see your private repo unless your test logs include code paths. Sandbox the test environment if log scrubbing matters to you.

Requirements & traceability assists

Move from “why did this test fail” to “what requirement was at stake”.

Evidence & standards

Audit-time AI assists for the engineer-to-auditor handoff.
