
Failed Case drawer

Opened from the Evaluations Section or Evaluation drawer when a case fails.

Sections

| Section | Content |
| --- | --- |
| Expected | The assertion's expected value (rendered by type) |
| Actual | The agent's actual output for this case |
| Diff | Inline character-level diff for substring / regex types |
| Run trace | Link to the originating run (opens Run Detail) |
| Token cost | Prompt + completion tokens consumed |
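
As a rough illustration of the data behind these sections, the sketch below models a failed case and derives an inline diff and a token total. The names `FailedCaseView`, `inlineDiff`, and `totalTokens` are hypothetical and are not taken from FailedCaseDrawer.tsx; the diff is a simple common-prefix/common-suffix comparison, which may differ from the algorithm the drawer actually uses.

```typescript
// Hypothetical shape of the data shown in the drawer (field names are assumptions).
interface FailedCaseView {
  expected: string;
  actual: string;
  runId: string; // identifies the originating run for the "Run trace" link
  tokenCost: { prompt: number; completion: number };
}

// Result of a minimal character-level diff: shared prefix and suffix,
// plus the differing middle of each string.
interface InlineDiff {
  prefix: string;
  removed: string; // present in expected, missing from actual
  added: string;   // present in actual, missing from expected
  suffix: string;
}

// Sketch only: trims the longest common prefix and suffix and reports
// what is left in the middle of each string.
function inlineDiff(expected: string, actual: string): InlineDiff {
  const max = Math.min(expected.length, actual.length);
  let start = 0;
  while (start < max && expected[start] === actual[start]) {
    start++;
  }
  let end = 0;
  while (
    end < max - start &&
    expected[expected.length - 1 - end] === actual[actual.length - 1 - end]
  ) {
    end++;
  }
  return {
    prefix: expected.slice(0, start),
    removed: expected.slice(start, expected.length - end),
    added: actual.slice(start, actual.length - end),
    suffix: expected.slice(expected.length - end),
  };
}

// Token cost as described in the table: prompt plus completion tokens.
function totalTokens(view: FailedCaseView): number {
  return view.tokenCost.prompt + view.tokenCost.completion;
}
```

For example, `inlineDiff("hello world", "hello brave world")` reports an empty `removed` and `"brave "` as `added`, which is the "close" outcome that the triage flow asks about.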

Triage flow

  1. Read the diff — is the actual output close, or wildly off?
  2. Open the run trace to inspect tool calls and intermediate states.
  3. Decide:
    • Update the case (assertion was wrong).
    • Update the instruction lens (prompt regression).
    • Update the model profile (model regression).
    • Update tooling (tool regression).
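
The decision in step 3 can be sketched as a lookup from diagnosed root cause to remediation. The `RootCause` type and `remediationFor` function are hypothetical names used for illustration only; the product does not necessarily expose anything like them.

```typescript
// Hypothetical labels for the four root causes listed in the triage flow.
type RootCause = "assertion" | "prompt" | "model" | "tool";

// Each root cause maps to the remediation named in step 3.
const REMEDIATION: Record<RootCause, string> = {
  assertion: "Update the case",
  prompt: "Update the instruction lens",
  model: "Update the model profile",
  tool: "Update tooling",
};

// Returns the remediation for a diagnosed root cause.
function remediationFor(cause: RootCause): string {
  return REMEDIATION[cause];
}
```

Typing the table as `Record<RootCause, string>` means the compiler flags any root cause that is added later without a matching remediation.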

Code-backed workflow

Source of truth: FailedCaseDrawer.tsx.

  1. Inspect failed evaluation output, expected result, score, and diff context.
  2. Use the drawer for diagnosis only; apply fixes in the cases, rubrics, prompts, models, or workflows themselves.
  3. Verify the fix by rerunning the suite and comparing against the baseline.
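
One way to picture the verification in step 3 is a per-case comparison of the rerun against the baseline. This is a minimal sketch under assumed data shapes: `SuiteResult` (case id mapped to pass/fail) and `compareToBaseline` are hypothetical names, not part of FailedCaseDrawer.tsx or any documented API.

```typescript
// Hypothetical result shape: case id -> whether the case passed.
type SuiteResult = Record<string, boolean>;

interface BaselineComparison {
  fixed: string[];        // failed in the baseline, passes in the rerun
  regressed: string[];    // passed in the baseline, fails in the rerun
  stillFailing: string[]; // fails in both
}

// Compares a rerun against the baseline, case by case.
// Cases that exist only in the rerun are skipped, since they have no baseline.
function compareToBaseline(
  baseline: SuiteResult,
  rerun: SuiteResult
): BaselineComparison {
  const result: BaselineComparison = { fixed: [], regressed: [], stillFailing: [] };
  for (const id of Object.keys(rerun).sort()) {
    const before = baseline[id];
    const after = rerun[id];
    if (before === undefined || after === undefined) {
      continue;
    }
    if (!before && after) {
      result.fixed.push(id);
    } else if (before && !after) {
      result.regressed.push(id);
    } else if (!before && !after) {
      result.stillFailing.push(id);
    }
  }
  return result;
}
```

A fix is only verified when the target case appears under `fixed` and `regressed` stays empty; a non-empty `regressed` list suggests the change traded one failure for another.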