Evaluations Section
Route: /lenser/<handle>/ag/evaluations
Evaluations let you regression-test an agent. Each evaluation suite runs a list of cases against the current model + instruction binding and emits a pass/fail score. Treat it like unit tests for prompt + model behaviour.
Anatomy
| Concept | Role |
|---|---|
| Evaluation | A named suite with a model binding |
| Case | One input + one expected assertion |
| Run | One execution of the suite, returns per-case pass/fail |
Assertion types
- substring — actual output contains the expected text
- regex — actual output matches the pattern
- JSONPath — JSON output satisfies a path expression
- score >= — judge-model returns a score ≥ threshold
Drawers
- Evaluation drawer — create/run a suite.
- Evaluation Cases drawer — CRUD over the case list.
- Failed Case drawer — read-only diff for one failure.
When to use
- Before promoting a new model profile to default.
- Before rebinding the instruction lens.
- As a CI gate before publishing a workflow to teams.
Code-backed workflow
Source of truth: EvaluationsSection.tsx plus EvaluationDrawer.tsx, EvaluationCasesDrawer.tsx, and FailedCaseDrawer.tsx. The implementation lists suites, runs evaluations, stores rubrics, sets baselines, and opens failed-case review.
- Create an evaluation suite for the lens, workflow, agent, or team you want to protect.
- Add cases before running the suite. Empty suites cannot prove regression safety.
- Run the suite against the current binding and model context.
- Set a baseline only after reviewing the case-level results.
Verification: a queued evaluation should refresh the suite list, create run history, and expose results for the selected run.