Evaluations

Evaluations make every agent-facing primitive measurable. An evaluation suite defines a set of test cases with expected outputs. Running the suite scores each case and produces an aggregate score that you can track over time, pin as a golden baseline, and compare against future runs to catch regressions.

Primitives

| Primitive | What it is |
| --- | --- |
| Evaluation | A named suite targeting a lens, workflow, agent, or team. Carries scoring_rules (JSONB) and a set of cases. |
| Evaluation case | A single input/expected pair with a weight and optional tags. Reusable across runs. |
| Rubric | A versioned list of scoring criteria: {name, weight, threshold, operator}. Rubrics are immutable: saving a new version bumps version and marks the prior one is_current=false. |
| Evaluation run | One execution of the suite. Scored by the lenser against the active rubric. Records status, aggregate score, started_at, completed_at, and which rubric_id was active. |
| Case result | Per-case outcome: score, output, error, passed. passed is set by the lenser when a rubric's threshold/operator is evaluated. |
| Baseline | A single "golden run" snapshot per evaluation. Shows a dashed reference line on the regression chart; completed runs display a delta vs. baseline. |

Creating an evaluation

Open Evaluations in the agent workspace → click New evaluation.

  • Simple mode — choose a rubric preset (binary_pass, scale_1–5, custom) and add cases with a form.
  • Advanced mode — paste raw JSON for scoring_rules and cases.

After creating the evaluation, click Cases to add, edit, or delete individual test cases.
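As a rough illustration of what an Advanced mode cases payload might look like, the sketch below defines an assumed case shape and a small validator. The key names (input, expected, weight, tags) are inferred from the primitives table rather than taken from a published schema, and `EvaluationCaseDraft`, `validateCasesJson`, and the sample questions are hypothetical.

```typescript
// Assumed shape of one case; key names are inferred from the primitives
// table (input, expected, weight, tags), not from a published schema.
interface EvaluationCaseDraft {
  input: unknown;
  expected: unknown;
  weight: number;
  tags?: string[];
}

// Returns a list of problems; an empty list means the payload looks usable.
function validateCasesJson(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["payload is not valid JSON"];
  }
  if (!Array.isArray(parsed)) {
    return ["payload must be a JSON array of cases"];
  }
  const problems: string[] = [];
  parsed.forEach((item: unknown, index: number) => {
    if (typeof item !== "object" || item === null) {
      problems.push(`case ${index}: must be an object`);
      return;
    }
    const c = item as Record<string, unknown>;
    if (!("input" in c)) problems.push(`case ${index}: missing input`);
    if (!("expected" in c)) problems.push(`case ${index}: missing expected`);
    // Assumption: weights are finite, non-negative numbers.
    const w = c.weight;
    if (typeof w !== "number" || !isFinite(w) || w < 0) {
      problems.push(`case ${index}: weight must be a non-negative number`);
    }
    // Tags are optional, but when present must be an array of strings.
    const tags = c.tags;
    if (tags !== undefined && !(Array.isArray(tags) && tags.every((t) => typeof t === "string"))) {
      problems.push(`case ${index}: tags must be an array of strings`);
    }
  });
  return problems;
}

// Hypothetical payload for the cases field in Advanced mode.
const exampleCases: EvaluationCaseDraft[] = [
  { input: { question: "What is 2 + 2?" }, expected: { answer: "4" }, weight: 1, tags: ["arithmetic"] },
  { input: { question: "Capital of France?" }, expected: { answer: "Paris" }, weight: 2 },
];
const exampleProblems = validateCasesJson(JSON.stringify(exampleCases));
```

Running a check like this before pasting can catch a malformed payload early; whether the product applies the same rules is not stated here.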

Building a rubric

Expand any evaluation → open the Rubric panel at the bottom.

Each criterion row has:

  • Name: human-readable label (e.g. factual_accuracy)
  • Operator: >=, <=, or ==
  • Threshold: numeric score boundary (0–1)
  • Weight: relative importance when computing the aggregate

Click Save as new version to publish. The lenser scores each run against the rubric that had is_current=true at the moment the run was queued.
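The sketch below shows one way the criterion fields could combine. It is not the lenser's actual scoring code: the weighted-mean aggregate, the all-criteria-must-pass rule, the tolerance used for ==, and the `verbosity` criterion are all assumptions made for illustration.

```typescript
type Operator = ">=" | "<=" | "==";

interface Criterion {
  name: string;
  weight: number;
  threshold: number; // expected to lie between 0 and 1
  operator: Operator;
}

// Assumption: "==" is compared with a small tolerance to absorb float noise.
const EQUALITY_TOLERANCE = 1e-9;

// Applies a criterion's operator and threshold to a single score.
function meetsThreshold(score: number, criterion: Criterion): boolean {
  if (criterion.operator === ">=") return score >= criterion.threshold;
  if (criterion.operator === "<=") return score <= criterion.threshold;
  return Math.abs(score - criterion.threshold) < EQUALITY_TOLERANCE;
}

// Assumption: the aggregate is a weighted mean, and a criterion without a
// score contributes 0.
function weightedAggregate(scores: Record<string, number>, criteria: Criterion[]): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const c of criteria) {
    const s = scores[c.name];
    const score = typeof s === "number" ? s : 0;
    weighted += c.weight * score;
    totalWeight += c.weight;
  }
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}

// Assumption: a case passes only when every criterion meets its threshold.
function casePasses(scores: Record<string, number>, criteria: Criterion[]): boolean {
  return criteria.every((c) => {
    const s = scores[c.name];
    return typeof s === "number" && meetsThreshold(s, c);
  });
}

// Hypothetical rubric and per-criterion scores.
const sampleCriteria: Criterion[] = [
  { name: "factual_accuracy", weight: 2, threshold: 0.8, operator: ">=" },
  { name: "verbosity", weight: 1, threshold: 0.5, operator: "<=" },
];
const sampleScores: Record<string, number> = { factual_accuracy: 0.9, verbosity: 0.3 };
const sampleAggregate = weightedAggregate(sampleScores, sampleCriteria); // roughly 0.7
const samplePassed = casePasses(sampleScores, sampleCriteria);
```

Under these assumptions, doubling the weight of factual_accuracy makes it count twice as much as verbosity in the aggregate, while the threshold and operator only decide pass or fail.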

Running evaluations

Manual trigger

Click Run on any evaluation card. This queues an evaluation_run with status='queued'. The lenser backend picks it up and populates case results. The UI polls every 5 seconds and updates when results arrive.
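The polling decision can be sketched as a small pure function. The status values come from the evaluation_runs schema; treating completed, failed, and cancelled as the points where polling stops is an assumption, and `nextPollDelay` is a hypothetical helper, not the UI's actual code.

```typescript
type RunStatus = "queued" | "running" | "completed" | "failed" | "cancelled";

// The UI is described as polling every 5 seconds.
const POLL_INTERVAL_MS = 5000;

// Assumption: these three statuses are final and need no further polling.
function isTerminal(status: RunStatus): boolean {
  return status === "completed" || status === "failed" || status === "cancelled";
}

// Returns the delay before the next poll, or null when polling can stop.
function nextPollDelay(status: RunStatus): number | null {
  return isTerminal(status) ? null : POLL_INTERVAL_MS;
}
```

A caller would schedule the next status fetch with the returned delay and stop once it receives null.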

Post-workflow trigger

Any time a workflow run completes with status='completed', useTeamRunDispatch calls fn_trigger_post_run_evaluations(workflow_id, team_run_id). This RPC finds all evaluations whose target_type='workflow' and target_id matches the workflow, and queues a run for each.

As a result, every successful run of a workflow that has at least one evaluation targeting it is evaluated automatically, with no manual intervention.
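The matching rule described above can be modeled in memory as follows. This is an illustrative TypeScript model of the selection logic only, not the body of the SQL function; `EvaluationRow`, `QueuedRun`, and `queueRunsForWorkflow` are hypothetical names.

```typescript
type TargetType = "lens" | "workflow" | "agent" | "team";

interface EvaluationRow {
  id: string;
  target_type: TargetType;
  target_id: string;
  name: string;
}

interface QueuedRun {
  evaluation_id: string;
  status: "queued";
}

// Selects evaluations that target the given workflow and produces one
// queued run per match. Evaluations of other target types are ignored even
// if their target_id happens to be equal.
function queueRunsForWorkflow(evaluations: EvaluationRow[], workflowId: string): QueuedRun[] {
  return evaluations
    .filter((e) => e.target_type === "workflow" && e.target_id === workflowId)
    .map((e): QueuedRun => ({ evaluation_id: e.id, status: "queued" }));
}

// Hypothetical data: two evaluations target workflow "wf-1", one targets an agent.
const sampleEvaluations: EvaluationRow[] = [
  { id: "ev-1", target_type: "workflow", target_id: "wf-1", name: "Answer quality" },
  { id: "ev-2", target_type: "agent", target_id: "wf-1", name: "Agent tone" },
  { id: "ev-3", target_type: "workflow", target_id: "wf-1", name: "Latency budget" },
];
const sampleQueued = queueRunsForWorkflow(sampleEvaluations, "wf-1");
```

A workflow with no matching evaluations yields an empty list, so its runs complete without being evaluated.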

Evaluator agent role

A workflow assignment with assignee_kind='evaluator' designates an AI agent as the evaluator for a workflow. When a workflow run completes in that agent's workspace, the agent acts as the evaluator rather than as a standard executor. To set this, open the Workflow Assignments drawer and choose evaluator from the assignee kind selector.

Baseline snapshots

After a run completes, click Baseline next to it in the run history. This sets it as the golden reference. Subsequent runs show a delta badge:

  • +0.045 in green — improved over baseline
  • −0.012 in red — regressed from baseline

The baseline score also renders as a dashed amber reference line on the regression chart.
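The delta badge amounts to a signed difference with a color. The sketch below assumes three decimal places (matching the examples above) and a neutral badge when the rounded delta is zero; the zero case is not described in this document, and `deltaBadge` is a hypothetical helper.

```typescript
interface DeltaBadge {
  text: string;
  color: "green" | "red" | "neutral";
}

// Formats the difference between a run's score and the baseline score.
function deltaBadge(score: number, baseline: number): DeltaBadge {
  const delta = score - baseline;
  const magnitude = Math.abs(delta).toFixed(3);
  // Assumption: a delta that rounds to zero is shown without a sign or color.
  if (magnitude === "0.000") {
    return { text: magnitude, color: "neutral" };
  }
  return delta > 0
    ? { text: "+" + magnitude, color: "green" }
    : { text: "\u2212" + magnitude, color: "red" }; // U+2212 is the minus sign
}

const improved = deltaBadge(0.875, 0.75); // text "+0.125", color "green"
const regressed = deltaBadge(0.5, 0.75);  // text "−0.250", color "red"
```

Because the baseline row stores the score captured when the baseline was set, the delta can be computed from that stored value without re-reading the original run.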

Regression history chart

When an evaluation has two or more scored runs, the Score history chart appears above the run list. It plots score over time as a line chart with data points. The amber dashed line marks the baseline if one is set.

Use this chart to identify:

  • Regressions after a model or prompt change
  • Improvements after rubric refinement
  • Variance in scoring across otherwise identical runs
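The rule that the chart needs two or more scored runs can be sketched as a data-preparation step. The use of completed_at as the time axis is an assumption, and `RunRow`, `ChartPoint`, and `scoreHistoryPoints` are hypothetical names.

```typescript
interface RunRow {
  id: string;
  score: number | null;
  completed_at: string | null; // ISO 8601 timestamp
}

interface ChartPoint {
  x: number; // milliseconds since the Unix epoch
  y: number; // aggregate score
}

// Builds chart points from scored runs, oldest first. Returns an empty
// list when fewer than two runs are scored, meaning the chart stays hidden.
function scoreHistoryPoints(runs: RunRow[]): ChartPoint[] {
  const points: ChartPoint[] = [];
  for (const run of runs) {
    if (run.score === null || run.completed_at === null) continue;
    points.push({ x: Date.parse(run.completed_at), y: run.score });
  }
  points.sort((a, b) => a.x - b.x);
  return points.length >= 2 ? points : [];
}

// Hypothetical run history: one run has no score yet and is skipped.
const sampleRuns: RunRow[] = [
  { id: "run-2", score: 0.7, completed_at: "2024-01-02T00:00:00Z" },
  { id: "run-3", score: null, completed_at: null },
  { id: "run-1", score: 0.8, completed_at: "2024-01-01T00:00:00Z" },
];
const samplePoints = scoreHistoryPoints(sampleRuns);
```

With the sample data, the drop from 0.8 to 0.7 between the two scored runs is the kind of change the chart is meant to surface.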

Inspecting failures

On any completed run row, click Failures to open the Failed case drawer. It shows every case where passed=false (or score < 1 when passed is null), with:

  • Full input JSON (collapsible)
  • Full expected JSON (collapsible)
  • Actual output JSON (collapsible)
  • Error text, if the lenser threw an error

Use this to diagnose which specific cases are failing and why before editing the prompt or expected output.
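The drawer's filter rule can be written as a predicate over case results. Field names follow the evaluation_case_results schema; leaving out rows where both passed and score are null is an assumption, and `isFailedCase` is a hypothetical helper.

```typescript
interface CaseResultRow {
  case_id: string;
  score: number | null;
  output: unknown;
  error: string | null;
  passed: boolean | null;
}

// A case is listed as failed when passed is false, or when passed is null
// and the score is below 1.
function isFailedCase(result: CaseResultRow): boolean {
  if (result.passed !== null) return result.passed === false;
  // Assumption: a row with neither passed nor score is not listed.
  return result.score !== null && result.score < 1;
}

// Hypothetical results for one run.
const sampleResults: CaseResultRow[] = [
  { case_id: "c1", score: 1, output: { answer: "4" }, error: null, passed: true },
  { case_id: "c2", score: 0.9, output: { answer: "5" }, error: null, passed: false },
  { case_id: "c3", score: 0.4, output: null, error: "timeout", passed: null },
  { case_id: "c4", score: 1, output: { answer: "Paris" }, error: null, passed: null },
];
const failedCaseIds = sampleResults.filter(isFailedCase).map((r) => r.case_id);
```

Note that an explicit passed value takes precedence over the score, so a case with passed=true is never listed even if its score is below 1.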

DB schema reference

agents.evaluations

| Column | Type | Notes |
| --- | --- | --- |
| id | uuid | PK |
| owner_lenser_id | uuid | Profile owning this suite |
| ai_lenser_id | uuid | Associated agent (nullable) |
| target_type | text | lens, workflow, agent, team |
| target_id | uuid | ID of the target entity |
| name | text | Display name |
| scoring_rules | jsonb | Legacy free-form rules; use rubrics for structured criteria |
| created_at | timestamptz | |

agents.evaluation_rubrics

| Column | Type | Notes |
| --- | --- | --- |
| id | uuid | PK |
| evaluation_id | uuid | FK → evaluations |
| version | integer | Monotonically increasing per evaluation |
| criteria | jsonb | Array of {name, weight, threshold, operator} |
| is_current | boolean | Only one rubric per evaluation is current at a time |
| created_at | timestamptz | |

agents.evaluation_baselines

| Column | Type | Notes |
| --- | --- | --- |
| id | uuid | PK |
| evaluation_id | uuid | FK → evaluations (UNIQUE: one baseline per evaluation) |
| run_id | uuid | FK → evaluation_runs |
| score | numeric | Captured score at the time baseline was set |
| set_at | timestamptz | |
| set_by | uuid | FK → profiles |

agents.evaluation_runs

| Column | Type | Notes |
| --- | --- | --- |
| id | uuid | PK |
| evaluation_id | uuid | FK → evaluations |
| rubric_id | uuid | Rubric active at queue time (nullable) |
| status | text | queued, running, completed, failed, cancelled |
| score | numeric | Aggregate score 0–1 |
| started_at | timestamptz | |
| completed_at | timestamptz | |

agents.evaluation_case_results

| Column | Type | Notes |
| --- | --- | --- |
| id | uuid | PK |
| evaluation_run_id | uuid | FK → evaluation_runs |
| case_id | uuid | FK → evaluation_cases |
| score | numeric | Per-case score |
| output | jsonb | Actual model output |
| error | text | Error message if case failed to execute |
| passed | boolean | Set by lenser based on rubric threshold/operator |

RPC reference

fn_run_evaluation(p_evaluation_id uuid, p_model_id uuid DEFAULT NULL)

Queues a new evaluation_run with status='queued'. Authorization: caller must own the evaluation or manage its AI lenser. Returns the new run UUID.

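```sql
-- 'eval-uuid' is a placeholder for the evaluation's UUID; NULL is the default for p_model_id.
SELECT fn_run_evaluation('eval-uuid', NULL);
```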

fn_trigger_post_run_evaluations(p_workflow_id uuid, p_team_run_id uuid)

Finds all evaluations targeting p_workflow_id (target_type='workflow') and queues a run for each via fn_run_evaluation. Called by useTeamRunDispatch after a workflow run completes. Fire-and-forget on the client side.

```sql
-- 'workflow-uuid' and 'team-run-uuid' are placeholders for the real UUIDs.
SELECT agents.fn_trigger_post_run_evaluations('workflow-uuid', 'team-run-uuid');
```
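On the client side, fire-and-forget could look like the sketch below. It does not reproduce useTeamRunDispatch: `RpcClient` is a hypothetical minimal interface standing in for whatever RPC client the app uses, `dispatchPostRunEvaluations` and `onError` are illustrative names, and how the agents schema is selected depends on that client.

```typescript
// Hypothetical minimal RPC client: takes a function name and named arguments.
type RpcClient = (fn: string, args: Record<string, string>) => PromiseLike<unknown>;

// Calls the RPC without awaiting it, so the caller continues immediately.
// Assumption: failures are reported to an optional callback and never
// propagate to the code that completed the workflow run.
function dispatchPostRunEvaluations(
  rpc: RpcClient,
  workflowId: string,
  teamRunId: string,
  onError: (reason: unknown) => void = () => {},
): void {
  rpc("fn_trigger_post_run_evaluations", {
    p_workflow_id: workflowId,
    p_team_run_id: teamRunId,
  }).then(undefined, onError);
}
```

Routing errors to a callback keeps an evaluation-queueing failure from surfacing as a workflow failure, at the cost of the caller never learning whether any runs were queued.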