EVALUATION.md — Portable eval suite

An EVALUATION declares a reusable quality benchmark: a dataset of cases, a rubric, metrics with pass thresholds, and a judging policy (rubric scoring, judge lenser, or human review). Evaluations are used by battles and by CI gates.

Filename

  • Canonical: EVALUATION.md
  • Container: evals/<slug>/EVALUATION.md (or eval/)

Required frontmatter

| Key | Type | Notes |
| --- | --- | --- |
| kind | evaluation | Discriminator |
| schema_version | number | 1 |
| id | string | Stable id (referenced by lenses via evaluation_refs[]) |
| name | string | Display name |

Common keys: rubric_ref, dataset_ref, metrics[], thresholds, judges.
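The required keys above can be checked mechanically before an evaluation is run. The following is a minimal sketch, not the tool's own validator: it assumes the frontmatter has already been parsed into a Python dict, and the names `REQUIRED_KEYS` and `frontmatter_problems` are hypothetical.

```python
# Hypothetical helper: checks the four required frontmatter keys listed above.
# Assumes `meta` is the frontmatter already parsed into a dict.
REQUIRED_KEYS = {"kind": str, "schema_version": int, "id": str, "name": str}


def frontmatter_problems(meta):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in meta:
            problems.append(f"missing required key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(meta[key], bool) or not isinstance(meta[key], expected):
            problems.append(f"{key} must be of type {expected.__name__}")
    # `kind` is the discriminator and must be the literal "evaluation".
    if meta.get("kind", "evaluation") != "evaluation":
        problems.append("kind must be 'evaluation'")
    # The table lists 1 as the schema version.
    if meta.get("schema_version", 1) != 1:
        problems.append("schema_version must be 1")
    return problems


if __name__ == "__main__":
    meta = {"kind": "evaluation", "schema_version": 1,
            "id": "evaluation_example", "name": "Research Quality Evaluation"}
    print(frontmatter_problems(meta))
```

Common keys such as rubric_ref or thresholds are deliberately not checked here, since only the four keys in the table are required.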

Required sections

  • # Purpose
  • # Dataset
  • # Metrics
  • # Judging

CLI

```bash
lenserfight evaluate ./EVALUATION.md
```

Canonical template

```yaml
---
kind: evaluation
schema_version: 1
id: evaluation_<uuid>
slug: research-quality-eval
name: Research Quality Evaluation
owner: { workspace_id: ws_<uuid> }
visibility: workspace
status: draft
version: 0.1.0
rubric_ref: rubric_research_quality
dataset_ref: dataset_research_cases_v1
metrics: [completeness, citation_quality]
---

# Purpose
What quality signal this evaluation is responsible for.

# Dataset
Describe the cases, fixtures, or benchmark dataset.

# Metrics
Define the metrics, thresholds, and pass conditions.

# Judging
Describe rubric scoring, judge agent use, and human overrides.
```
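
The Metrics section asks for thresholds and pass conditions, but this page does not specify their shape. As one possible reading, the sketch below assumes `thresholds` maps each metric name to a minimum score and that an evaluation passes only when every metric meets its minimum. The function name `gate`, the score scale, and the example numbers are all hypothetical.

```python
def gate(scores, thresholds):
    """Return (passed, failures).

    Assumption: each threshold is a minimum score, and a metric with no
    recorded score counts as a failure. `failures` maps each failing
    metric to (observed score, required minimum).
    """
    failures = {}
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None or score < minimum:
            failures[metric] = (score, minimum)
    return (not failures, failures)


if __name__ == "__main__":
    # Illustrative numbers only; metric names come from the template above.
    scores = {"completeness": 0.9, "citation_quality": 0.6}
    thresholds = {"completeness": 0.8, "citation_quality": 0.7}
    passed, failures = gate(scores, thresholds)
    print(passed, failures)
```

A CI gate could fail the build whenever `passed` is false and report `failures` so the author sees which metric fell short.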