Skip to content

Eval Files

Evaluation files define the test cases, graders, workspace lifecycle, and inline runtime block for an evaluation run. Runtime choices such as target matrices, thresholds, budgets, and repeat runs belong under top-level experiment:. Install, build, and reset commands belong under workspace.hooks; runner-specific setup belongs under targets[].hooks. AgentV supports two eval data formats: YAML and JSONL.

YAML is the canonical portable model. TypeScript helpers, generated fixtures, and Python scripts should lower to the same YAML/JSONL shapes rather than inventing a separate eval contract.

Eval YAML is AgentV’s composable and runnable authoring primitive. Use ordinary *.eval.yaml files for direct task suites and for wrapper evals that compose other suites. Raw case files are reusable data inputs, not a second runnable experiment format.

  • A task suite is eval YAML that owns task context: workspace, shared input, shared assertions, and test cases. It can run directly or be imported with type: suite.
  • A raw case file is a YAML/JSONL array, directory, or glob of cases. Import it with tests: ./cases.yaml, string shorthand, or type: tests; parent suite context applies because raw cases do not carry their own suite context.
  • A wrapper eval is eval YAML that imports one or more suites with type: suite and binds runtime policy in its inline experiment: block. Wrapper evals can live anywhere in the repo. A wrapper that imports suites with type: suite must not define parent workspace fields such as workspace, experiment.workspace, or legacy execution.workspace; imported suites own task environment.

For example, a reusable task suite can keep the task contract in one file:

evals/suites/refunds.eval.yaml
suite: refunds
workspace:
repos:
- path: ./support-app
repo: acme/support-app
commit: main
input: Answer using the refund policy in the workspace.
assertions:
- Applies the refund policy correctly
tests:
- id: missing-receipt
input: Can this customer get a refund without a receipt?

Raw cases are just case data:

evals/cases/refund-smoke.cases.yaml
- id: damaged-item
input: The item arrived damaged. What should support do?
expected_output: Offer a replacement or refund path.

A wrapper eval stays ordinary eval YAML while choosing runtime policy:

experiments/refunds-codex.eval.yaml
experiment:
name: refunds-codex
target: codex-gpt5
repeat:
count: 2
strategy: pass_all
tests:
- include: ../evals/suites/refunds.eval.yaml
type: suite
- include: ../evals/cases/refund-smoke.cases.yaml
type: tests

The experiments/ directory in that example is optional and user-owned. AgentV does not infer behavior from the path; the wrapper runs because it is eval YAML with an inline experiment: block. The wrapper owns runtime policy only. Put workspace setup in imported child suites. Parent workspace-affecting fields, including workspace, experiment.workspace, and legacy execution.workspace, are for parent-owned raw cases, including cases imported with type: tests. experiment.workspace is only a runtime mode/path override; repos, hooks, templates, Docker config, and isolation belong in top-level or case-level workspace.

The primary format. A single file contains metadata, inline runtime config, and tests:

description: Math problem solving evaluation
experiment:
target: default
assertions:
- Correctly calculates the answer
- Explains the calculation briefly
tests:
- id: addition
input: What is 15 + 27?
expected_output: "42"
FieldDescription
descriptionHuman-readable description of the evaluation
suiteOptional suite identifier
experimentRuntime policy (target, targets, workers, repeat, threshold, timeout_seconds, budget_usd, etc.)
workspaceSuite-level task environment — inline object or string path to an external workspace file. Repo entries declare identity and checkout pins; acquisition is covered in Workspace Architecture.
testsArray of individual tests, include entries, or a string path to an external file or directory. Tests and include entries may use scoped run: overrides for threshold, repeat, timeout_seconds, and budget_usd.
assertionsSuite-level graders appended to each test unless execution.skip_defaults: true is set on the test
inputSuite-level input messages prepended to each test’s input unless execution.skip_defaults: true is set on the test

workspace is what the agent can inspect or modify through tools, not prompt input. Put instructions in input; put repos, templates, and lifecycle setup in workspace.

For historical or repo-state evals, put the checkout under workspace.repos[].commit or workspace.repos[].base_commit. A commit SHA in the prompt or metadata is useful context, but it does not materialize a repo for the agent to inspect.

You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the name field is present:

FieldDescription
nameMachine-readable identifier (lowercase, hyphens, max 64 chars). Triggers metadata parsing.
descriptionHuman-readable description (max 1024 chars)
versionEval version string (e.g., "1.0")
authorAuthor or team identifier
tagsArray of string tags for categorization
licenseLicense identifier (e.g., "MIT", "Apache-2.0")
requiresDependency constraints (e.g., agentv: ">=0.30.0")
name: export-screening
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
agentv: ">=0.30.0"
tests:
- id: denied-party
criteria: Identifies denied parties correctly
input: Screen "Acme Corp" against denied parties list

The assertions field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test’s graders unless a test sets execution.skip_defaults: true. For semantic or agent-behavior checks, prefer plain assertion strings first; AgentV treats them as rubric criteria. Use deterministic assertions or code graders when the expected output is exact or requires programmatic inspection. If the assertion strings already state the grading contract, omit a duplicate criteria field on each test. Use explicit type: llm-grader entries only when you need a custom prompt, a custom grader target, or a deliberately separate grader panel.

description: API response validation
assertions:
- type: is-json
required: true
- type: contains
value: "status"
- Correctly answers the user's question
- Explains the reasoning clearly
tests:
- id: health-check
input: Check API health

assertions supports rubric shorthand strings, deterministic assertion types (contains, regex, is_json, equals), rubrics, LLM graders, and code graders. See Tests for per-test assertions usage.

Reusable assertion sets can be factored into template files and referenced from any assertions array:

assertions:
- include: safe-response
- include: ./shared/format.yaml

Resolution rules:

  • include: name resolves to .agentv/templates/{name}.yaml with the closest matching directory winning
  • Relative paths resolve from the eval file location, so include: ./shared/format.yaml works as expected
  • Nested includes are allowed up to depth 3 to keep cycles and runaway recursion bounded
  • Suite-level includes follow the same merge behavior as other suite-level assertions and still respect execution.skip_defaults: true

The input field defines messages that are prepended to every test’s input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level assertions.

description: Travel assistant evaluation
input: "Answer as a concise travel assistant."
tests: ./cases.yaml

Use a block scalar for multi-line shared instructions:

input: |
Read AGENTS.md before answering.
Explain the tradeoffs clearly.
tests: ./cases.yaml

Each test in cases.yaml only needs its own query:

- id: japan-spring
criteria: Recommends spring for cherry blossoms
input: When is the best time to visit Japan?

The effective input at runtime becomes [...suite input, ...test input].

Suite-level input accepts the same formats as test-level input:

  • String — wrapped as [{ role: "user", content: "..." }]
  • Object without a top-level role key — wrapped as structured user-message content
  • Single message object — a { role, content } object using a supported message role
  • Message array — used as-is, including system messages and file references

The top-level role key is reserved for message objects. If your structured payload needs a field named role, nest it under another key.

input:
- role: system
content: You are a careful reviewer.
- role: user
content:
- type: file
value: ./system-prompt.md

To opt out for a specific test, set execution.skip_defaults: true (same flag that skips suite-level assertions).

The input_files field provides a shorthand for attaching shared file references to every test. When a test has a string input, the suite-level files are prepended as type: file content blocks in a single user message — the same shape produced by per-test input_files.

description: Schema review evaluation
input_files:
- ./shared-context.md
- ./schema.json
tests:
- id: summarize
criteria: Summarizes the important constraints
input: Summarize the important constraints.
- id: validate
criteria: Identifies validation gaps
input: What validation is missing?

Each test’s effective input becomes a single user message with [file blocks..., text block].

Per-test input_files overrides the suite-level value (it does not merge). To opt out, set execution.skip_defaults: true on the test.

For directory-style evals, a test may omit input and keep the task prompt in Markdown instead. AgentV resolves the prompt in this order:

  1. If the effective input_files contains a file named exactly PROMPT.md, that file becomes the test prompt.
  2. Otherwise, if a PROMPT.md exists beside the EVAL.yaml, that file becomes the test prompt.
  3. Other input_files remain attachments. PROMPT.md is removed from the attachment list so the prompt is not duplicated.
agent-001-fix-bug/
EVAL.yaml
PROMPT.md
fixtures/
failing-test.log
tests:
- id: fix-bug
criteria: Fixes the regression described in the prompt
input_files:
- ./fixtures/failing-test.log

Use explicit input when the prompt is short or generated from YAML variables. Use PROMPT.md when the task text is long enough that duplicating it inside YAML would make the eval hard to review.

Instead of inlining tests in the same file, you can point tests to an external YAML or JSONL file of raw cases. This is the inverse of the sidecar pattern — the metadata file references the test data:

name: my-eval
description: My evaluation suite
experiment:
target: default
tests: ./cases.yaml

The path is resolved relative to the eval file’s directory. The external raw case file should contain a YAML array of test objects or a JSONL file with one test per line. String entries inside a tests: list work the same way and may use direct paths, directories, or globs:

tests:
- ./cases/*.cases.yaml
- include: ./suites/*.eval.yaml
type: suite

String shorthand is raw-case-only. Import reusable task suites with object entries using include: and type: suite; use type: tests when you want to drop suite context and import only raw cases.

When tests points to a directory, AgentV auto-discovers test cases from subdirectories. Each subdirectory containing a case.yaml (or case.yml) becomes a test case:

my-eval/
EVAL.yaml
cases/
fix-null-check/
case.yaml
add-greeting/
case.yaml
workspace/ # optional per-case workspace template
setup-files...
EVAL.yaml
name: my-benchmark
tests: ./cases/

Each case.yaml is a single YAML object (not an array) with the same fields as an inline test:

cases/fix-null-check/case.yaml
criteria: Fixes the null reference bug in the parser module
input: Fix the null check bug in parser.ts

Behavior:

  • Directory name as id: If case.yaml doesn’t specify an id, the directory name is used (e.g., fix-null-check)
  • Alphabetical ordering: Subdirectories are sorted alphabetically for deterministic order
  • Per-case workspace: A workspace/ subdirectory inside the case directory automatically sets workspace.template to that path, unless the case already defines a workspace field
  • Skipped directories: Subdirectories without case.yaml are skipped with a warning
  • Suite-level config applies: Suite-level assertions, input, workspace, and experiment still apply to directory-discovered cases

This pattern is useful for benchmarks with many cases, where each case benefits from its own directory for workspace templates, supporting files, or documentation. For guidance on keeping provenance metadata, patches, oracle files, and generated dataset rows out of oversized inline YAML, see Benchmark Provenance.

All string fields in eval files support ${{ VAR }} syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.

workspace:
repos:
- path: ./RepoA
repo: "${{ REPO_A_URL }}"
commit: "${{ REPO_A_COMMIT }}"
tests:
- id: test-1
input: "Evaluate the code in ${{ PROJECT_NAME }}"
criteria: "${{ EVAL_CRITERIA }}"
  • Syntax: ${{ VARIABLE_NAME }} with optional whitespace around the name
  • Missing variables resolve to an empty string
  • Partial interpolation is supported: ${{ HOME }}/repos/${{ PROJECT }} becomes /home/user/repos/myproject
  • Non-string values (numbers, booleans) are not affected
  • Interpolation is applied recursively to all nested objects and arrays
  • Works in YAML eval files, external YAML/JSONL case files, and external workspace config files
  • .env files in the directory hierarchy are loaded automatically before interpolation
# workspace.yaml — works on any machine
repos:
- path: ./my-repo
repo: "${{ MY_REPO_URL }}"
commit: "${{ MY_REPO_COMMIT }}"
.env
MY_REPO_URL=https://github.com/org/my-repo.git
MY_REPO_COMMIT=main

Eval YAML also supports per-test vars for data-driven prompt templates. Use {{name}} placeholders in test-facing text fields, and AgentV resolves them when the suite loads.

input: "Answer clearly: {{question}}"
tests:
- id: capital
vars:
question: What is the capital of France?
expected_answer: Paris
criteria: "Answers {{question}} correctly"
input:
- role: user
content: "Question: {{question}}"
expected_output: "{{expected_answer}}"
  • vars is defined per test as an object
  • {{name}} and dotted paths like {{ user.name }} are supported
  • Substitution applies to suite-level input, test input, input_files, criteria, expected_output, and conversation turn input / expected_output
  • When the whole string is a single placeholder, the original JSON value is preserved
  • Missing variables are left unchanged, so unrelated template syntax is not silently blanked out
  • vars interpolation is separate from environment interpolation: {{question}} uses test data, ${{ PROJECT_NAME }} uses environment variables

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format. Each line is a single test:

{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}

An optional YAML sidecar file provides metadata and execution config. Place it alongside the JSONL file with the same base name:

dataset.jsonl + dataset.eval.yaml:

description: Math evaluation dataset
suite: math-tests
experiment:
target: azure-base
assertions:
- name: correctness
type: llm-grader
prompt: ./graders/correctness.md
  • Streaming-friendly — process line by line
  • Git-friendly — diffs show individual case changes
  • Programmatic generation — easy to create from scripts
  • Industry standard — compatible with DeepEval, LangWatch, Hugging Face datasets

Use the convert command to switch between YAML and JSONL:

Terminal window
agentv convert evals/dataset.eval.yaml --format jsonl
agentv convert evals/dataset.jsonl --format yaml