Eval Files

Evaluation files define the test cases, graders, workspace lifecycle, and inline runtime block for an evaluation run. Runtime choices such as target matrices, thresholds, budgets, and repeat runs belong under top-level experiment:. Install, build, and reset commands belong under workspace.hooks; runner-specific setup belongs under targets[].hooks. AgentV supports two eval data formats: YAML and JSONL.

YAML is the canonical portable model. TypeScript helpers, generated fixtures, and Python scripts should lower to the same YAML/JSONL shapes rather than inventing a separate eval contract.

Authoring Shapes

Eval YAML is AgentV’s composable and runnable authoring primitive. Use ordinary *.eval.yaml files for direct task suites and for wrapper evals that compose other suites. Raw case files are reusable data inputs, not a second runnable experiment format.

A task suite is eval YAML that owns task context: workspace, shared input, shared assertions, and test cases. It can run directly or be imported with type: suite.
A raw case file is a YAML/JSONL array, directory, or glob of cases. Import it with tests: ./cases.yaml, string shorthand, or type: tests; parent suite context applies because raw cases do not carry their own suite context.
A wrapper eval is eval YAML that imports one or more suites with type: suite and binds runtime policy in its inline experiment: block. Wrapper evals can live anywhere in the repo. A wrapper that imports suites with type: suite must not define parent workspace fields such as workspace, experiment.workspace, or legacy execution.workspace; imported suites own task environment.

For example, a reusable task suite can keep the task contract in one file:

suite: refunds
workspace:
  repos:
    - path: ./support-app
      repo: acme/support-app
      commit: main
input: Answer using the refund policy in the workspace.
assertions:
  - Applies the refund policy correctly
tests:
  - id: missing-receipt
    input: Can this customer get a refund without a receipt?

Raw cases are just case data:

- id: damaged-item
  input: The item arrived damaged. What should support do?
  expected_output: Offer a replacement or refund path.

A wrapper eval stays ordinary eval YAML while choosing runtime policy:

experiment:
  name: refunds-codex
  target: codex-gpt5
  repeat:
    count: 2
    strategy: pass_all

tests:
  - include: ../evals/suites/refunds.eval.yaml
    type: suite
  - include: ../evals/cases/refund-smoke.cases.yaml
    type: tests

The ../evals/ paths in that example assume the wrapper lives in a sibling directory such as experiments/. That directory is optional and user-owned: AgentV does not infer behavior from the path, and the wrapper runs because it is eval YAML with an inline experiment: block. The wrapper owns runtime policy only, so put workspace setup in the imported child suites. Parent workspace-affecting fields (workspace, experiment.workspace, and legacy execution.workspace) are for parent-owned raw cases, including cases imported with type: tests. experiment.workspace is only a runtime mode/path override; repos, hooks, templates, Docker config, and isolation belong in top-level or case-level workspace.
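As a sketch of the parent-owned case: an eval that imports only raw cases with type: tests can carry its own workspace, because no imported suite supplies one. The experiment name refund-smoke is hypothetical; the repo entry and case path are reused from the earlier examples.

```yaml
# Hypothetical wrapper that imports raw cases only, so the parent may own the workspace.
experiment:
  name: refund-smoke        # illustrative name, not from the original examples
  target: codex-gpt5

workspace:                  # allowed here: no type: suite import is present
  repos:
    - path: ./support-app
      repo: acme/support-app
      commit: main

tests:
  - include: ../evals/cases/refund-smoke.cases.yaml
    type: tests
```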

YAML Format

YAML is the primary format. A single file contains metadata, inline runtime config, and tests:

description: Math problem solving evaluation
experiment:
  target: default

assertions:
  - Correctly calculates the answer
  - Explains the calculation briefly

tests:
  - id: addition
    input: What is 15 + 27?
    expected_output: "42"

Top-level Fields

Field	Description
`description`	Human-readable description of the evaluation
`suite`	Optional suite identifier
`experiment`	Runtime policy (`target`, `targets`, `workers`, `repeat`, `threshold`, `timeout_seconds`, `budget_usd`, etc.)
`workspace`	Suite-level task environment — inline object or string path to an external workspace file. Repo entries declare identity and checkout pins; acquisition is covered in Workspace Architecture.
`tests`	Array of individual tests, include entries, or a string path to an external file or directory. Tests and include entries may use scoped `run:` overrides for `threshold`, `repeat`, `timeout_seconds`, and `budget_usd`.
`assertions`	Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test
`input`	Suite-level input messages prepended to each test’s input unless `execution.skip_defaults: true` is set on the test
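A minimal sketch of scoped run: overrides, assuming run: sits directly on a test or include entry and accepts the same keys as the matching experiment fields. The numeric values are illustrative, and the exact value types are an assumption rather than a documented schema.

```yaml
experiment:
  target: default
  threshold: 0.8            # suite-wide default (illustrative value)
  timeout_seconds: 300

tests:
  - id: addition
    input: What is 15 + 27?
    expected_output: "42"
    run:
      timeout_seconds: 60   # assumed: tighter limit for this test only

  - include: ./suites/refunds.eval.yaml
    type: suite
    run:
      repeat:               # assumed to mirror the experiment.repeat shape
        count: 2
        strategy: pass_all
```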

workspace is what the agent can inspect or modify through tools, not prompt input. Put instructions in input; put repos, templates, and lifecycle setup in workspace.

For historical or repo-state evals, pin the checkout with workspace.repos[].commit or workspace.repos[].base_commit. A commit SHA in the prompt or metadata is useful context, but it does not materialize a repo for the agent to inspect.
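For instance, a repo-state eval might pin the checkout to a specific commit. This is only a sketch: the SHA is a placeholder rather than a real commit, and the repo entry is reused from the earlier task suite example.

```yaml
workspace:
  repos:
    - path: ./support-app
      repo: acme/support-app
      # Placeholder SHA. Quoted so YAML never reads a digit-only or
      # exponent-like SHA as a number.
      commit: "0123abc"

tests:
  - id: missing-receipt
    # Mentioning the SHA here would only add context; the pin above is
    # what gives the agent a repo to inspect.
    input: Can this customer get a refund without a receipt?
```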

Metadata Fields

You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the name field is present:

Field	Description
`name`	Machine-readable identifier (lowercase, hyphens, max 64 chars). Triggers metadata parsing.
`description`	Human-readable description (max 1024 chars)
`version`	Eval version string (e.g., `"1.0"`)
`author`	Author or team identifier
`tags`	Array of string tags for categorization
`license`	License identifier (e.g., `"MIT"`, `"Apache-2.0"`)
`requires`	Dependency constraints (e.g., `agentv: ">=0.30.0"`)

name: export-screening
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
  agentv: ">=0.30.0"

tests:
  - id: denied-party
    criteria: Identifies denied parties correctly
    input: Screen "Acme Corp" against denied parties list

Suite-level Assertions

The assertions field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test’s graders unless a test sets execution.skip_defaults: true. For semantic or agent-behavior checks, prefer plain assertion strings first; AgentV treats them as rubric criteria. Use deterministic assertions or code graders when the expected output is exact or requires programmatic inspection. If the assertion strings already state the grading contract, omit a duplicate criteria field on each test. Use explicit type: llm-grader entries only when you need a custom prompt, a custom grader target, or a deliberately separate grader panel.

description: API response validation
assertions:
  - type: is-json
    required: true
  - type: contains
    value: "status"
  - Correctly answers the user's question
  - Explains the reasoning clearly

tests:
  - id: health-check
    input: Check API health

assertions supports rubric shorthand strings, deterministic assertion types (contains, regex, is-json, equals), rubrics, LLM graders, and code graders. See Tests for per-test assertions usage.
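A sketch of the opt-out: the second test sets execution.skip_defaults: true and supplies its own assertions, so the suite-level graders are not appended to it. The plain-text-banner test and its assertion value are illustrative.

```yaml
assertions:
  - type: is-json
  - Correctly answers the user's question

tests:
  - id: health-check
    input: Check API health          # graded with both suite-level assertions

  - id: plain-text-banner            # hypothetical test
    input: Print the maintenance banner as plain text
    execution:
      skip_defaults: true            # suite-level assertions are not appended
    assertions:
      - type: contains
        value: "maintenance"
```

The same flag also skips suite-level input, so a test that opts out needs to be self-contained.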

Assertion Includes

Reusable assertion sets can be factored into template files and referenced from any assertions array:

assertions:
  - include: safe-response
  - include: ./shared/format.yaml

Resolution rules:

include: name resolves to .agentv/templates/{name}.yaml with the closest matching directory winning
Relative paths resolve from the eval file location, so include: ./shared/format.yaml works as expected
Nested includes are allowed up to depth 3 to keep cycles and runaway recursion bounded
Suite-level includes follow the same merge behavior as other suite-level assertions and still respect execution.skip_defaults: true

Suite-level Input

The input field defines messages that are prepended to every test’s input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level assertions.

description: Travel assistant evaluation
input: "Answer as a concise travel assistant."

tests: ./cases.yaml

Use a block scalar for multi-line shared instructions:

input: |
  Read AGENTS.md before answering.
  Explain the tradeoffs clearly.

tests: ./cases.yaml

Each test in cases.yaml only needs its own query:

- id: japan-spring
  criteria: Recommends spring for cherry blossoms
  input: When is the best time to visit Japan?

The effective input at runtime becomes [...suite input, ...test input].
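To make the prepend concrete, here is how the travel assistant example would resolve, assuming each string is wrapped as a single user message as described in the format list that follows. It illustrates the resolved shape and is not output captured from AgentV.

```yaml
# Effective input for test japan-spring: [...suite input, ...test input]
- role: user
  content: Answer as a concise travel assistant.   # from suite-level input
- role: user
  content: When is the best time to visit Japan?   # from the test's own input
```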

Suite-level input accepts the same formats as test-level input:

String — wrapped as [{ role: "user", content: "..." }]
Object without a top-level role key — wrapped as structured user-message content
Single message object — a { role, content } object using a supported message role
Message array — used as-is, including system messages and file references

The top-level role key is reserved for message objects. If your structured payload needs a field named role, nest it under another key.

input:
  - role: system
    content: You are a careful reviewer.
  - role: user
    content:
      - type: file
        value: ./system-prompt.md
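A sketch of the structured-object form, with an arbitrary payload shape: because the object has no top-level role key, it is treated as user-message content, and the payload's own role field is nested under a hypothetical ticket key to avoid the reserved name.

```yaml
# No top-level "role" key, so this is wrapped as structured user-message content.
input:
  ticket:                    # hypothetical wrapper key
    role: billing-admin      # payload data, not a message role
    question: Why was invoice 1042 charged twice?
```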

To opt out for a specific test, set execution.skip_defaults: true (the same flag that skips suite-level assertions).

Suite-level Input Files

The input_files field provides a shorthand for attaching shared file references to every test. When a test has a string input, the suite-level files are prepended as type: file content blocks in a single user message — the same shape produced by per-test input_files.

description: Schema review evaluation
input_files:
  - ./shared-context.md
  - ./schema.json

tests:
  - id: summarize
    criteria: Summarizes the important constraints
    input: Summarize the important constraints.
  - id: validate
    criteria: Identifies validation gaps
    input: What validation is missing?

Each test’s effective input becomes a single user message with [file blocks..., text block].

Per-test input_files overrides the suite-level value (it does not merge). To opt out, set execution.skip_defaults: true on the test.
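A sketch of the override rule, reusing the files from the example above; ./migration.sql is a hypothetical extra file. The second test lists its own input_files, so it receives only that list and none of the suite-level entries it leaves out.

```yaml
input_files:
  - ./shared-context.md
  - ./schema.json

tests:
  - id: summarize
    # Inherits both suite-level files.
    input: Summarize the important constraints.

  - id: review-migration
    input: Is this migration safe to run?
    input_files:
      - ./schema.json        # repeated on purpose: per-test lists replace, they do not merge
      - ./migration.sql      # hypothetical file
```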

PROMPT.md Fallback

For directory-style evals, a test may omit input and keep the task prompt in Markdown instead. AgentV resolves the prompt in this order:

If the effective input_files contains a file named exactly PROMPT.md, that file becomes the test prompt.
Otherwise, if a PROMPT.md exists beside the EVAL.yaml, that file becomes the test prompt.
Other input_files remain attachments. PROMPT.md is removed from the attachment list so the prompt is not duplicated.

agent-001-fix-bug/
  EVAL.yaml
  PROMPT.md
  fixtures/
    failing-test.log

tests:
  - id: fix-bug
    criteria: Fixes the regression described in the prompt
    input_files:
      - ./fixtures/failing-test.log

Use explicit input when the prompt is short or generated from YAML variables. Use PROMPT.md when the task text is long enough that duplicating it inside YAML would make the eval hard to review.
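A sketch of the first resolution rule, assuming a hypothetical ./prompts/ directory: when a file named PROMPT.md appears in input_files, it supplies the prompt and the remaining entries stay attachments.

```yaml
tests:
  - id: fix-bug
    criteria: Fixes the regression described in the prompt
    # No input field: the prompt comes from the PROMPT.md listed below.
    input_files:
      - ./prompts/PROMPT.md          # hypothetical path; becomes the prompt
      - ./fixtures/failing-test.log  # stays an attachment
```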

Raw Cases as String Paths

Instead of inlining tests in the same file, you can point tests to an external YAML or JSONL file of raw cases. This is the inverse of the sidecar pattern described under Sidecar Metadata: here the metadata file references the test data.

name: my-eval
description: My evaluation suite
experiment:
  target: default
tests: ./cases.yaml

The path is resolved relative to the eval file’s directory. The external raw case file should be either a YAML array of test objects or a JSONL file with one test per line. String entries inside a tests: list work the same way and may use direct paths, directories, or globs:

tests:
  - ./cases/*.cases.yaml
  - include: ./suites/*.eval.yaml
    type: suite

String shorthand is raw-case-only. Import reusable task suites with object entries using include: and type: suite; use type: tests when you want to drop suite context and import only raw cases.
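As a sketch of the last case, the entry below imports a suite file with type: tests. Only its cases come across, so the parent's own input and assertions apply instead of the suite's. The path and the shared text are illustrative.

```yaml
input: Answer using the refund policy in the workspace.
assertions:
  - Applies the refund policy correctly

tests:
  # Suite context in refunds.eval.yaml (workspace, input, assertions) is dropped;
  # only its test cases are imported and the parent context above applies.
  - include: ./suites/refunds.eval.yaml
    type: tests
```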

Raw Cases as Directory Paths

When tests points to a directory, AgentV auto-discovers test cases from subdirectories. Each subdirectory containing a case.yaml (or case.yml) becomes a test case:

my-eval/
  EVAL.yaml
  cases/
    fix-null-check/
      case.yaml
    add-greeting/
      case.yaml
      workspace/        # optional per-case workspace template
        setup-files...

name: my-benchmark
tests: ./cases/

Each case.yaml is a single YAML object (not an array) with the same fields as an inline test:

criteria: Fixes the null reference bug in the parser module
input: Fix the null check bug in parser.ts

Behavior:

Directory name as id: If case.yaml doesn’t specify an id, the directory name is used (e.g., fix-null-check)
Alphabetical ordering: Subdirectories are sorted alphabetically for deterministic order
Per-case workspace: A workspace/ subdirectory inside the case directory automatically sets workspace.template to that path, unless the case already defines a workspace field
Skipped directories: Subdirectories without case.yaml are skipped with a warning
Suite-level config applies: Suite-level assertions, input, workspace, and experiment still apply to directory-discovered cases

This pattern is useful for benchmarks with many cases, where each case benefits from its own directory for workspace templates, supporting files, or documentation. For guidance on keeping provenance metadata, patches, oracle files, and generated dataset rows out of oversized inline YAML, see Benchmark Provenance.
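A sketch of a case.yaml that overrides both defaults described above. The explicit id replaces the directory name, and the explicit workspace field stops the case's workspace/ subdirectory from being applied automatically. The id and the template path are hypothetical, and how a relative template path is resolved is an assumption.

```yaml
# cases/fix-null-check/case.yaml
id: parser-null-check          # overrides the directory-derived id "fix-null-check"
criteria: Fixes the null reference bug in the parser module
input: Fix the null check bug in parser.ts
workspace:
  # Hypothetical path. Defining workspace here means a sibling workspace/
  # directory is no longer picked up automatically.
  template: ../../shared/parser-workspace
```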

Environment Variable Interpolation

All string fields in eval files support ${{ VAR }} syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.

workspace:
  repos:
    - path: ./RepoA
      repo: "${{ REPO_A_URL }}"
      commit: "${{ REPO_A_COMMIT }}"

tests:
  - id: test-1
    input: "Evaluate the code in ${{ PROJECT_NAME }}"
    criteria: "${{ EVAL_CRITERIA }}"

Behavior

Syntax: ${{ VARIABLE_NAME }} with optional whitespace around the name
Missing variables resolve to an empty string
Partial interpolation is supported: with HOME=/home/user and PROJECT=myproject, ${{ HOME }}/repos/${{ PROJECT }} becomes /home/user/repos/myproject
Non-string values (numbers, booleans) are not affected
Interpolation is applied recursively to all nested objects and arrays
Works in YAML eval files, external YAML/JSONL case files, and external workspace config files
.env files in the directory hierarchy are loaded automatically before interpolation
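A sketch of how these rules play out, assuming PROJECT_NAME=billing is set and BRANCH is unset. The variable names are illustrative, and the resolved values in the comments follow from the stated rules rather than from a captured run.

```yaml
experiment:
  target: default
  workers: 4                 # non-string value: left untouched

tests:
  - id: review
    # Resolves to "Review billing on branch " because the missing BRANCH
    # becomes an empty string. Both spellings are valid, since whitespace
    # around the name is optional.
    input: "Review ${{ PROJECT_NAME }} on branch ${{BRANCH}}"
    criteria: "Mentions ${{ PROJECT_NAME }} by name"
```

Because a missing variable silently becomes empty, it may be worth checking required variables in CI before a run.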

Example: Portable Workspace Config

# workspace.yaml — works on any machine
repos:
  - path: ./my-repo
    repo: "${{ MY_REPO_URL }}"
    commit: "${{ MY_REPO_COMMIT }}"

# .env
MY_REPO_URL=https://github.com/org/my-repo.git
MY_REPO_COMMIT=main

Per-Test Template Variables

Eval YAML also supports per-test vars for data-driven prompt templates. Use {{name}} placeholders in test-facing text fields, and AgentV resolves them when the suite loads.

input: "Answer clearly: {{question}}"

tests:
  - id: capital
    vars:
      question: What is the capital of France?
      expected_answer: Paris
    criteria: "Answers {{question}} correctly"
    input:
      - role: user
        content: "Question: {{question}}"
    expected_output: "{{expected_answer}}"

Behavior

vars is defined per test as an object
{{name}} and dotted paths like {{ user.name }} are supported
Substitution applies to suite-level input, test input, input_files, criteria, expected_output, and conversation turn input / expected_output
When the whole string is a single placeholder, the original JSON value is preserved (for example, a numeric var stays a number instead of becoming a string)
Missing variables are left unchanged, so unrelated template syntax is not silently blanked out
vars interpolation is separate from environment interpolation: {{question}} uses test data, ${{ PROJECT_NAME }} uses environment variables
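A sketch that exercises the dotted-path, whole-string, and missing-variable rules together. The var names and values are hypothetical, and whether expected_output accepts a non-string value is an assumption.

```yaml
tests:
  - id: order-total
    vars:
      user:
        name: Ada
      total: 42                        # a number, not a string
    # {{ user.name }} resolves to "Ada". {{currency}} is not in vars,
    # so it is left exactly as written.
    input: "What does {{ user.name }} owe? Show {{currency}}."
    # Whole-string placeholder: the value stays the number 42, not "42".
    expected_output: "{{total}}"
```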

JSONL Format

For large-scale evaluations, AgentV supports the JSONL (JSON Lines) format. Each line is a single test:

{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}

Sidecar Metadata

An optional YAML sidecar file provides metadata and runtime config. Place it alongside the JSONL file with the same base name:

dataset.jsonl + dataset.eval.yaml:

description: Math evaluation dataset
suite: math-tests
experiment:
  target: azure-base
assertions:
  - name: correctness
    type: llm-grader
    prompt: ./graders/correctness.md

Benefits of JSONL

Streaming-friendly — process line by line
Git-friendly — diffs show individual case changes
Programmatic generation — easy to create from scripts
Industry standard — compatible with DeepEval, LangWatch, Hugging Face datasets

Converting Between Formats

Use the convert command to switch between YAML and JSONL:

agentv convert evals/dataset.eval.yaml --format jsonl
agentv convert evals/dataset.jsonl --format yaml