Experiments

AgentV eval files are the only runnable authoring artifact. Use top-level experiment: inside eval.yaml for runtime choices: targets, workers, timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy. AgentV does not have a separate experiment.yaml file, top-level run_group, or schema-significant experiments/ directory.

name: support-regression
experiment:
  targets: [codex-gpt5, claude-sonnet]
  workers: 2
  timeout_seconds: 720
  repeat:
    count: 4
    strategy: pass_at_k
    cost_limit_usd: 2.00
workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]
tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly

execution: is accepted only as a legacy top-level alias for existing eval files. Do not use both experiment: and execution: in the same eval.
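
For an existing file, the legacy spelling might look like the sketch below. It assumes execution: accepts the same fields as experiment:, which the text above implies by calling it an alias but does not spell out; the values are copied from the earlier example.

```yaml
# Legacy form: `execution:` as a top-level alias for `experiment:`.
# Assumption: the alias accepts the same fields as `experiment:`.
name: support-regression
execution:
  targets: [codex-gpt5, claude-sonnet]
  workers: 2
tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly
```

New evals should prefer experiment:, and a single file should contain only one of the two keys.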

Use directories for human organization, not schema behavior. A common layout is:

evals/
  suites/
    refunds.eval.yaml
  cases/
    refund-smoke.cases.yaml
experiments/
  refunds-codex.eval.yaml

In that layout, evals/suites/refunds.eval.yaml is a reusable task suite, evals/cases/refund-smoke.cases.yaml is raw case data, and experiments/refunds-codex.eval.yaml is a wrapper eval. The wrapper still runs only because it is eval YAML:

# experiments/refunds-codex.eval.yaml
experiment:
  name: refunds-codex
  target: codex-gpt5
  workers: 2
tests:
  - include: ../evals/suites/refunds.eval.yaml
    type: suite
  - include: ../evals/cases/refund-smoke.cases.yaml
    type: tests

The experiments/ folder is optional and user-owned. AgentV does not scan it for special files or infer runtime behavior from the path; the same wrapper eval could live under evals/wrappers/, benchmarks/, or beside the suite it runs.

Use tests[] for composition, imports, and selection.

tests:
  - include: evals/support/*.eval.yaml
    type: suite
    select:
      test_ids:
        - refund-*
        - missing-order-date
      tags: regression
      metadata:
        priority: high
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all
  - include: cases/*.cases.yaml
    type: tests
  - include: cases/regression.jsonl
    type: tests
  - cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The parent eval still owns the single run bundle and runtime policy. Child suite experiment: blocks are ignored when imported; use parent experiment: for run policy and tests[].run for scoped threshold, repeat, timeout, or budget overrides.

A parent eval that imports any type: suite entry must not define top-level workspace. Imported suites own task environment. If the parent should provide workspace context, import raw cases with type: tests or shorthand paths instead of importing an eval suite.

type: tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.
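
The workspace rule above can be sketched as a parent eval that owns the workspace and pulls in raw cases only. The include paths are hypothetical; the hook command is reused from the earlier example.

```yaml
# Parent eval supplies the workspace; imported raw cases run inside it.
# File paths are hypothetical.
workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]
tests:
  - include: cases/refund-smoke.cases.yaml
    type: tests          # raw cases only, so parent suite fields apply
  # - include: suites/refunds.eval.yaml
  #   type: suite        # not allowed alongside a top-level workspace
```

If the task environment should come from the imported suite instead, drop the top-level workspace and switch the entry to type: suite.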

tests[].select.test_ids filters imported test IDs with glob patterns. tests[].select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. tests[].select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.
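
As an illustration of the selection rules above, the sketch below pairs a hypothetical imported suite with a parent eval that selects from it. File names, tag values, and the priority values are made up for the example, and the list-valued metadata selector is assumed to match any of the listed values.

```yaml
# evals/support/refunds.eval.yaml (hypothetical imported suite)
tags: [support]
metadata:
  tags: [regression]
tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly
    metadata:
      tags: [regression, billing]   # effective tags: support, regression, billing
      priority: high
---
# Parent eval (hypothetical) selecting cases from the suite above
tests:
  - include: evals/support/*.eval.yaml
    type: suite
    select:
      test_ids: ["refund-*"]        # glob over imported test IDs
      tags: regression              # matched against the effective case tags
      metadata:
        priority: [high, urgent]    # list selector; assumed to match any listed value
```

The duplicate regression tag appears once in the effective list because suite-level tags are merged first and then deduped.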

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to include with type: tests and may point at raw case files, directories, or globs. Importing another eval suite must use object form with include: and type: suite.
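
The three forms below are a sketch of that equivalence, shown as separate YAML documents. The glob path is reused from the earlier example.

```yaml
# Shorthand: string-valued `tests`
tests: cases/smoke/*.cases.yaml
---
# Shorthand: string entry inside tests[]
tests:
  - cases/smoke/*.cases.yaml
---
# Equivalent object form
tests:
  - include: cases/smoke/*.cases.yaml
    type: tests
```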

Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks.

Imported suite rows keep their source suite metadata in index.jsonl. Use each row’s result_dir as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names.

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

test.run > tests[].run > parent experiment

experiment:
  target: agent
  threshold: 0.8
  repeat:
    count: 3
    strategy: pass_at_k
tests:
  - include: ./evals/flaky-agentic/**/*.eval.yaml
    type: suite
    select:
      tags: [agentic]
    run:
      repeat:
        count: 3
        strategy: pass_at_k
  - include: ./evals/regression/**/*.eval.yaml
    type: suite
    select:
      tags: [must-pass]
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all
  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0
      repeat:
        count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and budget_usd. Candidate-changing fields such as target and targets stay parent-level under experiment:. Workspace mutation belongs in workspace.hooks, and runner-specific setup belongs in targets[].hooks.
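
A minimal sketch of the remaining scoped fields, timeout_seconds and budget_usd, is shown below. The numeric values are hypothetical, and the target stays under the parent experiment: because scoped run: blocks cannot change it.

```yaml
experiment:
  target: agent               # candidate-changing field: parent-level only
  timeout_seconds: 720
tests:
  - include: ./evals/flaky-agentic/**/*.eval.yaml
    type: suite
    run:
      timeout_seconds: 900    # overrides the parent's 720 for this include group
      budget_usd: 1.00        # hypothetical value
  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0          # test.run wins over tests[].run and experiment
```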

experiment: configures evaluation policy. It does not own commands that prepare files, dependencies, repos, or target-specific runner state.

| Need | Put it in |
| --- | --- |
| Install dependencies, build the repo, seed files | workspace.hooks.before_all |
| Reset or apply per-case state | workspace.hooks.before_each / workspace.hooks.after_each |
| Configure an agent runner or provider variant | targets[].hooks |
| Choose targets, repeats, pass policy, budget, threshold | experiment |
| Override run workspace mode/path without changing task setup | experiment.workspace.mode / experiment.workspace.path |

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]
targets:
  - name: agent-with-skills
    provider: codex
    hooks:
      before_each:
        command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]
experiment:
  target: agent-with-skills
  repeat:
    count: 3
    strategy: pass_at_k

experiment.workspace is intentionally limited to mode and path, matching the --workspace-mode and --workspace-path CLI flags. Put repos, templates, hooks, Docker config, and isolation under top-level or case-level workspace. Wrapper evals that import child evals with type: suite must not define experiment.workspace; imported suites own the task workspace.

repeat runs each case multiple times and combines the attempt scores using one of the core repeated-attempt strategies:

experiment:
  repeat:
    count: 3
    strategy: mean
    cost_limit_usd: 1.50

Supported strategies:

| Strategy | Behavior |
| --- | --- |
| pass_at_k | Uses the best passing attempt; early-exits by default unless early_exit: false is set |
| pass_all | Uses the weakest attempt score, so every repeated attempt must meet the threshold |
| mean | Aggregates repeated attempt scores by mean |
| confidence_interval | Uses the lower bound of a 95% confidence interval as the conservative score |
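
To make the first three strategies concrete, the sketch below annotates one config with a worked example. The attempt scores are hypothetical, and it assumes an attempt passes when its score is greater than or equal to the threshold. confidence_interval is left out because the exact interval formula is not stated here.

```yaml
# Hypothetical attempt scores for one case, in order: 0.6, 0.9, 0.8
# Assumption: an attempt passes when its score is >= threshold.
experiment:
  threshold: 0.8
  repeat:
    count: 3
    strategy: pass_all       # weakest attempt: min(0.6, 0.9, 0.8) = 0.6, below 0.8
    # strategy: mean         # (0.6 + 0.9 + 0.8) / 3 ≈ 0.77, below 0.8
    # strategy: pass_at_k    # best passing attempt is 0.9; with default early exit
    #                        # the run would likely stop after attempt 2
```

The same three scores fail under pass_all and mean but pass under pass_at_k, which is why must-pass regression groups tend to pair with pass_all and flaky agentic groups with pass_at_k.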

AgentV also accepts runs and early_exit under experiment: as shorthand for repeat-run policy:

experiment:
  runs: 4
  early_exit: true

Do not set both repeat and runs in the same runtime block.

Eval runs write to the selected result group:

.agentv/results/<result-group>/<timestamp>/

CLI --experiment sets the result group explicitly. Without that flag, AgentV derives the group from the eval input: a single eval uses the eval metadata name when present or the eval filename otherwise, and multiple eval files use multi-eval. Inline experiment.name does not currently select the result group.
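
The derivation above can be sketched with a single eval file. The file name is hypothetical, and treating the top-level name as the "eval metadata name" is an assumption rather than something this page confirms.

```yaml
# refunds.eval.yaml (hypothetical file name)
name: support-regression     # assumed to be the "eval metadata name"
experiment:
  name: refunds-codex        # does not currently select the result group
  target: codex-gpt5
tests:
  - cases/smoke/*.cases.yaml
# Run alone without --experiment, results would be expected under:
#   .agentv/results/support-regression/<timestamp>/
# Without `name`, the group would come from the eval filename instead.
# With --experiment set, the flag value would be used as the group.
```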

Imported source suite metadata appears in index.jsonl rows and manifests. Use index.jsonl fields such as eval_path, test_id, target, and result_dir for identity and artifact discovery instead of reconstructing paths from suite names or wrapper layout.