Experiments

AgentV eval files are the only runnable authoring artifact. Put runtime choices (targets, workers, timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy) in the top-level experiment: block of eval.yaml. AgentV does not have a separate experiment.yaml file, top-level run_group, or schema-significant experiments/ directory.

name: support-regression

experiment:
  targets: [codex-gpt5, claude-sonnet]
  workers: 2
  timeout_seconds: 720
  repeat:
    count: 4
    strategy: pass_at_k
    cost_limit_usd: 2.00

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly

The top-level execution: key is accepted only as a legacy alias for experiment:, so that existing eval files keep working. Do not use both experiment: and execution: in the same eval.

Layout Conventions

Use directories for human organization, not schema behavior. A common layout is:

evals/
  suites/
    refunds.eval.yaml
  cases/
    refund-smoke.cases.yaml
experiments/
  refunds-codex.eval.yaml

In that layout, evals/suites/refunds.eval.yaml is a reusable task suite, evals/cases/refund-smoke.cases.yaml is raw case data, and experiments/refunds-codex.eval.yaml is a wrapper eval. The wrapper still runs only because it is eval YAML:

experiment:
  name: refunds-codex
  target: codex-gpt5
  workers: 2

tests:
  - include: ../evals/suites/refunds.eval.yaml
    type: suite
  - include: ../evals/cases/refund-smoke.cases.yaml
    type: tests

The experiments/ folder is optional and user-owned. AgentV does not scan it for special files or infer runtime behavior from the path; the same wrapper eval could live under evals/wrappers/, benchmarks/, or beside the suite it runs.

Test Imports

Use tests[] for composition, imports, and selection.

tests:
  - include: evals/support/*.eval.yaml
    type: suite
    select:
      test_ids:
        - refund-*
        - missing-order-date
      tags: regression
      metadata:
        priority: high
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all
  - include: cases/*.cases.yaml
    type: tests
  - include: cases/regression.jsonl
    type: tests
  - cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The parent eval still owns the single run bundle and runtime policy. Child suite experiment: blocks are ignored when imported; use parent experiment: for run policy and tests[].run for scoped threshold, repeat, timeout, or budget overrides.

A parent eval that imports any type: suite entry must not define top-level workspace. Imported suites own task environment. If the parent should provide workspace context, import raw cases with type: tests or shorthand paths instead of importing an eval suite.

type: tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

tests[].select.test_ids filters imported test IDs with glob patterns.

tests[].select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view.

tests[].select.metadata filters case metadata by key/value, where selector values may be scalars or lists.

Globbed include paths are resolved in deterministic path order, then test order.
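The tag merge and selector rules above can be sketched in Python. This is a minimal illustration, not AgentV's implementation: the function names and dict shapes are hypothetical, Python's fnmatch stands in for whatever glob matcher AgentV uses, and two behaviors are assumed rather than documented (selectors combine with AND, and a selector with several values matches when any one value matches).

```python
from fnmatch import fnmatchcase


def effective_tags(suite_tags, suite_metadata_tags, test_metadata_tags):
    """Merge tags suite-first and drop duplicates, keeping first-seen order."""
    merged = []
    for tag in [*suite_tags, *suite_metadata_tags, *test_metadata_tags]:
        if tag not in merged:
            merged.append(tag)
    return merged


def as_list(value):
    """Treat a scalar selector value as a one-item list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def is_selected(case, select):
    """Return True when a case passes every selector that is present.

    Assumptions (not confirmed by the source): selectors are ANDed, and a
    selector with several values matches if any one value matches.
    """
    ids = select.get("test_ids")
    if ids and not any(fnmatchcase(case["id"], pat) for pat in as_list(ids)):
        return False
    tags = select.get("tags")
    if tags and not set(as_list(tags)) & set(case["tags"]):
        return False
    for key, wanted in select.get("metadata", {}).items():
        if case["metadata"].get(key) not in as_list(wanted):
            return False
    return True


# Hypothetical case and selector, mirroring the YAML example above.
case = {
    "id": "refund-eligibility",
    "tags": effective_tags(["support"], ["regression"], ["regression", "refunds"]),
    "metadata": {"priority": "high"},
}
select = {
    "test_ids": ["refund-*", "missing-order-date"],
    "tags": "regression",
    "metadata": {"priority": "high"},
}
print(case["tags"], is_selected(case, select))
```

Keeping the merge order-preserving makes the suite-first rule visible in the output; a plain set would dedupe but lose that ordering.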

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to include with type: tests and may point at raw case files, directories, or globs. Importing another eval suite must use object form with include: and type: suite.

Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks.

Imported suite rows keep their source suite metadata in index.jsonl. Use each row’s result_dir as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names.

Scoped Run Overrides

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence, from highest to lowest, is:

test.run > tests[].run > parent experiment

experiment:
  target: agent
  threshold: 0.8
  repeat:
    count: 3
    strategy: pass_at_k

tests:
  - include: ./evals/flaky-agentic/**/*.eval.yaml
    type: suite
    select:
      tags: [agentic]
    run:
      repeat:
        count: 3
        strategy: pass_at_k

  - include: ./evals/regression/**/*.eval.yaml
    type: suite
    select:
      tags: [must-pass]
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all

  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0
      repeat:
        count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and budget_usd. Candidate-changing fields such as target and targets stay parent-level under experiment:. Workspace mutation belongs in workspace.hooks, and runner-specific setup belongs in targets[].hooks.
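The precedence rule can be read as a lookup from the most specific scope outward. The sketch below is illustrative only: resolve_run_policy is a hypothetical helper, and it assumes a scoped field replaces the inherited value wholesale, since the text does not say whether nested keys such as repeat.strategy are merged.

```python
SCOPED_FIELDS = ("threshold", "repeat", "timeout_seconds", "budget_usd")


def resolve_run_policy(experiment, include_run=None, test_run=None):
    """Resolve scoped fields with precedence test.run > tests[].run > experiment.

    Assumption: a scoped field replaces the inherited value as a whole; nested
    keys are not merged. AgentV may behave differently.
    """
    resolved = {}
    for field in SCOPED_FIELDS:
        for scope in (test_run, include_run, experiment):  # most specific first
            if scope and field in scope:
                resolved[field] = scope[field]
                break
    return resolved


experiment = {
    "target": "agent",
    "threshold": 0.8,
    "repeat": {"count": 3, "strategy": "pass_at_k"},
}
regression_run = {"threshold": 1.0, "repeat": {"count": 2, "strategy": "pass_all"}}

print(resolve_run_policy(experiment, include_run=regression_run))
```

Only the four scoped fields are resolved; target stays out of the result because candidate-changing fields are parent-level only.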

Lifecycle Ownership

experiment: configures evaluation policy. It does not own commands that prepare files, dependencies, repos, or target-specific runner state.

| Need | Put it in |
| --- | --- |
| Install dependencies, build the repo, seed files | `workspace.hooks.before_all` |
| Reset or apply per-case state | `workspace.hooks.before_each` / `workspace.hooks.after_each` |
| Configure an agent runner or provider variant | `targets[].hooks` |
| Choose targets, repeats, pass policy, budget, threshold | `experiment` |
| Override run workspace mode/path without changing task setup | `experiment.workspace.mode` / `experiment.workspace.path` |

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

targets:
  - name: agent-with-skills
    provider: codex
    hooks:
      before_each:
        command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]

experiment:
  target: agent-with-skills
  repeat:
    count: 3
    strategy: pass_at_k

experiment.workspace is intentionally limited to mode and path, matching the --workspace-mode and --workspace-path CLI flags. Put repos, templates, hooks, Docker config, and isolation under top-level or case-level workspace. Wrapper evals that import child evals with type: suite must not define experiment.workspace; imported suites own the task workspace.

Repeat Runs

repeat sets how many attempts run per test (count) and how the attempt scores are combined into one result (strategy):

experiment:
  repeat:
    count: 3
    strategy: mean
    cost_limit_usd: 1.50

Supported strategies:

| Strategy | Behavior |
| --- | --- |
| `pass_at_k` | Uses the best passing attempt; early-exits by default unless `early_exit: false` is set |
| `pass_all` | Uses the weakest attempt score, so every repeated attempt must meet the threshold |
| `mean` | Aggregates repeated attempt scores by mean |
| `confidence_interval` | Uses the lower bound of a 95% confidence interval as the conservative score |
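To make the four strategies concrete, here is a minimal Python sketch of how a list of attempt scores might collapse into one score. It is not AgentV's code. The aggregate function is hypothetical, and the confidence-interval formula (normal approximation, mean minus 1.96 standard errors) is an assumption, because the table states only that a 95% lower bound is used.

```python
from statistics import mean, stdev


def aggregate(scores, strategy):
    """Collapse repeated attempt scores into one score for a test."""
    if not scores:
        raise ValueError("at least one attempt score is required")
    if strategy == "pass_at_k":
        return max(scores)  # best attempt
    if strategy == "pass_all":
        return min(scores)  # weakest attempt must still meet the threshold
    if strategy == "mean":
        return mean(scores)
    if strategy == "confidence_interval":
        if len(scores) < 2:
            return scores[0]  # assumption: no interval from a single attempt
        # Assumption: normal approximation, mean - 1.96 * standard error.
        return mean(scores) - 1.96 * stdev(scores) / len(scores) ** 0.5
    raise ValueError(f"unknown strategy: {strategy}")


attempts = [0.5, 1.0, 0.75]
for name in ("pass_at_k", "pass_all", "mean", "confidence_interval"):
    print(name, round(aggregate(attempts, name), 3))
```

With the same attempts, pass_at_k is the most lenient and pass_all the strictest. The confidence-interval score falls below the mean, which is what makes it the conservative choice.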

AgentV also accepts runs and early_exit under experiment: as shorthand for repeat-run policy:

experiment:
  runs: 4
  early_exit: true

Do not set both repeat and runs in the same runtime block.
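The mutual-exclusion rules on this page can be collected into a lint-style check. The following is a sketch over an already-parsed eval document (a plain dict), not AgentV's validator; validate_eval and its messages are hypothetical, and it checks only the rules stated in the text.

```python
def validate_eval(doc):
    """Return a list of problems found in a parsed eval document (a dict)."""
    problems = []
    if "experiment" in doc and "execution" in doc:
        problems.append("use either experiment: or execution:, not both")

    # execution: is the legacy alias, so read whichever block is present.
    runtime = doc.get("experiment") or doc.get("execution") or {}
    if "repeat" in runtime and "runs" in runtime:
        problems.append("set either repeat or runs, not both")

    tests = doc.get("tests") or []
    if isinstance(tests, str):  # string-valued tests is raw-case shorthand
        tests = [tests]
    imports_suite = any(
        isinstance(entry, dict) and entry.get("type") == "suite" for entry in tests
    )
    if imports_suite and "workspace" in doc:
        problems.append("a parent that imports a suite must not define workspace")
    if imports_suite and "workspace" in runtime:
        problems.append(
            "a wrapper that imports a suite must not define experiment.workspace"
        )
    return problems


wrapper = {
    "experiment": {"target": "codex-gpt5", "runs": 4, "early_exit": True},
    "tests": [{"include": "../evals/suites/refunds.eval.yaml", "type": "suite"}],
}
print(validate_eval(wrapper))
```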

Result Layout

Eval runs write to the selected result group:

.agentv/results/<result-group>/<timestamp>/

CLI --experiment sets the result group explicitly. Without that flag, AgentV derives the group from the eval input: a single eval uses the eval metadata name when present or the eval filename otherwise, and multiple eval files use multi-eval. Inline experiment.name does not currently select the result group.
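The fallback order described above can be sketched as a small function. This is an illustration under stated assumptions, not AgentV's logic: derive_result_group is a hypothetical name, the metadata name is assumed to be the eval's top-level name, and the filename is returned unchanged because the text does not say whether any suffix is stripped.

```python
from pathlib import Path


def derive_result_group(eval_files, cli_experiment=None):
    """Pick a result group from (path, metadata_name_or_None) pairs."""
    if cli_experiment:
        return cli_experiment  # --experiment always wins
    if len(eval_files) > 1:
        return "multi-eval"
    path, metadata_name = eval_files[0]
    # Assumption: the filename is used as-is; the source does not say
    # whether AgentV strips the .eval.yaml suffix.
    return metadata_name or Path(path).name


print(derive_result_group([("evals/suites/refunds.eval.yaml", "support-regression")]))
```

Note that experiment.name plays no part in the function, matching the statement that it does not currently select the result group.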

Imported source suite metadata appears in index.jsonl rows and manifests. Use index.jsonl fields such as eval_path, test_id, target, and result_dir for identity and artifact discovery instead of reconstructing paths from suite names or wrapper layout.
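Reading index.jsonl directly is one way to follow that advice. The sketch below assumes index.jsonl sits at the root of the run directory and that result_dir is relative to it; neither detail is confirmed by the text, and the helper names are hypothetical. It uses only the row fields named above.

```python
import json
from pathlib import Path


def load_index(run_dir):
    """Yield one dict per non-empty line of <run_dir>/index.jsonl."""
    with open(Path(run_dir) / "index.jsonl", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def artifact_dirs(run_dir):
    """Map (eval_path, test_id, target) to that row's artifact directory.

    Assumption: result_dir is relative to the run directory. If repeated
    attempts produce several rows with the same key, the last row wins.
    """
    return {
        (row["eval_path"], row["test_id"], row["target"]): Path(run_dir) / row["result_dir"]
        for row in load_index(run_dir)
    }
```

A caller would pass the timestamped run directory, for example artifact_dirs(".agentv/results/<result-group>/<timestamp>"), and look up artifacts by key instead of building paths from suite names.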