36 changes: 35 additions & 1 deletion apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,7 @@ sidebar:
agentv eval evals/my-eval.yaml
```

Results are written to `.agentv/results/runs/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts.
Results are written to `.agentv/results/runs/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.

Each `scores[]` entry includes per-grader timing:

@@ -89,6 +89,40 @@ agentv eval evals/my-eval.yaml --output ./my-results
`--output` is a run directory, not a file path. The canonical manifest is always
`<output>/index.jsonl`.

### Generated Task Bundles

Each result can also include a generated task bundle inside its per-test artifact
directory. The bundle captures the eval slice and target settings that produced
that row, so reviewers and rerun tooling can inspect the exact run-local source
instead of relying on a mutable checkout.

Typical layout:

```text
my-results/
  index.jsonl
  benchmark.json
  <test-id>/
    grading.json
    timing.json
    input.md
    outputs/response.md
    task/
      EVAL.yaml
      targets.yaml
      files/      # copied input files when the case references them
      graders/    # copied grader prompt/script files when applicable
```

The `index.jsonl` row links to these generated paths with snake_case fields such
as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
`graders_path`. Treat those paths as relative to the run directory. When you need
a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
share the generated run directory and its `index.jsonl` manifest. Source-side
case directories are still useful for organizing bulky prompts, fixtures, or
tests while authoring an eval, but they are optional input organization rather
than a separate artifact schema.
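
As a minimal sketch of how a review script might follow those links, the Python below reads `index.jsonl` and joins each path field onto the run directory. The field names come from the paragraph above. The helper names (`iter_rows`, `resolve_row_paths`) are hypothetical, and the sketch assumes that each path field, when present, is a plain string relative to the run directory.

```python
import json
from pathlib import Path

# Path fields named in the docs above; a row may omit any of them.
PATH_FIELDS = (
    "artifact_dir",
    "task_dir",
    "eval_path",
    "targets_path",
    "files_path",
    "graders_path",
)


def iter_rows(run_dir):
    """Yield one dict per non-empty line of <run_dir>/index.jsonl."""
    with (Path(run_dir) / "index.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def resolve_row_paths(run_dir, row):
    """Map each path field present in the row to run_dir / <relative path>."""
    run_dir = Path(run_dir)
    return {
        field: run_dir / row[field]
        for field in PATH_FIELDS
        if isinstance(row.get(field), str)
    }
```

Calling `resolve_row_paths("./my-results", row)` for each row from `iter_rows("./my-results")` gives a reviewer the run-local locations without consulting the source checkout.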

### Export Additional Formats

Write additional output files alongside the artifact directory. Format is inferred from the file extension (`.jsonl`, `.json`, `.xml`, `.yaml`, `.html`):
31 changes: 17 additions & 14 deletions apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
@@ -12,7 +12,8 @@ verification commands. AgentV represents that with existing primitives:
- Put runtime behavior in `workspace`, `execution`, `input`, `expected_output`,
and `assertions`.
- Put provenance and classification in per-case `metadata`.
- Put bulky per-case artifacts in case directories and supporting files.
- Put bulky per-case authoring inputs in optional case directories and supporting files.
- Use generated run folders, not hand-authored source bundles, as the portable audit artifact.

These are documentation patterns, not special runtime schema keys. AgentV does
not interpret keys such as `source_commit`, `test_patch`, or `question_type`
@@ -64,7 +65,7 @@ on per-case metadata such as a patch path, source row, or selected test list.

## Task Artifact Anatomy

Benchmark task packs map cleanly onto AgentV fields:
Benchmark task packs map cleanly onto AgentV fields at authoring time:

| Task artifact | AgentV pattern |
|---------------|----------------|
@@ -74,14 +75,16 @@ Benchmark task packs map cleanly onto AgentV fields:
| Gold answer | `expected_output` when the answer is passive reference data |
| Active verification | `assertions`, especially `code-grader` for commands or artifact checks |
| Provenance | `tests[].metadata` with source pins, generator rows, and curation labels |
| Bulky task files | `tests: ./cases/` with per-case directories and supporting files |
| Bulky task files | Optional `tests: ./cases/` with per-case directories and supporting files |

This mirrors the common task shape used by filesystem-native benchmark harnesses:
Margin keeps each task's prompt, case metadata, tests, environment, and optional
oracle in a case directory; Terminal-Bench and Harbor keep task instructions,
container setup, run-test scripts, and result artifacts as separate files. In
AgentV, keep the same separation but bind it with eval YAML instead of adding a
large benchmark-specific schema.
Use this separation only when it makes the source eval easier to maintain. It is
not a first-class artifact schema. After an eval runs, AgentV writes the portable
audit surface into the generated run folder: each result can link from
`index.jsonl` to a run-local `task/` bundle containing `EVAL.yaml`,
`targets.yaml`, and copied `files/` or `graders/` snapshots where applicable.
Review, Dashboard files views, and rerun workflows should inspect those generated
run artifacts instead of requiring authors to maintain a parallel source-side
bundle layout. See [Generated Task Bundles](/docs/evaluation/running-evals/#generated-task-bundles).
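
A reviewer could check what a given run-local `task/` bundle actually contains with a small script. The sketch below is illustrative only: `describe_task_bundle` is a hypothetical helper, and the entry names are taken from the sentence above (`EVAL.yaml`, `targets.yaml`, `files/`, `graders/`). Treating the two YAML files as always expected is an assumption, since the docs only say the directories appear "where applicable".

```python
from pathlib import Path

# Entries the docs name for a run-local task/ bundle.
BUNDLE_FILES = ("EVAL.yaml", "targets.yaml")  # assumed to be present in every bundle
BUNDLE_DIRS = ("files", "graders")            # copied only where applicable


def describe_task_bundle(task_dir):
    """Report which documented bundle entries exist under task_dir."""
    task_dir = Path(task_dir)
    report = {name: (task_dir / name).is_file() for name in BUNDLE_FILES}
    report.update({name + "/": (task_dir / name).is_dir() for name in BUNDLE_DIRS})
    return report
```

The result is a plain dict of booleans keyed by entry name, which is easy to print or assert on during review.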

## SWE-Style Case

@@ -181,11 +184,11 @@ regeneration checks. If a hook or grader needs the source file at runtime, clone
it through `workspace.repos` or make the generator output available as a normal
fixture file.

## When to Split Into Case Directories
## Optional Source-Side Case Directories

Inline YAML is fine when a case has a short prompt, a short expected answer, and
a few metadata fields. Move away from inline YAML when the benchmark starts
accumulating task-local artifacts:
a few metadata fields. Move source inputs into case directories only when the
benchmark starts accumulating bulky authoring resources:

- The case has patches, hidden tests, oracle JSON, screenshots, reports, or
fixture files.
@@ -260,5 +263,5 @@ script.
- Keep `metadata` snake_case because it crosses process and result boundaries.
- Prefer `expected_output` for passive gold answers and `code-grader` for active
commands, file checks, or generated artifact validation.
- Prefer case directories over long inline YAML once task artifacts become part
of the benchmark contract.
- Prefer case directories over long inline YAML only for bulky source inputs;
the generated run folder remains the portable artifact contract.
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/tools/dashboard.mdx
@@ -86,7 +86,7 @@ You can also set the same field globally in `$AGENTV_HOME/config.yaml` or `~/.ag

## Run Detail

Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from.
Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from. Files and source views resolve against the generated run artifacts referenced by `index.jsonl`—including per-result task bundles when present—so Dashboard does not require authors to create a separate source-side bundle structure.

<Image src={studioRunDetail} alt="AgentV Dashboard run detail showing 100% pass rate across 5 tests with scores and duration" />

2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/tools/results.mdx
@@ -69,7 +69,7 @@ Use `results export` when you need the artifact workspace layout itself rather t
agentv results export <run-workspace-or-index.jsonl> [--out <dir>]
```

This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving.
This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving. The run workspace is also where generated task bundles live: `index.jsonl` rows may point to per-result `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` entries. Keep those generated artifacts with the run when sharing or auditing results.
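
Before sharing a run, it can help to confirm that every path the manifest references still exists alongside it. The following is a rough sketch, not part of the AgentV CLI: `find_manifest` and `missing_artifacts` are hypothetical helpers. It assumes the listed fields hold string paths relative to the run directory, and it accepts either a run workspace or an `index.jsonl` path, mirroring the `results export` argument.

```python
import json
from pathlib import Path

# Per-result path fields named in the docs; any of them may be absent.
PATH_FIELDS = ("task_dir", "eval_path", "targets_path", "files_path", "graders_path")


def find_manifest(run_or_manifest):
    """Accept a run workspace directory or an index.jsonl path; return the manifest path."""
    path = Path(run_or_manifest)
    return path / "index.jsonl" if path.is_dir() else path


def missing_artifacts(run_or_manifest):
    """List (line_number, field, relative_path) for referenced paths that do not exist."""
    manifest = find_manifest(run_or_manifest)
    run_dir = manifest.parent  # assumption: paths are relative to the run directory
    missing = []
    with manifest.open(encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            row = json.loads(line)
            for field in PATH_FIELDS:
                rel = row.get(field)
                if isinstance(rel, str) and not (run_dir / rel).exists():
                    missing.append((line_number, field, rel))
    return missing
```

An empty list means every referenced bundle path travelled with the run. Any entries point at artifacts that were left behind when the directory was copied.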

## Inspection helpers
