From f1e43a1db44b1aed3c1e2416992a1c293fb9427e Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Wed, 10 Jun 2026 12:33:48 +0200
Subject: [PATCH] docs(run): document generated task bundles

---
 .../docs/docs/evaluation/running-evals.mdx    | 36 ++++++++++++++++++-
 .../docs/docs/guides/benchmark-provenance.mdx | 31 ++++++++--------
 .../src/content/docs/docs/tools/dashboard.mdx |  2 +-
 .../src/content/docs/docs/tools/results.mdx   |  2 +-
 4 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 2bf3e561..08b053ce 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,7 @@ sidebar:
 agentv eval evals/my-eval.yaml
 ```
 
-Results are written to `.agentv/results/runs//index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts.
+Results are written to `.agentv/results/runs//index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
 
 Each `scores[]` entry includes per-grader timing:
 
@@ -89,6 +89,40 @@ agentv eval evals/my-eval.yaml --output ./my-results
 
 `--output` is a run directory, not a file path. The canonical manifest is always `/index.jsonl`.
 
+### Generated Task Bundles
+
+Each result can also include a generated task bundle inside its per-test artifact
+directory. The bundle captures the eval slice and target settings that produced
+that row, so reviewers and rerun tooling can inspect the exact run-local source
+instead of relying on a mutable checkout.
+
+Typical layout:
+
+```text
+my-results/
+  index.jsonl
+  benchmark.json
+  /
+    grading.json
+    timing.json
+    input.md
+    outputs/response.md
+    task/
+      EVAL.yaml
+      targets.yaml
+      files/    # copied input files when the case references them
+      graders/  # copied grader prompt/script files when applicable
+```
+
+The `index.jsonl` row links to these generated paths with snake_case fields such
+as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
+`graders_path`. Treat those paths as relative to the run directory. When you need
+a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
+share the generated run directory and its `index.jsonl` manifest. Source-side
+case directories are still useful for organizing bulky prompts, fixtures, or
+tests while authoring an eval, but they are optional input organization rather
+than a separate artifact schema.
+
 ### Export Additional Formats
 
 Write additional output files alongside the artifact directory. Format is inferred from the file extension (`.jsonl`, `.json`, `.xml`, `.yaml`, `.html`):
diff --git a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
index 7c07d4b7..704dc05f 100644
--- a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
+++ b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
@@ -12,7 +12,8 @@ verification commands. AgentV represents that with existing primitives:
 
 - Put runtime behavior in `workspace`, `execution`, `input`, `expected_output`, and `assertions`.
 - Put provenance and classification in per-case `metadata`.
-- Put bulky per-case artifacts in case directories and supporting files.
+- Put bulky per-case authoring inputs in optional case directories and supporting files.
+- Use generated run folders, not hand-authored source bundles, as the portable audit artifact.
 
 These are documentation patterns, not special runtime schema keys. AgentV does
 not interpret keys such as `source_commit`, `test_patch`, or `question_type`
@@ -64,7 +65,7 @@ on per-case metadata such as a patch path, source row, or selected test list.
 
 ## Task Artifact Anatomy
 
-Benchmark task packs map cleanly onto AgentV fields:
+Benchmark task packs map cleanly onto AgentV fields at authoring time:
 
 | Task artifact | AgentV pattern |
 |---------------|----------------|
@@ -74,14 +75,16 @@ Benchmark task packs map cleanly onto AgentV fields:
 | Gold answer | `expected_output` when the answer is passive reference data |
 | Active verification | `assertions`, especially `code-grader` for commands or artifact checks |
 | Provenance | `tests[].metadata` with source pins, generator rows, and curation labels |
-| Bulky task files | `tests: ./cases/` with per-case directories and supporting files |
+| Bulky task files | Optional `tests: ./cases/` with per-case directories and supporting files |
 
-This mirrors the common task shape used by filesystem-native benchmark harnesses:
-Margin keeps each task's prompt, case metadata, tests, environment, and optional
-oracle in a case directory; Terminal-Bench and Harbor keep task instructions,
-container setup, run-test scripts, and result artifacts as separate files. In
-AgentV, keep the same separation but bind it with eval YAML instead of adding a
-large benchmark-specific schema.
+Use this separation only when it makes the source eval easier to maintain. It is
+not a first-class artifact schema. After an eval runs, AgentV writes the portable
+audit surface into the generated run folder: each result can link from
+`index.jsonl` to a run-local `task/` bundle containing `EVAL.yaml`,
+`targets.yaml`, and copied `files/` or `graders/` snapshots where applicable.
+Review, Dashboard files views, and rerun workflows should inspect those generated
+run artifacts instead of requiring authors to maintain a parallel source-side
+bundle layout. See [Generated Task Bundles](/docs/evaluation/running-evals/#generated-task-bundles).
 
 ## SWE-Style Case
 
@@ -181,11 +184,11 @@ regeneration checks. If a hook or grader needs the source file at runtime, clone
 it through `workspace.repos` or make the generator output available as a normal
 fixture file.
 
-## When to Split Into Case Directories
+## Optional Source-Side Case Directories
 
 Inline YAML is fine when a case has a short prompt, a short expected answer, and
-a few metadata fields. Move away from inline YAML when the benchmark starts
-accumulating task-local artifacts:
+a few metadata fields. Move source inputs into case directories only when the
+benchmark starts accumulating bulky authoring resources:
 
 - The case has patches, hidden tests, oracle JSON, screenshots, reports, or
   fixture files.
@@ -260,5 +263,5 @@ script.
 - Keep `metadata` snake_case because it crosses process and result boundaries.
 - Prefer `expected_output` for passive gold answers and `code-grader` for active
   commands, file checks, or generated artifact validation.
-- Prefer case directories over long inline YAML once task artifacts become part
-  of the benchmark contract.
+- Prefer case directories over long inline YAML only for bulky source inputs;
+  the generated run folder remains the portable artifact contract.
diff --git a/apps/web/src/content/docs/docs/tools/dashboard.mdx b/apps/web/src/content/docs/docs/tools/dashboard.mdx
index af29b089..f1df5f57 100644
--- a/apps/web/src/content/docs/docs/tools/dashboard.mdx
+++ b/apps/web/src/content/docs/docs/tools/dashboard.mdx
@@ -86,7 +86,7 @@ You can also set the same field globally in `$AGENTV_HOME/config.yaml` or `~/.ag
 
 ## Run Detail
 
-Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from.
+Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from. Files and source views resolve against the generated run artifacts referenced by `index.jsonl`—including per-result task bundles when present—so Dashboard does not require authors to create a separate source-side bundle structure.
 
 AgentV Dashboard run detail showing 100% pass rate across 5 tests with scores and duration
 
diff --git a/apps/web/src/content/docs/docs/tools/results.mdx b/apps/web/src/content/docs/docs/tools/results.mdx
index a00d9b67..46e05ef5 100644
--- a/apps/web/src/content/docs/docs/tools/results.mdx
+++ b/apps/web/src/content/docs/docs/tools/results.mdx
@@ -69,7 +69,7 @@ Use `results export` when you need the artifact workspace layout itself rather t
 agentv results export [--out ]
 ```
 
-This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving.
+This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving. The run workspace is also where generated task bundles live: `index.jsonl` rows may point to per-result `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` entries. Keep those generated artifacts with the run when sharing or auditing results.
 
 ## Inspection helpers
 
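
The generated-bundle fields named in this patch (`artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, `graders_path`) are described as paths relative to the run directory. As a rough illustration of what that means for review tooling, the sketch below reads `index.jsonl` and joins whichever of those fields a row carries onto the run directory. It assumes each field, when present, is a plain string; the helper names `read_index` and `resolve_bundle_paths` are hypothetical and not part of AgentV.

```python
import json
from pathlib import Path

# Field names are taken from the patch text; everything else is illustrative.
PATH_FIELDS = (
    "artifact_dir",
    "task_dir",
    "eval_path",
    "targets_path",
    "files_path",
    "graders_path",
)


def read_index(run_dir):
    """Yield one dict per non-empty line of <run_dir>/index.jsonl."""
    with open(Path(run_dir) / "index.jsonl", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def resolve_bundle_paths(run_dir, row):
    """Join the path fields present in one row onto the run directory.

    Assumption: each field, when present, holds a string path that is
    relative to the run directory, as the patch describes.
    """
    base = Path(run_dir)
    return {
        field: base / row[field]
        for field in PATH_FIELDS
        if isinstance(row.get(field), str)
    }
```

Rows that carry no task bundle simply produce fewer keys, which fits the patch's wording that a result can, but need not, include one.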