From f1e43a1db44b1aed3c1e2416992a1c293fb9427e Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Wed, 10 Jun 2026 12:33:48 +0200
Subject: [PATCH] docs(run): document generated task bundles

---
 .../docs/docs/evaluation/running-evals.mdx    | 36 ++++++++++++++++++-
 .../docs/docs/guides/benchmark-provenance.mdx | 31 ++++++++--------
 .../src/content/docs/docs/tools/dashboard.mdx |  2 +-
 .../src/content/docs/docs/tools/results.mdx   |  2 +-
 4 files changed, 54 insertions(+), 17 deletions(-)
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 2bf3e561..08b053ce 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,7 @@ sidebar:
 agentv eval evals/my-eval.yaml
 ```
 
-Results are written to `.agentv/results/runs/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts.
+Results are written to `.agentv/results/runs/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
 
 Each `scores[]` entry includes per-grader timing:
 
@@ -89,6 +89,40 @@ agentv eval evals/my-eval.yaml --output ./my-results
 `--output` is a run directory, not a file path. The canonical manifest is always
 `<output>/index.jsonl`.
 
+### Generated Task Bundles
+
+Each result can also include a generated task bundle inside its per-test artifact
+directory. The bundle captures the eval slice and target settings that produced
+that row, so reviewers and rerun tooling can inspect the exact run-local source
+instead of relying on a mutable checkout.
+
+Typical layout:
+
+```text
+my-results/
+  index.jsonl
+  benchmark.json
+  <test-id>/
+    grading.json
+    timing.json
+    input.md
+    outputs/response.md
+    task/
+      EVAL.yaml
+      targets.yaml
+      files/      # copied input files when the case references them
+      graders/    # copied grader prompt/script files when applicable
+```
+
+The `index.jsonl` row links to these generated paths with snake_case fields such
+as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
+`graders_path`. Treat those paths as relative to the run directory. When you need
+a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
+share the generated run directory and its `index.jsonl` manifest. Source-side
+case directories are still useful for organizing bulky prompts, fixtures, or
+tests while authoring an eval, but they are optional input organization rather
+than a separate artifact schema.
+
 ### Export Additional Formats
 
 Write additional output files alongside the artifact directory. Format is inferred from the file extension (`.jsonl`, `.json`, `.xml`, `.yaml`, `.html`):
diff --git a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
index 7c07d4b7..704dc05f 100644
--- a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
+++ b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
@@ -12,7 +12,8 @@ verification commands. AgentV represents that with existing primitives:
 - Put runtime behavior in `workspace`, `execution`, `input`, `expected_output`,
   and `assertions`.
 - Put provenance and classification in per-case `metadata`.
-- Put bulky per-case artifacts in case directories and supporting files.
+- Put bulky per-case authoring inputs in optional case directories and supporting files.
+- Use generated run folders, not hand-authored source bundles, as the portable audit artifact.
 
 These are documentation patterns, not special runtime schema keys. AgentV does
 not interpret keys such as `source_commit`, `test_patch`, or `question_type`
@@ -64,7 +65,7 @@ on per-case metadata such as a patch path, source row, or selected test list.
 
 ## Task Artifact Anatomy
 
-Benchmark task packs map cleanly onto AgentV fields:
+Benchmark task packs map cleanly onto AgentV fields at authoring time:
 
 | Task artifact | AgentV pattern |
 |---------------|----------------|
@@ -74,14 +75,16 @@ Benchmark task packs map cleanly onto AgentV fields:
 | Gold answer | `expected_output` when the answer is passive reference data |
 | Active verification | `assertions`, especially `code-grader` for commands or artifact checks |
 | Provenance | `tests[].metadata` with source pins, generator rows, and curation labels |
-| Bulky task files | `tests: ./cases/` with per-case directories and supporting files |
+| Bulky task files | Optional `tests: ./cases/` with per-case directories and supporting files |
 
-This mirrors the common task shape used by filesystem-native benchmark harnesses:
-Margin keeps each task's prompt, case metadata, tests, environment, and optional
-oracle in a case directory; Terminal-Bench and Harbor keep task instructions,
-container setup, run-test scripts, and result artifacts as separate files. In
-AgentV, keep the same separation but bind it with eval YAML instead of adding a
-large benchmark-specific schema.
+Use this separation only when it makes the source eval easier to maintain. It is
+not a first-class artifact schema. After an eval runs, AgentV writes the portable
+audit surface into the generated run folder: each result can link from
+`index.jsonl` to a run-local `task/` bundle containing `EVAL.yaml`,
+`targets.yaml`, and copied `files/` or `graders/` snapshots where applicable.
+Review, Dashboard file views, and rerun workflows should inspect those generated
+run artifacts instead of requiring authors to maintain a parallel source-side
+bundle layout. See [Generated Task Bundles](/docs/evaluation/running-evals/#generated-task-bundles).
 
 ## SWE-Style Case
 
@@ -181,11 +184,11 @@ regeneration checks. If a hook or grader needs the source file at runtime, clone
 it through `workspace.repos` or make the generator output available as a normal
 fixture file.
 
-## When to Split Into Case Directories
+## Optional Source-Side Case Directories
 
 Inline YAML is fine when a case has a short prompt, a short expected answer, and
-a few metadata fields. Move away from inline YAML when the benchmark starts
-accumulating task-local artifacts:
+a few metadata fields. Move source inputs into case directories only when the
+benchmark starts accumulating bulky authoring resources:
 
 - The case has patches, hidden tests, oracle JSON, screenshots, reports, or
   fixture files.
@@ -260,5 +263,5 @@ script.
 - Keep `metadata` snake_case because it crosses process and result boundaries.
 - Prefer `expected_output` for passive gold answers and `code-grader` for active
   commands, file checks, or generated artifact validation.
-- Prefer case directories over long inline YAML once task artifacts become part
-  of the benchmark contract.
+- Prefer case directories over long inline YAML only for bulky source inputs;
+  the generated run folder remains the portable artifact contract.
diff --git a/apps/web/src/content/docs/docs/tools/dashboard.mdx b/apps/web/src/content/docs/docs/tools/dashboard.mdx
index af29b089..f1df5f57 100644
--- a/apps/web/src/content/docs/docs/tools/dashboard.mdx
+++ b/apps/web/src/content/docs/docs/tools/dashboard.mdx
@@ -86,7 +86,7 @@ You can also set the same field globally in `$AGENTV_HOME/config.yaml` or `~/.ag
 
 ## Run Detail
 
-Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from.
+Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from. Files and source views resolve against the generated run artifacts referenced by `index.jsonl`, including per-result task bundles when present, so Dashboard does not require authors to create a separate source-side bundle structure.
 
 <Image src={studioRunDetail} alt="AgentV Dashboard run detail showing 100% pass rate across 5 tests with scores and duration" />
 
diff --git a/apps/web/src/content/docs/docs/tools/results.mdx b/apps/web/src/content/docs/docs/tools/results.mdx
index a00d9b67..46e05ef5 100644
--- a/apps/web/src/content/docs/docs/tools/results.mdx
+++ b/apps/web/src/content/docs/docs/tools/results.mdx
@@ -69,7 +69,7 @@ Use `results export` when you need the artifact workspace layout itself rather t
 agentv results export <run-workspace-or-index.jsonl> [--out <dir>]
 ```
 
-This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving.
+This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving. The run workspace is also where generated task bundles live: `index.jsonl` rows may point to per-result `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` entries. Keep those generated artifacts with the run when sharing or auditing results.
 
 ## Inspection helpers