Skip to content

feat(authz): register agentex resources and grant agent/task ownership#270

Merged
asherfink merged 1 commit into
mainfrom
asher.fink/agx1-242-agent-dual-write
Jun 9, 2026
Merged

feat(authz): register agentex resources and grant agent/task ownership#270
asherfink merged 1 commit into
mainfrom
asher.fink/agx1-242-agent-dual-write

Conversation

@asherfink

@asherfink asherfink commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Related work

Parent ticket: AGX1-242 - AgentEx authorization dual-write.

This change is part of a 4-PR stack across 3 repos.

Repo PR Purpose
scaleapi/scaleapi scaleapi/scaleapi#146335 merged flag registry update for the shared AgentEx resources rollout flags
scaleapi/agentex scaleapi/agentex#364 agentex-auth routes each auth verb to one backend target
scaleapi/scale-agentex this PR registers agentex resources in the authorization graph and grants agent/task ownership
scaleapi/scale-agentex scaleapi/scale-agentex#271 denied direct agent access collapses to 404

Merge/deploy order: scaleapi/scaleapi#146335 is merged; deploy scaleapi/agentex#364 before this PR, then merge/deploy scaleapi/scale-agentex#271. Rollout per account is: enable fgac-agentex-resources-dual-write, backfill existing resources into Spark, then enable fgac-agentex-resources reads.

What

Registers agentex resources in the authorization graph on create and removes them on delete, separating resource lifecycle from explicit ownership:

  • Agents and tasks register in the authorization graph on create and record an explicit owner; on delete they deregister and the owner record is revoked.
  • Schedules and agent API keys register under their parent agent on create and deregister on delete. Permissions cascade from the parent agent, so no separate owner record is written.

Registration runs before the resource is persisted, so an authorization failure fails closed with no orphaned row. If the persist (or the Temporal create, for schedules) later fails, a compensating deregister runs.

Why

Giving each verb a single backend keeps the existing ownership writes (grant/revoke) unchanged and confines the new behavior to the new lifecycle interface (register/deregister). Explicit ownership is recorded only for agents and tasks; schedules and agent API keys inherit access from their parent agent, so they need no owner record of their own. Issuing the two write kinds independently keeps both backends current, so reads can cut over per account with a clean rollback.

Testing

  • Unit and integration tests covering the agent, task, schedule, and api_key authorization writes (register/deregister ordering, fail-closed-before-persist, and compensating cleanup).
  • ruff check / ruff format.

Greptile Summary

This PR adds register_resource / deregister_resource lifecycle calls to agents, tasks, schedules, and agent API keys so that each resource is tracked in the authorization graph from the moment it is created. register_resource fires before the Postgres (or Temporal) write so an auth failure fails closed with no orphaned row; a compensating deregister_resource cleans up if the downstream write later fails. Ownership writes (grant / revoke) remain at the route layer, keeping both backends current during the dual-write rollout.

  • AgentsUseCase gains _register_in_auth (called only on genuine first creates, not on updates/idempotent re-registers) and _safe_deregister (best-effort, called after soft-delete and as compensation for race conditions or persist failures).
  • ScheduleService and AgentAPIKeysUseCase already had this pattern; their docstrings and comments are updated to reflect the unified terminology.
  • principal_context is removed from agent schemas because the authorization service now derives the principal from its own injected context.

Confidence Score: 5/5

Safe to merge; register/deregister ordering is correct across all resource types and all failure paths are covered with compensating actions or best-effort swallowing.

The core invariant — register before persist so an auth failure aborts the create cleanly — holds in both AgentsUseCase create paths. Compensation on DuplicateItemError and generic Exception correctly deregisters only when a register call was actually made. The delete path deregisters after the soft-delete/Postgres delete, consistent with existing task and schedule patterns. Tests cover the full lifecycle including race conditions and the no-creator skip path.

No files require special attention.

Important Files Changed

Filename Overview
agentex/src/domain/use_cases/agents_use_case.py Adds DAuthorizationService dependency, _register_in_auth (skips when no principal, fail-closed before persist), and _safe_deregister (swallows errors). Both genuine-create paths call _register_in_auth with correct compensation on DuplicateItemError and generic Exception; update paths are correctly excluded.
agentex/src/api/routes/agents.py Removes per-resource post-fetch auth check from get_agent_by_name (pre-check dependency remains); removes principal_context forwarding; switches register_build from register_resource to grant at route level.
agentex/tests/integration/use_cases/test_agent_authz_dual_write.py New integration test covering all key scenarios: register before persist, fail-closed on register failure, compensation on persist failure, duplicate-race compensation, update path exclusion, no-creator skip, and delete deregistration.
agentex/tests/unit/api/test_tasks_authz.py Adds TestTaskDeleteAuthWrites covering delete_task and delete_task_by_name routes, verifying revoke is called with the correctly resolved task ID.
agentex/src/api/schemas/agents.py Removes principal_context field from RegisterAgentRequest and RegisterBuildRequest; reflected in openapi.yaml.

Reviews (9): Last reviewed commit: "feat(authz): register agentex resources ..." | Re-trigger Greptile

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

✱ Stainless preview builds

This PR will update the agentex-sdk SDKs with the following commit messages.

openapi

feat(api): remove principal_context field from agent models

python

feat(api): remove principal_context parameter from agents.register_build

typescript

fix(api): remove principal_context parameter from agent register build

Edit this comment to update them. They will appear in their respective SDK's changelogs.

agentex-sdk-typescript studio · code · diff

Your SDK build had at least one "warning" diagnostic, but this did not represent a regression.
generate ⚠️build ✅ (prev: build ⏭️) → lint ✅ (prev: lint ⏭️) → test ✅

npm install https://pkg.stainless.com/s/agentex-sdk-typescript/9af35c30b77687c7bb9994da378706b7856ffa79/dist.tar.gz
agentex-sdk-openapi studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅

⚠️ agentex-sdk-python studio · code · diff

Your SDK build had a failure in the lint CI job, which is a regression from the base state.
generate ⚠️build ✅ (prev: build ⏭️) → lint ❗ (prev: lint ⏭️) → test ✅

pip install https://pkg.stainless.com/s/agentex-sdk-python/16a061a6c1ca165d5dfdffecd3ad030f78d98b98/agentex_sdk-0.12.0-py3-none-any.whl

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-06-09 00:24:32 UTC

@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch from 685126a to 0d99f31 Compare June 3, 2026 15:31
@asherfink asherfink changed the title feat(agents): register agents in the authorization graph on create/delete feat(agents): register/deregister agents in the authorization graph on create/delete Jun 3, 2026
@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch from 0d99f31 to 14796e9 Compare June 3, 2026 15:35
"""Get an agent by its unique name."""
agent_entity = await agents_use_case.get(name=agent_name)

await authorization.check(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the authorization.check/revoke/grant. Otherwise, how can we maintain backward compatibility?

1, removing authorization.check will remove authz altogether.
2, removing grant/revoke will break the old authz even if the user opts out of the new FGAC

Even if a user turns FGAC on, we may still want to keep grant/revoke for permission data dual-write, so that the old authz still works in case the user turns FGAC off.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went and traced this through agentex-auth to make sure we're solid, and I don't think we need to keep grant/revoke here

tl;dr: on the legacy (SGP) backend, register_resource is grant and deregister_resource is revoke. so dropping the explicit grant/revoke and going through register/deregister is behavior-preserving for old authz, not a regression.

1. removing authorization.check removes authz altogether

it doesn't. the only check I removed is the explicit one in get_agent_by_name, but that route still has agent_name: DAuthorizedName(agent, read) in its signature, and that dependency calls authorization.check(agent, read) itself before the handler body runs (authorization_shortcuts.py). so the line I deleted was a redundant second check of the same tuple. read authz is still enforced. on create I kept the check(agent("*"), create) too, just dropped the unused principal_context= kwarg.

2 + 3. removing grant/revoke breaks old authz when FGAC is off

two things:

  • agentex-auth routes every call (grant/revoke/check/register/deregister) to exactly one provider per account based on FGAC_AGENTEX_AUTH_SPARK (_resolve_provider). flag off → legacy SGP, flag on → Spark. there's no write-to-both happening, it's a routing switch.
  • on the SGP gateway, register_resource delegates straight to grant and deregister_resource delegates to revoke

so for an FGAC-off account, register_resource(agent) → SGP → grant(owner), which is exactly what the old route did. deregister_resourcerevoke. legacy authz is fully preserved. for FGAC-on it does the proper Spark resource+owner write (tenant scoping comes from account_id automatically). either way it's correct, and keeping grant/revoke on top would just be a redundant duplicate write.

@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch from 14796e9 to 883921e Compare June 5, 2026 17:07
@asherfink asherfink changed the title feat(agents): register/deregister agents in the authorization graph on create/delete feat(agents): write agent ownership to the authorization graph on create/delete Jun 5, 2026
@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch 3 times, most recently from f1f7a42 to 7bc61d6 Compare June 8, 2026 18:30
@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch from 7bc61d6 to 1dd4143 Compare June 8, 2026 20:28
@asherfink asherfink changed the title feat(agents): write agent ownership to the authorization graph on create/delete feat(agents): write legacy grants and Spark lifecycle auth Jun 8, 2026
@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch 3 times, most recently from c00f3e9 to 4c4e73a Compare June 8, 2026 22:52
@asherfink asherfink changed the title feat(agents): write legacy grants and Spark lifecycle auth feat(authz): register agentex resources and grant agent/task ownership Jun 8, 2026
@scaleapi scaleapi deleted a comment from greptile-apps Bot Jun 8, 2026
@scaleapi scaleapi deleted a comment from greptile-apps Bot Jun 8, 2026
@asherfink

Copy link
Copy Markdown
Contributor Author

didn't mean to delete the greptile comments from^^, claude did that as they were from the old impl

Comment on lines -84 to -88
await authorization.check(
resource=AgentexResource.agent(agent_entity.id),
operation=AuthorizedOperationType.read,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this? Does it now happen at a higher layer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's moved up into the parameter dependency, so the body call was a duplicate. agent_name is typed DAuthorizedName(AgentexResourceType.agent, read), and that dependency (_ensure_authorized_name in src/utils/authorization_shortcuts.py) resolves the agent by name + runs authorization.check(agent(resolved_id), read) before the handler body executes.

get_agent_by_id directly above already works this way

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was duplicative even before this PR.

except ValueError as e:
raise HTTPException(status_code=400, detail=str(e)) from e
await authorization_service.register_resource(
await authorization_service.grant(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we do a register resource in agentex/src/domain/use_cases/agents_use_case.py and then a grant out here. Do we need to do both? Can we keep the authz checks/grants at the router layer ideally instead of nested in the business logic? Thoughts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need both, they're the two backends of the dual-write, not redundant. register_resource records the agent in the new authz graph; grant records the legacy owner edge.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure makes sense but still can we move it up from the business layer? or move this call down. Ideally all authz checks and writes happen at the highest layer possible IMO

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree on the principle for the checks. gating belongs at the edge and it already is there: the check runs in the route and in the DAuthorizedId / DAuthorizedName param dependencies (which resolve before the handler body executes), so access control happens at the highest layer.

register_resource / deregister_resource are not authz gating. they're resource-graph writes that have to be transactionally consistent with the DB persist:

  • run before the row is persisted (fail-closed → no orphaned row),
  • compensate with a deregister if the persist loses a duplicate-create race or fails,
  • fire only on a genuine create (the get-or-create logic in the use case is what knows create-vs-update; re-registering on an update would rewrite ownership).

The route only gets control after register_agent returns (post-persist), and the agent id is generated inside it. so it can't do register-before-persist or know create-vs-update without exposing the transaction internals upward. Moving it to the router would pull persistence/race logic into the route, which is worse

@asherfink asherfink enabled auto-merge (squash) June 9, 2026 00:20
@asherfink asherfink force-pushed the asher.fink/agx1-242-agent-dual-write branch from 4c4e73a to 73a5056 Compare June 9, 2026 00:21
@asherfink

Copy link
Copy Markdown
Contributor Author

rebased onto latest main

@asherfink asherfink merged commit 76353de into main Jun 9, 2026
30 checks passed
@asherfink asherfink deleted the asher.fink/agx1-242-agent-dual-write branch June 9, 2026 00:27
asherfink added a commit that referenced this pull request Jun 9, 2026
#271)

## Related work

Parent ticket: AGX1-242 - AgentEx authorization dual-write.

This change is part of a 4-PR stack across 3 repos.

| Repo | PR | Purpose |
|---|---|---|
| scaleapi/scaleapi |
[scaleapi/scaleapi#146335](scaleapi/scaleapi#146335)
| merged flag registry update for the shared AgentEx resources rollout
flags |
| scaleapi/agentex |
[scaleapi/agentex#364](scaleapi/agentex#364) |
agentex-auth routes each auth verb to one backend target |
| scaleapi/scale-agentex |
[#270](#270)
| registers agentex resources in the authorization graph and grants
agent/task ownership |
| **scaleapi/scale-agentex** | **this PR** | **denied direct agent
access collapses to 404** |

**Merge/deploy order:**
[scaleapi/scaleapi#146335](scaleapi/scaleapi#146335)
is merged; deploy
[scaleapi/agentex#364](scaleapi/agentex#364),
then merge/deploy
[#270](#270),
then this PR. Rollout per account is: enable
`fgac-agentex-resources-dual-write`, backfill existing resources into
Spark, then enable `fgac-agentex-resources` reads.

## What

Collapses denied direct agent access from 403 to 404 for agent lookups
by id, name, and query. Allowed checks pass through unchanged, and list
behavior is unchanged (the list route already filters to authorized
ids).

## Why

Returning 403 for a specific agent id or name reveals that the agent
exists; returning 404 makes "not found" and "exists but not visible to
this caller" indistinguishable. This is read-side behavior only.

## Testing

- Unit tests for the agent and task authorization route behavior.

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR extends the authorization shortcut helpers to collapse denied
agent checks from 403 to 404, preventing cross-tenant existence probing
via error-code differentiation.

- Adds `_check_agent_or_collapse_to_404` and wires it into all three
shortcuts (`DAuthorizedId`, `DAuthorizedQuery`, `DAuthorizedName`) for
`AgentexResourceType.agent`; the earlier review concern about
`DAuthorizedQuery` (`GET /events` list endpoint) is addressed in this
diff.
- Introduces 263-line unit test suite (`test_agents_authz.py`) covering
the helper directly plus each shortcut's agent path, and updates
`test_schedules_authz.py` / `test_tasks_authz.py` to reflect the new
expected behavior.

<details><summary><h3>Confidence Score: 5/5</h3></summary>

Safe to merge; all three agent-lookup shortcuts collapse denied checks
to 404 and the earlier concern about DAuthorizedQuery on the events list
endpoint is resolved in this diff.

The change is narrow and consistent: one new helper function wired into
three existing shortcuts, each path exercised by dedicated unit tests.
DAuthorizedBodyId is only used for task resources so it requires no
changes. The updated schedule and task tests correctly reflect the new
collapse behavior.

No files require special attention.
</details>

<h3>Important Files Changed</h3>

| Filename | Overview |
|----------|----------|
| agentex/src/utils/authorization_shortcuts.py | Adds
_check_agent_or_collapse_to_404 and applies it in DAuthorizedId,
DAuthorizedQuery, and DAuthorizedName; all three agent-lookup paths now
collapse AuthorizationError to 404. DAuthorizedBodyId is unaffected and
only used with task resources. |
| agentex/tests/unit/api/test_agents_authz.py | New test file covering
the helper, DAuthorizedId, DAuthorizedQuery, DAuthorizedName, and
list-filtering for agent resources; denied/absent/allowed cases are all
represented. |
| agentex/tests/unit/api/test_schedules_authz.py | Updates
TestCreateParentAgentCheck to expect ItemDoesNotExist (404) instead of
AuthorizationError (403) on a denied parent-agent check, consistent with
the new behavior in DAuthorizedId for agents. |
| agentex/tests/unit/api/test_tasks_authz.py | Renames
test_non_task_name_skips_wrap to test_agent_name_collapses_to_404 and
flips the expected exception from AuthorizationError to
ItemDoesNotExist, aligning with the agent-name collapse now in
DAuthorizedName. |

</details>

<details><summary><h3>Flowchart</h3></summary>

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Caller Request] --> B{Authorization Shortcut}

    B -->|DAuthorizedId agent| C[_check_agent_or_collapse_to_404]
    B -->|DAuthorizedName agent| D[repo.get name to id then _check_agent_or_collapse_to_404]
    B -->|DAuthorizedQuery agent| E[_check_agent_or_collapse_to_404]

    C --> F{authorization.check}
    D --> F
    E --> F

    F -->|Allowed| G[Pass through - return resource id]
    F -->|AuthorizationError denied| H[raise ItemDoesNotExist - HTTP 404]

    style H fill:#f96,color:#000
    style G fill:#6f6,color:#000
```
</details>

<sub>Reviews (10): Last reviewed commit: ["feat(agents): collapse denied
agent
rout..."](99106f8)
| [Re-trigger
Greptile](https://app.greptile.com/api/retrigger?id=35397402)</sub>

<!-- /greptile_comment -->
harvhan added a commit that referenced this pull request Jun 9, 2026
Deployed agent pods were CrashLoopBackOff-ing -- 422 on
POST /agents/register. The route is whitelisted (pods self-register on
startup with no user login), so #270's added create-check/grant ran
with a None principal, which agentex-auth rejects with 422.

Changes:
- Skip the create check and ownership grant in register_agent when no
  creator is resolvable from the principal context; an authenticated
  caller is still enforced. The agent is already owned from build time
  (register-build runs before deploy-time register).
- Align the legacy-authz integration test (which mocked agentex-auth and
  masked the crashloop) to assert no check/grant on the self-reg path,
  and add unit coverage for the resolvable-creator gate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants