feat(authz): register agentex resources and grant agent/task ownership#270
Conversation
✱ Stainless preview buildsThis PR will update the openapi python typescript Edit this comment to update them. They will appear in their respective SDK's changelogs. ✅ agentex-sdk-typescript studio · code · diff
✅ agentex-sdk-openapi studio · code · diff
|
685126a to
0d99f31
Compare
0d99f31 to
14796e9
Compare
| """Get an agent by its unique name.""" | ||
| agent_entity = await agents_use_case.get(name=agent_name) | ||
|
|
||
| await authorization.check( |
There was a problem hiding this comment.
I think we should keep the authorization.check/revoke/grant. Otherwise, how can we maintain backward compatibility?
1, removing authorization.check will remove authz altogether.
2, removing grant/revoke will break the old authz even if the user opts out of the new FGAC
Even if a user turns FGAC on, we may still want to keep grant/revoke for permission data dual-write, so that the old authz still works in case the user turns FGAC off.
There was a problem hiding this comment.
went and traced this through agentex-auth to make sure we're solid, and I don't think we need to keep grant/revoke here
tl;dr: on the legacy (SGP) backend, register_resource is grant and deregister_resource is revoke. so dropping the explicit grant/revoke and going through register/deregister is behavior-preserving for old authz, not a regression.
1. removing authorization.check removes authz altogether
it doesn't. the only check I removed is the explicit one in get_agent_by_name, but that route still has agent_name: DAuthorizedName(agent, read) in its signature, and that dependency calls authorization.check(agent, read) itself before the handler body runs (authorization_shortcuts.py). so the line I deleted was a redundant second check of the same tuple. read authz is still enforced. on create I kept the check(agent("*"), create) too, just dropped the unused principal_context= kwarg.
2 + 3. removing grant/revoke breaks old authz when FGAC is off
two things:
agentex-authroutes every call (grant/revoke/check/register/deregister) to exactly one provider per account based onFGAC_AGENTEX_AUTH_SPARK(_resolve_provider). flag off → legacy SGP, flag on → Spark. there's no write-to-both happening, it's a routing switch.- on the SGP gateway,
register_resourcedelegates straight tograntandderegister_resourcedelegates torevoke
so for an FGAC-off account, register_resource(agent) → SGP → grant(owner), which is exactly what the old route did. deregister_resource → revoke. legacy authz is fully preserved. for FGAC-on it does the proper Spark resource+owner write (tenant scoping comes from account_id automatically). either way it's correct, and keeping grant/revoke on top would just be a redundant duplicate write.
14796e9 to
883921e
Compare
f1f7a42 to
7bc61d6
Compare
7bc61d6 to
1dd4143
Compare
c00f3e9 to
4c4e73a
Compare
|
didn't mean to delete the greptile comments from^^, claude did that as they were from the old impl |
| await authorization.check( | ||
| resource=AgentexResource.agent(agent_entity.id), | ||
| operation=AuthorizedOperationType.read, | ||
| ) | ||
|
|
There was a problem hiding this comment.
Why are we removing this? Does it now happen at a higher layer?
There was a problem hiding this comment.
it's moved up into the parameter dependency, so the body call was a duplicate. agent_name is typed DAuthorizedName(AgentexResourceType.agent, read), and that dependency (_ensure_authorized_name in src/utils/authorization_shortcuts.py) resolves the agent by name + runs authorization.check(agent(resolved_id), read) before the handler body executes.
get_agent_by_id directly above already works this way
There was a problem hiding this comment.
this was duplicative even before this PR.
| except ValueError as e: | ||
| raise HTTPException(status_code=400, detail=str(e)) from e | ||
| await authorization_service.register_resource( | ||
| await authorization_service.grant( |
There was a problem hiding this comment.
So we do a register resource in agentex/src/domain/use_cases/agents_use_case.py and then a grant out here. Do we need to do both? Can we keep the authz checks/grants at the router layer ideally instead of nested in the business logic? Thoughts?
There was a problem hiding this comment.
we need both, they're the two backends of the dual-write, not redundant. register_resource records the agent in the new authz graph; grant records the legacy owner edge.
There was a problem hiding this comment.
sure makes sense but still can we move it up from the business layer? or move this call down. Ideally all authz checks and writes happen at the highest layer possible IMO
There was a problem hiding this comment.
agree on the principle for the checks. gating belongs at the edge and it already is there: the check runs in the route and in the DAuthorizedId / DAuthorizedName param dependencies (which resolve before the handler body executes), so access control happens at the highest layer.
register_resource / deregister_resource are not authz gating. they're resource-graph writes that have to be transactionally consistent with the DB persist:
- run before the row is persisted (fail-closed → no orphaned row),
- compensate with a
deregisterif the persist loses a duplicate-create race or fails, - fire only on a genuine create (the get-or-create logic in the use case is what knows create-vs-update; re-registering on an update would rewrite ownership).
The route only gets control after register_agent returns (post-persist), and the agent id is generated inside it. so it can't do register-before-persist or know create-vs-update without exposing the transaction internals upward. Moving it to the router would pull persistence/race logic into the route, which is worse
4c4e73a to
73a5056
Compare
|
rebased onto latest |
#271) ## Related work Parent ticket: AGX1-242 - AgentEx authorization dual-write. This change is part of a 4-PR stack across 3 repos. | Repo | PR | Purpose | |---|---|---| | scaleapi/scaleapi | [scaleapi/scaleapi#146335](scaleapi/scaleapi#146335) | merged flag registry update for the shared AgentEx resources rollout flags | | scaleapi/agentex | [scaleapi/agentex#364](scaleapi/agentex#364) | agentex-auth routes each auth verb to one backend target | | scaleapi/scale-agentex | [#270](#270) | registers agentex resources in the authorization graph and grants agent/task ownership | | **scaleapi/scale-agentex** | **this PR** | **denied direct agent access collapses to 404** | **Merge/deploy order:** [scaleapi/scaleapi#146335](scaleapi/scaleapi#146335) is merged; deploy [scaleapi/agentex#364](scaleapi/agentex#364), then merge/deploy [#270](#270), then this PR. Rollout per account is: enable `fgac-agentex-resources-dual-write`, backfill existing resources into Spark, then enable `fgac-agentex-resources` reads. ## What Collapses denied direct agent access from 403 to 404 for agent lookups by id, name, and query. Allowed checks pass through unchanged, and list behavior is unchanged (the list route already filters to authorized ids). ## Why Returning 403 for a specific agent id or name reveals that the agent exists; returning 404 makes "not found" and "exists but not visible to this caller" indistinguishable. This is read-side behavior only. ## Testing - Unit tests for the agent and task authorization route behavior. <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR extends the authorization shortcut helpers to collapse denied agent checks from 403 to 404, preventing cross-tenant existence probing via error-code differentiation. - Adds `_check_agent_or_collapse_to_404` and wires it into all three shortcuts (`DAuthorizedId`, `DAuthorizedQuery`, `DAuthorizedName`) for `AgentexResourceType.agent`; the earlier review concern about `DAuthorizedQuery` (`GET /events` list endpoint) is addressed in this diff. - Introduces 263-line unit test suite (`test_agents_authz.py`) covering the helper directly plus each shortcut's agent path, and updates `test_schedules_authz.py` / `test_tasks_authz.py` to reflect the new expected behavior. <details><summary><h3>Confidence Score: 5/5</h3></summary> Safe to merge; all three agent-lookup shortcuts collapse denied checks to 404 and the earlier concern about DAuthorizedQuery on the events list endpoint is resolved in this diff. The change is narrow and consistent: one new helper function wired into three existing shortcuts, each path exercised by dedicated unit tests. DAuthorizedBodyId is only used for task resources so it requires no changes. The updated schedule and task tests correctly reflect the new collapse behavior. No files require special attention. </details> <h3>Important Files Changed</h3> | Filename | Overview | |----------|----------| | agentex/src/utils/authorization_shortcuts.py | Adds _check_agent_or_collapse_to_404 and applies it in DAuthorizedId, DAuthorizedQuery, and DAuthorizedName; all three agent-lookup paths now collapse AuthorizationError to 404. DAuthorizedBodyId is unaffected and only used with task resources. | | agentex/tests/unit/api/test_agents_authz.py | New test file covering the helper, DAuthorizedId, DAuthorizedQuery, DAuthorizedName, and list-filtering for agent resources; denied/absent/allowed cases are all represented. | | agentex/tests/unit/api/test_schedules_authz.py | Updates TestCreateParentAgentCheck to expect ItemDoesNotExist (404) instead of AuthorizationError (403) on a denied parent-agent check, consistent with the new behavior in DAuthorizedId for agents. | | agentex/tests/unit/api/test_tasks_authz.py | Renames test_non_task_name_skips_wrap to test_agent_name_collapses_to_404 and flips the expected exception from AuthorizationError to ItemDoesNotExist, aligning with the agent-name collapse now in DAuthorizedName. | </details> <details><summary><h3>Flowchart</h3></summary> ```mermaid %%{init: {'theme': 'neutral'}}%% flowchart TD A[Caller Request] --> B{Authorization Shortcut} B -->|DAuthorizedId agent| C[_check_agent_or_collapse_to_404] B -->|DAuthorizedName agent| D[repo.get name to id then _check_agent_or_collapse_to_404] B -->|DAuthorizedQuery agent| E[_check_agent_or_collapse_to_404] C --> F{authorization.check} D --> F E --> F F -->|Allowed| G[Pass through - return resource id] F -->|AuthorizationError denied| H[raise ItemDoesNotExist - HTTP 404] style H fill:#f96,color:#000 style G fill:#6f6,color:#000 ``` </details> <sub>Reviews (10): Last reviewed commit: ["feat(agents): collapse denied agent rout..."](99106f8) | [Re-trigger Greptile](https://app.greptile.com/api/retrigger?id=35397402)</sub> <!-- /greptile_comment -->
Deployed agent pods were CrashLoopBackOff-ing -- 422 on POST /agents/register. The route is whitelisted (pods self-register on startup with no user login), so #270's added create-check/grant ran with a None principal, which agentex-auth rejects with 422. Changes: - Skip the create check and ownership grant in register_agent when no creator is resolvable from the principal context; an authenticated caller is still enforced. The agent is already owned from build time (register-build runs before deploy-time register). - Align the legacy-authz integration test (which mocked agentex-auth and masked the crashloop) to assert no check/grant on the self-reg path, and add unit coverage for the resolvable-creator gate.
Related work
Parent ticket: AGX1-242 - AgentEx authorization dual-write.
This change is part of a 4-PR stack across 3 repos.
Merge/deploy order: scaleapi/scaleapi#146335 is merged; deploy scaleapi/agentex#364 before this PR, then merge/deploy scaleapi/scale-agentex#271. Rollout per account is: enable
fgac-agentex-resources-dual-write, backfill existing resources into Spark, then enablefgac-agentex-resourcesreads.What
Registers agentex resources in the authorization graph on create and removes them on delete, separating resource lifecycle from explicit ownership:
Registration runs before the resource is persisted, so an authorization failure fails closed with no orphaned row. If the persist (or the Temporal create, for schedules) later fails, a compensating deregister runs.
Why
Giving each verb a single backend keeps the existing ownership writes (
grant/revoke) unchanged and confines the new behavior to the new lifecycle interface (register/deregister). Explicit ownership is recorded only for agents and tasks; schedules and agent API keys inherit access from their parent agent, so they need no owner record of their own. Issuing the two write kinds independently keeps both backends current, so reads can cut over per account with a clean rollback.Testing
ruff check/ruff format.Greptile Summary
This PR adds
register_resource/deregister_resourcelifecycle calls to agents, tasks, schedules, and agent API keys so that each resource is tracked in the authorization graph from the moment it is created.register_resourcefires before the Postgres (or Temporal) write so an auth failure fails closed with no orphaned row; a compensatingderegister_resourcecleans up if the downstream write later fails. Ownership writes (grant/revoke) remain at the route layer, keeping both backends current during the dual-write rollout._register_in_auth(called only on genuine first creates, not on updates/idempotent re-registers) and_safe_deregister(best-effort, called after soft-delete and as compensation for race conditions or persist failures).principal_contextis removed from agent schemas because the authorization service now derives the principal from its own injected context.Confidence Score: 5/5
Safe to merge; register/deregister ordering is correct across all resource types and all failure paths are covered with compensating actions or best-effort swallowing.
The core invariant — register before persist so an auth failure aborts the create cleanly — holds in both AgentsUseCase create paths. Compensation on DuplicateItemError and generic Exception correctly deregisters only when a register call was actually made. The delete path deregisters after the soft-delete/Postgres delete, consistent with existing task and schedule patterns. Tests cover the full lifecycle including race conditions and the no-creator skip path.
No files require special attention.
Important Files Changed
Reviews (9): Last reviewed commit: "feat(authz): register agentex resources ..." | Re-trigger Greptile