feat(retention): scheduled task-retention cleanup workflow#272
Merged
Conversation
Adds RETENTION_CLEANUP_ENABLED, RETENTION_CLEANUP_AGENT_ALLOWLIST, RETENTION_CLEANUP_IDLE_DAYS, RETENTION_CLEANUP_CRON, RETENTION_CLEANUP_PAGE_SIZE, and RETENTION_CLEANUP_MAX_IN_FLIGHT to EnvVarKeys, EnvironmentVariables, and the refresh() parser, following the existing ENABLE_HEALTH_CHECK_WORKFLOW pattern. Includes unit tests covering both explicit values and defaults.
…o schedule Add a load_cleanup_config activity that reads RETENTION_CLEANUP_* env vars at sweep runtime so changing policy and restarting the worker takes effect on the next scheduled run without recreating the Temporal Schedule. The schedule now carries no policy args — only cron + workflow identity. The sweep loads config once on the first page and carries it across continue_as_new pages for consistency. The allowlist gate in the schedule bootstrap is removed; fail-closed is preserved at runtime (empty allowlist → discovery returns no candidates).
Merge the health-check and retention-cleanup Temporal workers into one process/container. Both sets of workflows and activities are now registered on the same `agentex-server` task queue worker, preventing tasks from landing on a worker that has no handler for their type. Remove the separate `agentex-retention-cleanup-worker` compose service and `run_retention_cleanup_worker_main` entrypoint; add the four runtime-policy env vars to the surviving `agentex-temporal-worker` service.
…runtime enabled gate
- carry the enabled flag across continue_as_new pages so the runtime kill switch stays consistent with the other carried policy fields - scope child workflow ids with the sweep run id to avoid collisions under a REJECT_DUPLICATE workflow-id-reuse policy - add a task_repository property to TaskRetentionUseCase instead of reaching through retention_service internals in the worker
7bd0643 to
45802a7
Compare
Collaborator
Author
|
discussed and this is just going in as-is since the difference in having another agentex member stamp is negligible and this change should be a no-op on deployment as any clean up is gated behind an enabled as well as a dry run flag |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a regularly scheduled sweep that cleans up long-lived task/chat data for idle tasks, to support project-data-isolation retention requirements. A Temporal Schedule fires a
RetentionCleanupSweepWorkflowthat discovers idle tasks belonging to an allowlisted set of agents and runs the existing, idempotentTaskRetentionUseCase.clean_taskpath against each one.cleaned_at IS NULL AND updated_at < cutoff, joined to an agent-name allowlist, keyset-paginated. Deliberately no status filter — status is race-prone, so the authoritative RUNNING/idle/unprocessed-events checks stay insideclean_taskat clean-time.updated_at < cutoffis a correct superset of truly-idle tasks (never under-includes).RetentionCleanupTaskWorkflowchild per task (bounded bymax_in_flight), aggregated into cleaned/skipped/failed totals;continue_as_newper page bounds history.clean_task'sClientErrorrefusals to askippedoutcome (so one task never aborts the sweep); genuine transient errors propagate and are retried by Temporal.load_cleanup_configactivity at sweep run time (carried across pages), so changing the allowlist is an env edit + worker restart — no schedule recreation. Cron remains a schedule property.RETENTION_CLEANUP_ENABLED(off by default); fail-closed empty allowlist cleans nothing; clean-only (no export sink in this iteration).Reuses the already-merged export/clean/rehydrate path — no duplicated deletion logic.
Config (env vars)
RETENTION_CLEANUP_ENABLED(default false),RETENTION_CLEANUP_AGENT_ALLOWLIST(CSV of agent names, empty ⇒ no-op),RETENTION_CLEANUP_IDLE_DAYS(7),RETENTION_CLEANUP_CRON(0 4 * * *),RETENTION_CLEANUP_PAGE_SIZE(200),RETENTION_CLEANUP_MAX_IN_FLIGHT(20).Test Plan
Notes / follow-ups
agentex-servertask queue in the deployed environment for the schedule to fire. This PR wires the worker for local docker-compose only.Greptile Summary
This PR introduces a scheduled Temporal sweep that discovers idle tasks belonging to an agent allowlist and runs the existing idempotent
TaskRetentionUseCase.clean_taskpath against each one, with a safe dry-run default and acontinue_as_newkeyset-pagination pattern to keep workflow history bounded.list_cleanup_candidate_idsuses a DISTINCT keyset-paginated query (no status filter, by design) to find candidates;RetentionCleanupSweepWorkflowfans out oneRetentionCleanupTaskWorkflowchild per task (bounded bymax_in_flight), aggregates cleaned/skipped/failed totals across pages, and skips multi-agent tasks via a preflight query.load_cleanup_configreads allRETENTION_CLEANUP_*env vars at sweep time so allowlist/idle-days changes take effect on the next run without schedule recreation; the cron expression is the only property baked into the Temporal Schedule.RETENTION_CLEANUP_ENABLED=false), dry-run is on by default (RETENTION_CLEANUP_DRY_RUN=true), and an empty allowlist is fail-closed (returns no candidates).Confidence Score: 4/5
Safe to merge after addressing the DRY_RUN boolean comparison; all deletion paths default to no-op and the feature is off by default.
The boolean parsing for RETENTION_CLEANUP_DRY_RUN is case-sensitive: an operator who sets the env var to 'True' (capital T — common from YAML tools and Python config generators) gets dry_run=False, enabling live deletions while believing they are in dry-run mode. Every other aspect of the PR is well-structured: the sweep logic is correct, the fail-closed defaults are right, the tests are thorough, and the keyset pagination is safe.
agentex/src/config/environment_variables.py — the RETENTION_CLEANUP_DRY_RUN boolean comparison; agentex/src/domain/services/task_retention_service.py — preview_clean_task duplicates the clean_task safety guards.
Important Files Changed
Prompt To Fix All With AI
Reviews (8): Last reviewed commit: "fix(retention): default cleanup dry-run ..." | Re-trigger Greptile