feat(supervisor): add opt-in dequeue backpressure#3836
Conversation
Cached, fail-open monitor that decides whether to skip dequeues based on a pluggable signal source. Disabled is a total no-op (no refresh, no reads). The hot-path read is synchronous and never performs I/O; every failure mode (source throws, returns null, or verdict goes stale) fails open.
Reads the backpressure verdict from a Redis key (written by the cluster-side aggregator). Malformed or wrong-shaped values are treated as unknown so the monitor fails open. Adds @internal/redis + @internal/testcontainers deps.
Gate dequeues on the backpressure verdict via the existing preDequeue hook, on all paths including k8s (where the resource monitor is a no-op). Construct the Redis-backed monitor only when TRIGGER_DEQUEUE_BACKPRESSURE_ENABLED is set; require a Redis host when enabled. Off by default - no Redis client, no effect.
isEngaged() exposes the hard backpressure state (drives scale-up freeze), while shouldSkipDequeue() additionally ramps after release - skipping a linearly-decaying fraction of attempts over rampMs so the aggregate dequeue rate climbs back to full instead of snapping and re-flooding the cluster.
Add optional shouldPauseScaling to ScalingOptions; when it returns true the pool stops scaling up (scale-down still allowed), so it won't add consumers to drain a queue backpressure is deliberately holding.
Pass shouldPauseScaling (monitor.isEngaged) into the consumer pool so scale-up freezes while hard-engaged, and feed TRIGGER_DEQUEUE_BACKPRESSURE_RAMP_MS into the monitor's post-release ramp. Off by default.
Dry-run (default on via env) keeps the gates inert while computeEngaged still reflects the real signal and verdict transitions are logged. Adds BackpressureMetrics (engaged/dry_run gauges, skipped-dequeues counter).
🦋 Changeset detectedLatest commit: 44842f9 The changes in this PR will be included in the next version bump. This PR includes changesets to release 25 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (49)
WalkthroughThis pull request implements a Redis-backed dequeue backpressure system for the supervisor. When enabled, the supervisor periodically reads a backpressure verdict from Redis and uses it to pause consumer scale-up and probabilistically skip dequeue attempts. The system includes verdicts with optional timestamps to handle staleness, a post-release ramp that gradually resumes dequeueing after backpressure clears, and a dry-run mode for testing. Configuration is managed through environment variables, metrics are exported via Prometheus, and the consumer pool receives a new 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Guard against overlapping refresh ticks when a read hangs; use the staleness-aware computeEngaged() for transition/ramp/gauge bookkeeping; close the backpressure Redis client on supervisor shutdown.
Add TRIGGER_DEQUEUE_BACKPRESSURE_REDIS_PASSWORD to the secret strip-list so it never lands in the DEBUG startup log, with a comment to keep new secrets out.
| computeEngaged(): boolean { | ||
| const verdict = this.verdict; | ||
| if (verdict?.engaged !== true) { | ||
| return false; | ||
| } | ||
|
|
||
| const maxAge = this.opts.maxVerdictAgeMs; | ||
| if (maxAge !== undefined && verdict.ts !== undefined && Date.now() - verdict.ts > maxAge) { | ||
| return false; | ||
| } | ||
|
|
||
| return true; | ||
| } |
There was a problem hiding this comment.
🚩 Staleness check bypassed when verdict lacks a ts field
The staleness fail-open in computeEngaged() at apps/supervisor/src/backpressure/backpressureMonitor.ts:106 requires both maxVerdictAgeMs and verdict.ts to be defined. If the cluster-side aggregator writes a verdict like {engaged: true} without a ts field and then dies, the supervisor will never age out that verdict — it stays engaged until the next successful refresh (which might read the same stale key forever if it has no Redis TTL). This is by design (ts is optional in the schema), but operators should be aware: if maxVerdictAgeMs is relied upon for safety, the aggregator must include ts in every verdict it writes. Worth documenting this operational requirement.
Was this helpful? React with 👍 or 👎 to provide feedback.
The supervisor can now pause dequeuing - and freeze consumer-pool scale-up - when a backpressure signal says the cluster can't place more work, then ramp dequeuing back up gradually once it clears. The signal is a verdict published to a Redis key by a cluster-side component; the supervisor reads it on a short refresh and gates
preDequeueon it.Off by default (
TRIGGER_DEQUEUE_BACKPRESSURE_ENABLED). Everything fails open: a missing, stale, or unreadable verdict never pins the brake, and the hot-path read is a synchronous cached lookup with no I/O. The scale-up freeze leaves scale-down untouched, and on release the resume is ramped so a deep queue isn't hammered all at once.Dry-run is on by default (
TRIGGER_DEQUEUE_BACKPRESSURE_DRY_RUN): even once enabled it only logs what it would have done, and surfaces the computed state through metrics, until explicitly set to act. Prometheus:supervisor_backpressure_engaged,_dry_run,_skipped_dequeues_total.Refs TRI-5354