fix(cloud-agent): recover sessions after runtime starvation#3908

Open
eshurakov wants to merge 1 commit into main from majestic-glow

@eshurakov

Contributor

Summary

Why

Long package installs, formatting, linting, and test commands can starve Kilo inside the shared sandbox while producing thousands of cumulative Bash updates. Observed failures left orphaned Kilo runtimes behind and could exhaust a follow-up message's delivery retries without ever terminalizing it. Recovery therefore has to coordinate durable message state with the physical wrapper/Kilo lifecycle without replaying a turn that may already have changed the workspace.

What was done

  • Coalesce running Bash snapshots at the wrapper ingest boundary to one latest update per 150 ms while preserving leading, terminal, error, interruption, and idle boundaries.
  • Launch Kilo at niceness 10 across production, development, DIND, and bootstrapped devcontainer paths while leaving the wrapper at normal priority.
  • Propagate the leased wrapper identity to Kilo descendants, discover surviving runtimes through /proc, and revalidate exact ownership before signalling direct or devcontainer PIDs.
  • Fail closed when runtime inspection or cleanup is incomplete, and allow warm reuse only when one wrapper and all observed descendants match the same lease generation.
  • Schedule durable repair before persisting exhausted terminalization-pending work so alarm replay can finish failure settlement without redispatching the turn; retain guarded legacy ownership matching for rolling deployments.
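
The coalescing rule in the first bullet can be sketched as a small, timer-free state machine. This is a minimal illustration, not the PR's running-bash-event-coalescer.ts: the names RunningBashCoalescer, BashUpdate, push, and tick are hypothetical, the clock is passed in explicitly so the behaviour is deterministic, and it assumes a terminal snapshot is cumulative and may therefore supersede a held running snapshot.

```typescript
// Sketch only: names and semantics are assumptions, not the PR's code.
type BashStatus = "running" | "completed" | "error";

interface BashUpdate {
  partId: string;
  status: BashStatus;
  output: string; // cumulative output so far
}

interface PartState {
  lastEmitMs: number;
  pending: BashUpdate | undefined;
}

class RunningBashCoalescer {
  private readonly parts = new Map<string, PartState>();
  private readonly windowMs: number;

  constructor(windowMs: number = 150) {
    this.windowMs = windowMs;
  }

  // Returns the updates that should be forwarded right now.
  push(update: BashUpdate, nowMs: number): BashUpdate[] {
    if (update.status !== "running") {
      // Terminal and error boundaries are never delayed. Assumption: the
      // terminal snapshot is cumulative, so a held running snapshot is dropped.
      this.parts.delete(update.partId);
      return [update];
    }
    const state = this.parts.get(update.partId);
    if (state === undefined || nowMs - state.lastEmitMs >= this.windowMs) {
      // Leading edge: the first update, or one arriving after a quiet window.
      this.parts.set(update.partId, { lastEmitMs: nowMs, pending: undefined });
      return [update];
    }
    // Inside the window: keep only the latest snapshot.
    state.pending = update;
    return [];
  }

  // Called periodically; releases held snapshots whose window has elapsed.
  tick(nowMs: number): BashUpdate[] {
    const due: BashUpdate[] = [];
    this.parts.forEach((state) => {
      if (state.pending !== undefined && nowMs - state.lastEmitMs >= this.windowMs) {
        due.push(state.pending);
        state.pending = undefined;
        state.lastEmitMs = nowMs;
      }
    });
    return due;
  }
}
```

In this sketch a burst of running snapshots inside one 150 ms window costs at most two forwarded events per part: the leading one and the latest one.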

High-level architecture

sequenceDiagram
  participant Caller
  participant DO as Cloud Agent Durable Object
  participant Sandbox
  participant Wrapper
  participant Kilo

  Caller->>DO: Admit durable message
  DO->>Sandbox: Prepare or reuse leased runtime
  Sandbox->>Wrapper: Start with instance and generation
  Wrapper->>Kilo: Launch at nice 10 with inherited owner
  Kilo-->>Wrapper: Running Bash snapshots
  Wrapper-->>DO: Leading, latest, and terminal events

  opt Delivery exhausts
    DO->>DO: Schedule repair alarm
    DO->>DO: Persist terminalization-pending
    DO->>DO: Terminalize failed without redispatch
  end

  opt Runtime cleanup is required
    DO->>Sandbox: Inspect wrapper and owned descendants
    Sandbox->>Sandbox: Revalidate owner and signal exact PIDs
    Sandbox-->>DO: Report absent after re-observation
  end

Architecture decision

Decision: Recover each failure at its owning lifecycle boundary: the Durable Object guarantees settlement repair, the wrapper limits event amplification, and the sandbox lifecycle owns physical descendant cleanup through the existing lease identity.

Context: Durable message state outlives wrapper processes, Kilo runs as a separate process whose commands can survive wrapper exit, and cumulative Bash updates create pressure before events reach the Durable Object.

Rationale: Scheduling repair before terminal disposition removes the stranded-state window; inherited lease ownership remains observable after wrapper death; and coalescing before WebSocket transmission reduces load at the earliest controllable boundary.
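
The ordering argument in the rationale can be illustrated with a minimal sketch. Store, RepairScheduler, MemoryStore, markDeliveryExhausted, and onRepairAlarm are hypothetical stand-ins for Durable Object storage and alarms, and the one-second repair delay is arbitrary. The point is only the order of the two awaited steps.

```typescript
// Sketch only: these interfaces are assumptions, not the PR's actual types.
type Disposition = "delivering" | "terminalization-pending" | "failed";

interface MessageRecord {
  id: string;
  disposition: Disposition;
}

interface Store {
  get(id: string): Promise<MessageRecord | undefined>;
  put(record: MessageRecord): Promise<void>;
}

interface RepairScheduler {
  scheduleRepair(atMs: number): Promise<void>;
}

// In-memory store used only to exercise the sketch.
class MemoryStore implements Store {
  private readonly records = new Map<string, MessageRecord>();
  async get(id: string): Promise<MessageRecord | undefined> {
    return this.records.get(id);
  }
  async put(record: MessageRecord): Promise<void> {
    this.records.set(record.id, record);
  }
}

async function markDeliveryExhausted(
  store: Store,
  alarms: RepairScheduler,
  id: string,
  nowMs: number,
): Promise<void> {
  const record = await store.get(id);
  if (record === undefined || record.disposition !== "delivering") return;
  // Step 1: schedule repair. If this throws, nothing has been persisted and
  // the record stays in its pre-exhaustion state.
  await alarms.scheduleRepair(nowMs + 1000);
  // Step 2: only now persist the state that depends on the repair alarm.
  await store.put({ ...record, disposition: "terminalization-pending" });
}

// Alarm replay settles the failure and never redispatches the turn.
async function onRepairAlarm(store: Store, id: string): Promise<boolean> {
  const record = await store.get(id);
  if (record === undefined || record.disposition !== "terminalization-pending") {
    return false;
  }
  await store.put({ ...record, disposition: "failed" });
  return true;
}
```

If the two steps were reversed, a failure between them would leave a terminalization-pending record with no alarm to finish it, which is the stranded-state window described above.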

Alternatives considered:

  • Replay exhausted delivery. A turn may already have produced workspace or external side effects, so automatic replay can duplicate work.
  • Clean up only wrapper-marker processes. Reparented .kilo serve processes and command descendants can outlive the wrapper and poison subsequent recovery.
  • Coalesce in Worker ingest or the UI. Downstream reduction would not protect the wrapper's event loop, WebSocket, or disconnected-event buffer.

Consequences: Exhausted work can settle durably without redispatch, and cleanup no longer reports absence while an observed owned runtime remains. The trade-offs are a dependence on Linux /proc and shell tools, intentional loss of intermediate cumulative snapshots, and CPU prioritization that is only advisory, with no hard memory or CPU isolation.

Verification

  • Rebuilt the local sandbox and verified incomplete cleanup held queued delivery until the owned runtime was removed, then drained it successfully.
  • Verified the wrapper remained at nice 0 while kilo-real and .kilo serve ran at nice 10 with the same runtime-owner marker.
  • Completed both a cold message and a warm follow-up; the warm turn reused the same wrapper PID, Kilo PIDs, port, and physical generation and still emitted assistant output and completion.

Visual Changes

N/A

Reviewer Notes

  • Focus on /proc ownership parsing and the guarded pre-KILO_RUNTIME_OWNER fallback; inspection failures intentionally block reuse and cleanup completion.
  • Running Bash coalescing deliberately drops intermediate cumulative snapshots but preserves the first, latest, terminal, and lifecycle-boundary events.
  • Niceness improves scheduler preference but does not reserve memory; hard cgroup isolation is not available through the current sandbox runtime contract.
  • Wrapper handoff remains at-least-once under ambiguous delivery failures; this change repairs exhausted terminalization rather than claiming exactly-once delivery.
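
The niceness note can be made concrete with a small sketch of how a launch command might be wrapped. LaunchSpec and withNiceness are hypothetical names, and the PR's actual launch shim may be built differently. The sketch relies only on the standard `nice -n <value> <command>` form and on the fact that a process's nice value is inherited by its children.

```typescript
// Sketch only: LaunchSpec and withNiceness are illustrative assumptions.
interface LaunchSpec {
  command: string;
  args: string[];
}

// Wraps a command so that it, and every descendant it spawns, starts at the
// given niceness. The caller's own priority is left unchanged.
function withNiceness(spec: LaunchSpec, niceness: number): LaunchSpec {
  if (!Number.isInteger(niceness) || niceness < -20 || niceness > 19) {
    throw new RangeError("niceness must be an integer between -20 and 19");
  }
  if (niceness === 0) return spec;
  return {
    command: "nice",
    args: ["-n", String(niceness), spec.command, ...spec.args],
  };
}

// Example: the wrapper keeps nice 0 and launches Kilo at nice 10.
const kiloLaunch = withNiceness({ command: "kilo-real", args: ["serve"] }, 10);
```

Because the value is only a scheduler hint, a Kilo command can still exhaust memory; this matches the note above that niceness does not reserve resources.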

@kilo-code-bot

kilo-code-bot Bot commented Jun 10, 2026

Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

This PR introduces well-structured runtime starvation recovery across four coordinated layers (Bash event coalescing, nice-10 Kilo launch shim, /proc-based ownership inspection, and durable terminalization repair), with no security vulnerabilities, memory leaks, or logic bugs found.

Files Reviewed (22 files)
  • services/cloud-agent-next/Dockerfile
  • services/cloud-agent-next/Dockerfile.dev
  • services/cloud-agent-next/Dockerfile.dind
  • services/cloud-agent-next/src/agent-sandbox/cloudflare/cloudflare-agent-sandbox.ts
  • services/cloud-agent-next/src/agent-sandbox/cloudflare/cloudflare-agent-sandbox.test.ts
  • services/cloud-agent-next/src/agent-sandbox/protocol.ts
  • services/cloud-agent-next/src/kilo/devcontainer.ts
  • services/cloud-agent-next/src/kilo/devcontainer.test.ts
  • services/cloud-agent-next/src/kilo/wrapper-client.ts
  • services/cloud-agent-next/src/kilo/wrapper-client.test.ts
  • services/cloud-agent-next/src/kilo/wrapper-manager.ts
  • services/cloud-agent-next/src/kilo/wrapper-manager.test.ts
  • services/cloud-agent-next/src/session/agent-runtime.ts
  • services/cloud-agent-next/src/session/agent-runtime.test.ts
  • services/cloud-agent-next/src/session/pending-messages.ts
  • services/cloud-agent-next/src/session/pending-messages.test.ts
  • services/cloud-agent-next/src/session/session-message-queue.ts
  • services/cloud-agent-next/src/session/session-message-queue.test.ts
  • services/cloud-agent-next/test/integration/session/pending-messages.test.ts
  • services/cloud-agent-next/wrapper/src/connection.ts
  • services/cloud-agent-next/wrapper/src/running-bash-event-coalescer.ts
  • services/cloud-agent-next/wrapper/src/running-bash-event-coalescer.test.ts

Notable Design Observations

Coalescer memory/timer safety: close() calls flushAll(), which synchronously cancels all pending timers via clearTimeout before clearing throttledParts. A timer callback that was already queued when close() ran finds the map cleared and returns immediately. No leak path exists.

Shell injection safety: All dynamic values in /proc-inspection scripts and signal scripts pass through shellQuote(). Ownership values are additionally constrained (no colons in sessionId/instanceId via serializeRuntimeOwner) before being embedded in awk and sed patterns.
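
For readers unfamiliar with the technique, the two defences described here can be sketched as follows. quoteForPosixShell and serializeOwnerSketch are hypothetical helpers, not the PR's shellQuote or serializeRuntimeOwner, and the colon-separated owner format is an assumption made for illustration.

```typescript
// Sketch only: these helpers are assumptions, not the PR's implementations.

// POSIX single-quote escaping: inside single quotes nothing is special, so the
// only character to handle is the single quote itself. Each one is replaced by
// '\'' (close quote, escaped quote, reopen quote).
function quoteForPosixShell(value: string): string {
  return "'" + value.split("'").join("'\\''") + "'";
}

interface RuntimeOwner {
  sessionId: string;
  instanceId: string;
  generation: number;
}

// Assumed format "<sessionId>:<instanceId>:<generation>". Rejecting colons in
// the identifiers keeps the serialized value unambiguous when it is later
// matched by a text pattern.
function serializeOwnerSketch(owner: RuntimeOwner): string {
  for (const field of [owner.sessionId, owner.instanceId]) {
    if (field.length === 0 || field.indexOf(":") !== -1) {
      throw new Error("owner identifiers must be non-empty and contain no colon");
    }
  }
  return owner.sessionId + ":" + owner.instanceId + ":" + String(owner.generation);
}
```

Quoting protects the shell layer, while the colon constraint protects the pattern-matching layer; neither substitutes for the other.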

Fail-closed on repair scheduling: scheduleTerminalizationRepair is awaited before storage.put for the terminalization-pending disposition. If alarm scheduling fails, the message remains in its pre-exhaustion state and will be retried on the next alarm — correct behavior that avoids orphaning a message without a repair path.

Warm reuse guard: The updated matchingObserved check correctly requires exactly one non-runtime (wrapper) process while allowing any number of co-owned runtime processes, and verifies all observed processes share the same instance identity — preventing warm reuse when a stale runtime from a different generation is present.
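
The guard described above might look like the following sketch. ObservedProcess, Lease, and canReuseWarm are hypothetical names, and the real matchingObserved check may carry more fields. An undefined observation list stands for a failed inspection, which blocks reuse.

```typescript
// Sketch only: types and names are assumptions, not the PR's code.
interface ObservedProcess {
  pid: number;
  kind: "wrapper" | "runtime";
  instanceId: string;
  generation: number;
}

interface Lease {
  instanceId: string;
  generation: number;
}

function canReuseWarm(observed: ObservedProcess[] | undefined, lease: Lease): boolean {
  // Fail closed: an inspection that did not complete never permits reuse.
  if (observed === undefined) return false;
  // Exactly one wrapper; any number of co-owned runtime processes.
  const wrappers = observed.filter((p) => p.kind === "wrapper");
  if (wrappers.length !== 1) return false;
  // Every observed process must belong to the current lease generation.
  return observed.every(
    (p) => p.instanceId === lease.instanceId && p.generation === lease.generation,
  );
}
```

A single stale runtime from an earlier generation is enough to reject warm reuse in this sketch, even when the wrapper itself matches.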

AWK field extraction: The ownedRuntimeInspectionCommand awk uses substr($0, index($0, "=") + 1) rather than $2, correctly handling values that contain = (e.g. environment variables with base64 values).
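
The same split-at-the-first-equals rule can be shown in TypeScript. This is an illustrative equivalent, not the PR's awk script: parseEnviron is a hypothetical helper, and the owner value in the usage line is made up. It relies on the fact that Linux exposes a process environment in /proc/<pid>/environ as NUL-separated KEY=value entries.

```typescript
// Sketch only: parseEnviron is an illustrative helper, not the PR's code.
function parseEnviron(raw: string): Map<string, string> {
  const env = new Map<string, string>();
  for (const entry of raw.split("\0")) {
    // Split at the FIRST "=" only, so values that themselves contain "="
    // (for example base64 padding) are kept intact.
    const eq = entry.indexOf("=");
    if (eq <= 0) continue; // skip empty entries and entries without a key
    env.set(entry.slice(0, eq), entry.slice(eq + 1));
  }
  return env;
}

// Usage with made-up values; "YQ==" is base64 with padding.
const parsed = parseEnviron("KILO_RUNTIME_OWNER=s1:i1:3\0TOKEN=YQ==\0");
```

Splitting on every "=" and taking the second field would truncate the TOKEN value to "YQ", which is the failure the awk expression avoids.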

Reviewed by claude-4.6-sonnet-20260217 · 2,201,760 tokens

Review guidance: REVIEW.md from base branch main
