fix(cloud-agent): recover sessions after runtime starvation by eshurakov · Pull Request #3908 · Kilo-Org/cloud

eshurakov · 2026-06-10T09:25:51Z

Summary

Why

Long package installs, formatting, linting, and test commands can starve Kilo inside the shared sandbox while producing thousands of cumulative Bash updates. Observed failures left orphaned Kilo runtimes behind and could exhaust a follow-up message's delivery retries without ever terminalizing it. Recovery therefore has to coordinate durable message state with the physical wrapper/Kilo lifecycle without replaying a turn that may already have changed the workspace.

What was done

Coalesce running Bash snapshots at the wrapper ingest boundary to one latest update per 150 ms while preserving leading, terminal, error, interruption, and idle boundaries.
Launch Kilo at niceness 10 across production, development, DIND, and bootstrapped devcontainer paths while leaving the wrapper at normal priority.
Propagate the leased wrapper identity to Kilo descendants, discover surviving runtimes through /proc, and revalidate exact ownership before signalling direct or devcontainer PIDs.
Fail closed when runtime inspection or cleanup is incomplete, and allow warm reuse only when one wrapper and all observed descendants match the same lease generation.
Schedule durable repair before persisting exhausted terminalization-pending work so alarm replay can finish failure settlement without redispatching the turn; retain guarded legacy ownership matching for rolling deployments.
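
The coalescing rule in the first item can be sketched as a small class. This is an illustrative sketch only, not the code in running-bash-event-coalescer.ts: the `BashEvent` shape, the `kind` values, and the `push`/`tick` names are assumptions, time is passed in explicitly so the behavior is deterministic, and releasing a held-back snapshot just before a boundary event is one possible policy rather than the confirmed one.

```typescript
// Hypothetical event shape; the real wrapper event types are not shown in this PR.
type BashEvent = {
  partId: string;
  kind: "running" | "terminal" | "error";
  output: string; // cumulative output so far
};

const WINDOW_MS = 150;

type PartState = { lastEmitMs: number; pending: BashEvent | undefined };

class RunningBashCoalescer {
  // Per part: when a running snapshot was last emitted, plus the newest held-back one.
  private state = new Map<string, PartState>();

  /** Returns the events that should be forwarded right now. */
  push(event: BashEvent, nowMs: number): BashEvent[] {
    const entry = this.state.get(event.partId);

    if (event.kind !== "running") {
      // Boundary events are never throttled. Assumption: release any held snapshot first.
      const held = entry ? entry.pending : undefined;
      this.state.delete(event.partId);
      return held ? [held, event] : [event];
    }

    if (!entry) {
      // Leading snapshot: emit immediately and open a window.
      this.state.set(event.partId, { lastEmitMs: nowMs, pending: undefined });
      return [event];
    }

    if (nowMs - entry.lastEmitMs >= WINDOW_MS) {
      // Window already elapsed: emit and start a new one.
      entry.lastEmitMs = nowMs;
      entry.pending = undefined;
      return [event];
    }

    // Inside the window: keep only the latest snapshot, dropping the previous one.
    entry.pending = event;
    return [];
  }

  /** Releases held-back snapshots whose window has elapsed. */
  tick(nowMs: number): BashEvent[] {
    const out: BashEvent[] = [];
    this.state.forEach((entry) => {
      if (entry.pending && nowMs - entry.lastEmitMs >= WINDOW_MS) {
        out.push(entry.pending);
        entry.lastEmitMs = nowMs;
        entry.pending = undefined;
      }
    });
    return out;
  }
}
```

Because Bash snapshots are cumulative, dropping intermediate ones loses no final output; a consumer only sees fewer refreshes while the command is running.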

High-level architecture

sequenceDiagram
  participant Caller
  participant DO as Cloud Agent Durable Object
  participant Sandbox
  participant Wrapper
  participant Kilo

  Caller->>DO: Admit durable message
  DO->>Sandbox: Prepare or reuse leased runtime
  Sandbox->>Wrapper: Start with instance and generation
  Wrapper->>Kilo: Launch at nice 10 with inherited owner
  Kilo-->>Wrapper: Running Bash snapshots
  Wrapper-->>DO: Leading, latest, and terminal events

  opt Delivery exhausts
    DO->>DO: Schedule repair alarm
    DO->>DO: Persist terminalization-pending
    DO->>DO: Terminalize failed without redispatch
  end

  opt Runtime cleanup is required
    DO->>Sandbox: Inspect wrapper and owned descendants
    Sandbox->>Sandbox: Revalidate owner and signal exact PIDs
    Sandbox-->>DO: Report absent after re-observation
  end

Architecture decision

Decision: Recover each failure at its owning lifecycle boundary: the Durable Object guarantees settlement repair, the wrapper limits event amplification, and the sandbox lifecycle owns physical descendant cleanup through the existing lease identity.

Context: Durable message state outlives wrapper processes, Kilo runs as a separate process whose commands can survive wrapper exit, and cumulative Bash updates create pressure before events reach the Durable Object.

Rationale: Scheduling repair before terminal disposition removes the stranded-state window; inherited lease ownership remains observable after wrapper death; and coalescing before WebSocket transmission reduces load at the earliest controllable boundary.

Alternatives considered:

Replay exhausted delivery. A turn may already have produced workspace or external side effects, so automatic replay can duplicate work.
Clean up only wrapper-marker processes. Reparented .kilo serve processes and command descendants can outlive the wrapper and poison subsequent recovery.
Coalesce in Worker ingest or the UI. Downstream reduction would not protect the wrapper's event loop, WebSocket, or disconnected-event buffer.

Consequences: Exhausted work can settle durably without redispatch, and cleanup no longer reports absence while an observed owned runtime remains. The trade-offs are dependence on Linux /proc and shell tools, intentional loss of intermediate cumulative snapshots, and advisory CPU prioritization rather than hard memory or CPU isolation.

Verification

Rebuilt the local sandbox and verified incomplete cleanup held queued delivery until the owned runtime was removed, then drained it successfully.
Verified the wrapper remained at nice 0 while kilo-real and .kilo serve ran at nice 10 with the same runtime-owner marker.
Completed both a cold message and a warm follow-up; the warm turn reused the same wrapper PID, Kilo PIDs, port, and physical generation and still emitted assistant output and completion.

Visual Changes

N/A

Reviewer Notes

Focus on /proc ownership parsing and the guarded pre-KILO_RUNTIME_OWNER fallback; inspection failures intentionally block reuse and cleanup completion.
Running Bash coalescing deliberately drops intermediate cumulative snapshots but preserves the first, latest, terminal, and lifecycle-boundary events.
Niceness improves scheduler preference but does not reserve memory; hard cgroup isolation is not available through the current sandbox runtime contract.
Wrapper handoff remains at-least-once under ambiguous delivery failures; this change repairs exhausted terminalization rather than claiming exactly-once delivery.

kilo-code-bot · 2026-06-10T09:30:56Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

This PR introduces well-structured runtime starvation recovery across four coordinated layers (Bash event coalescing, nice-10 Kilo launch shim, /proc-based ownership inspection, and durable terminalization repair), with no security vulnerabilities, memory leaks, or logic bugs found.

Files Reviewed (22 files)

services/cloud-agent-next/Dockerfile
services/cloud-agent-next/Dockerfile.dev
services/cloud-agent-next/Dockerfile.dind
services/cloud-agent-next/src/agent-sandbox/cloudflare/cloudflare-agent-sandbox.ts
services/cloud-agent-next/src/agent-sandbox/cloudflare/cloudflare-agent-sandbox.test.ts
services/cloud-agent-next/src/agent-sandbox/protocol.ts
services/cloud-agent-next/src/kilo/devcontainer.ts
services/cloud-agent-next/src/kilo/devcontainer.test.ts
services/cloud-agent-next/src/kilo/wrapper-client.ts
services/cloud-agent-next/src/kilo/wrapper-client.test.ts
services/cloud-agent-next/src/kilo/wrapper-manager.ts
services/cloud-agent-next/src/kilo/wrapper-manager.test.ts
services/cloud-agent-next/src/session/agent-runtime.ts
services/cloud-agent-next/src/session/agent-runtime.test.ts
services/cloud-agent-next/src/session/pending-messages.ts
services/cloud-agent-next/src/session/pending-messages.test.ts
services/cloud-agent-next/src/session/session-message-queue.ts
services/cloud-agent-next/src/session/session-message-queue.test.ts
services/cloud-agent-next/test/integration/session/pending-messages.test.ts
services/cloud-agent-next/wrapper/src/connection.ts
services/cloud-agent-next/wrapper/src/running-bash-event-coalescer.ts
services/cloud-agent-next/wrapper/src/running-bash-event-coalescer.test.ts

Notable Design Observations

Coalescer memory/timer safety: close() calls flushAll(), which synchronously cancels all pending timers via clearTimeout before clearing throttledParts. A timer callback that still runs after close() finds the map already cleared and returns immediately. No leak path exists.
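
The shutdown behavior described above can be illustrated with a minimal sketch. Only the names close(), flushAll(), and throttledParts come from the review; the class name, the `hold` method, and the emit callback are hypothetical, and the real coalescer may be structured differently.

```typescript
type Held<T> = { latest: T; timer: ReturnType<typeof setTimeout> };

class ThrottledPartBuffer<T> {
  private throttledParts = new Map<string, Held<T>>();
  private closed = false;
  private emit: (partId: string, value: T) => void;
  private windowMs: number;

  constructor(emit: (partId: string, value: T) => void, windowMs: number) {
    this.emit = emit;
    this.windowMs = windowMs;
  }

  /** Hold the latest value for a part; one timer per part releases it later. */
  hold(partId: string, value: T): void {
    if (this.closed) return;
    const existing = this.throttledParts.get(partId);
    if (existing) {
      existing.latest = value; // overwrite, keep the timer that is already armed
      return;
    }
    const timer = setTimeout(() => this.flush(partId), this.windowMs);
    this.throttledParts.set(partId, { latest: value, timer });
  }

  private flush(partId: string): void {
    const entry = this.throttledParts.get(partId);
    if (!entry) return; // a late timer after flushAll()/close() is a no-op
    clearTimeout(entry.timer);
    this.throttledParts.delete(partId);
    this.emit(partId, entry.latest);
  }

  /** Synchronously cancels every timer and emits what was held. */
  flushAll(): void {
    Array.from(this.throttledParts.keys()).forEach((id) => this.flush(id));
  }

  close(): void {
    this.flushAll();
    this.closed = true;
  }
}
```

Each map entry owns exactly one timer, and removing the entry always clears that timer first, which is why no timer can outlive its entry in this sketch.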

Shell injection safety: All dynamic values in /proc-inspection scripts and signal scripts pass through shellQuote(). Ownership values are additionally constrained (no colons in sessionId/instanceId via serializeRuntimeOwner) before being embedded in awk and sed patterns.
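
For readers unfamiliar with the two-layer defense, here is a minimal sketch. The function names shellQuote and serializeRuntimeOwner appear in the review, but their implementations are not shown, so the bodies below are assumptions: a common POSIX single-quote escaping technique and a hypothetical colon-delimited owner format.

```typescript
/** Quote a value for a POSIX shell: wrap in single quotes, escape embedded single quotes. */
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Hypothetical owner shape; field names are assumptions for illustration.
type RuntimeOwner = { sessionId: string; instanceId: string; generation: number };

/** Assumed colon-delimited format. Colons are rejected so fields cannot be confused. */
function serializeRuntimeOwner(owner: RuntimeOwner): string {
  for (const field of [owner.sessionId, owner.instanceId]) {
    if (field.length === 0 || field.indexOf(":") !== -1) {
      throw new Error("runtime owner fields must be non-empty and must not contain ':'");
    }
  }
  return owner.sessionId + ":" + owner.instanceId + ":" + String(owner.generation);
}

// The serialized owner is quoted before being embedded in a shell command string.
const ownerArg = shellQuote(
  serializeRuntimeOwner({ sessionId: "s1", instanceId: "i1", generation: 3 }),
);
```

Shell quoting alone does not neutralize characters that are special inside awk or sed patterns, which is why constraining the value itself is a useful second check.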

Fail-closed on repair scheduling: scheduleTerminalizationRepair is awaited before storage.put for the terminalization-pending disposition. If alarm scheduling fails, the message remains in its pre-exhaustion state and will be retried on the next alarm — correct behavior that avoids orphaning a message without a repair path.
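
The ordering described above can be sketched as follows. The `MessageStore` interface, the key layout, and the `markExhausted` helper are hypothetical stand-ins; only the names scheduleTerminalizationRepair and the terminalization-pending disposition come from the review.

```typescript
// Hypothetical dispositions and storage interface; the real Durable Object APIs are richer.
type Disposition = "terminalization-pending" | "failed";

interface MessageStore {
  put(key: string, value: Disposition): Promise<void>;
}

async function markExhausted(
  messageId: string,
  store: MessageStore,
  scheduleTerminalizationRepair: (messageId: string) => Promise<void>,
): Promise<void> {
  // 1. Arm the repair path first. If this rejects, nothing below runs and the
  //    message keeps its pre-exhaustion state.
  await scheduleTerminalizationRepair(messageId);

  // 2. Only then persist the disposition that relies on the repair alarm.
  await store.put("message:" + messageId + ":disposition", "terminalization-pending");
}
```

Reversing the two steps would open a window in which the disposition is persisted but no alarm exists to finish settlement, which is the stranded state this change removes.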

Warm reuse guard: The updated matchingObserved check correctly requires exactly one non-runtime (wrapper) process while allowing any number of co-owned runtime processes, and verifies all observed processes share the same instance identity — preventing warm reuse when a stale runtime from a different generation is present.
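
A minimal sketch of such a guard, under stated assumptions: the `ObservedProcess` and `Lease` shapes and the `canReuseWarm` name are hypothetical, and a `null` observation stands in for an inspection that failed or was incomplete.

```typescript
// Hypothetical shape for a process discovered through /proc.
type ObservedProcess = {
  pid: number;
  role: "wrapper" | "runtime";
  instanceId: string;
  generation: number;
};

type Lease = { instanceId: string; generation: number };

/**
 * Warm reuse is allowed only when inspection succeeded, exactly one wrapper
 * exists, and every observed process carries the lease identity.
 */
function canReuseWarm(lease: Lease, observed: ObservedProcess[] | null): boolean {
  if (observed === null) return false; // inspection failed: fail closed
  const wrappers = observed.filter((p) => p.role === "wrapper");
  if (wrappers.length !== 1) return false;
  return observed.every(
    (p) => p.instanceId === lease.instanceId && p.generation === lease.generation,
  );
}
```

Any number of runtime processes may be present, as long as none of them belongs to a different instance or generation.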

AWK field extraction: The ownedRuntimeInspectionCommand awk uses substr($0, index($0, "=") + 1) rather than $2, correctly handling values that contain = (e.g. environment variables with base64 values).
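
The same first-separator rule can be shown in TypeScript. This is an analogue of the awk expression for illustration, not code from the PR; the helper names are hypothetical, and it assumes the value is read from NAME=value entries such as those in /proc/<pid>/environ, which are NUL-separated.

```typescript
/** Split a NAME=value entry at the first "=" only, so values containing "=" survive intact. */
function parseEnvEntry(entry: string): { name: string; value: string } | null {
  const eq = entry.indexOf("=");
  if (eq <= 0) return null; // no "=" or empty name
  return { name: entry.slice(0, eq), value: entry.slice(eq + 1) };
}

/** Look up one variable in a NUL-separated environ blob. */
function findEnvValue(environ: string, name: string): string | undefined {
  for (const entry of environ.split("\0")) {
    const parsed = parseEnvEntry(entry);
    if (parsed && parsed.name === name) return parsed.value;
  }
  return undefined;
}

// A base64-style value with "=" padding is where a naive second-field split goes wrong.
const sample = "PATH=/bin\0KILO_RUNTIME_OWNER=YWJj==\0";
const naive = "KILO_RUNTIME_OWNER=YWJj==".split("=")[1]; // truncated at the padding
const correct = findEnvValue(sample, "KILO_RUNTIME_OWNER"); // full value
```

Taking the second "="-delimited field, as `$2` would with `-F=`, silently truncates such values, which can make two different owners compare equal.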


Reviewed by claude-4.6-sonnet-20260217 · 2,201,760 tokens

Review guidance: REVIEW.md from base branch main

fix(cloud-agent-next): recover starved sessions safely

f8705b5

eshurakov wants to merge 1 commit into main from majestic-glow
