Skip to content

fix(cloud-agent): harden workspace bootstrap#3937

Open
eshurakov wants to merge 1 commit into
mainfrom
persistent-rumba
Open

fix(cloud-agent): harden workspace bootstrap#3937
eshurakov wants to merge 1 commit into
mainfrom
persistent-rumba

Conversation

@eshurakov

@eshurakov eshurakov commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Why

Large repositories can remain healthy while cloning or checking out for longer than the wrapper's previous two-minute wall-clock limit. When those operations were interrupted, the remaining .git directory could also make a later attempt reuse an incomplete workspace, turning one timeout into repeated setup failures.

What was done

  • Replace fixed long-command timeouts with a two-minute output-inactivity watchdog and a five-minute hard limit for Git and setup commands.
  • Bound the complete workspace phase to eight minutes and propagate cancellation through Git, setup, snapshot restore, and patch application, preserving headroom inside the existing ten-minute wrapper readiness budget.
  • Mark bootstrap attempts as pending and write a Git completion marker only after branch preparation, session restore, and setup commands finish; incomplete new workspaces are removed and cloned again.
  • Report sanitized Git percentages and generic setup activity without exposing credentials or setup output, then advance the status to Starting Kilo... while the runtime starts.
  • Bound captured subprocess output and failed-workspace cleanup so noisy or interrupted commands cannot grow memory or cleanup time without limit.

High-level architecture

sequenceDiagram
  participant Orchestrator
  participant Wrapper
  participant Workspace
  participant Kilo
  Orchestrator->>Wrapper: POST /session/ready (10-minute outer budget)
  Wrapper->>Workspace: Prepare repository and session (8-minute shared budget)
  alt Workspace is complete
    Workspace-->>Wrapper: Reuse and refresh credentials
  else Workspace is incomplete or cold
    Wrapper->>Workspace: Mark pending, remove stale state, clone, restore, and run setup
    Wrapper->>Workspace: Write completion marker
  end
  Wrapper->>Kilo: Start runtime
  Wrapper-->>Orchestrator: Session ready
Loading

Architecture decision

Decision: Keep bootstrap lifecycle policy in the wrapper and combine activity-aware command watchdogs with a shared workspace deadline and explicit completion markers.

Context: A single elapsed timeout could not distinguish a slow, active checkout from a stalled process, while .git existence alone could not distinguish a usable workspace from an interrupted clone.

Rationale: The wrapper owns the subprocesses and persisted workspace state, so it can observe output, cancel child processes, and write completion state at the point where bootstrap actually succeeds. Layered two-minute inactivity, five-minute command, eight-minute workspace, and ten-minute readiness limits keep each boundary finite without colliding with startup cleanup.

Alternatives considered:

  • Only increase the old Git timeout. This would allow larger repositories but would still wait too long on silent hangs and reuse interrupted workspaces.
  • Continue treating any .git directory as warm. This avoids recloning but preserves the failure mode where partial checkouts are mistaken for complete workspaces.

Consequences: Active long-running operations receive more time and interrupted new workspaces recover deterministically. Silent commands can still time out after two minutes, markerless legacy workspaces use a compatibility heuristic, and cleanup remains best-effort so the original setup failure is preserved.

Verification

  • Verified locally

Visual Changes

Screenshot 2026-06-10 at 15 02 36

Reviewer Notes

  • Review the nested timeout policy and cancellation propagation through repository preparation and snapshot restore.
  • Markerless legacy workspaces are migrated when Kilo auth exists and Git validates the worktree. A narrowly interrupted pre-marker workspace can still satisfy that heuristic; this is accepted because sessions are predominantly new.
  • The separate App Builder Unauthorized: Invalid token clone failure is intentionally not addressed by this PR.

@kilo-code-bot

kilo-code-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary

restoreSession does not handle the two new TerminationReason values (inactivity_timeout, hard_timeout) introduced by this PR, producing a misleading error message when kilo import is killed by those timers.

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0
Issue Details (click to expand)

WARNING

File Line Issue
services/cloud-agent-next/wrapper/src/restore-session.ts 586 terminationReason === 'timeout' check does not cover new 'inactivity_timeout' / 'hard_timeout' reasons
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

services/cloud-agent-next/wrapper/src/restore-session.ts:586 — The existing check:

if (importResult.terminationReason === 'timeout') {
  return fail(`kilo import timed out after ${importTimeoutMs}ms`, null, 'import');
}

This PR introduces two new TerminationReason values (inactivity_timeout, hard_timeout) to utils.ts. When kilo import is terminated by one of those two new timers, the guard above will not match, and the code will fall through to:

if (importResult.exitCode !== 0) {
  return fail(`kilo import failed exitCode=${importResult.exitCode}`, null, 'import');
}

The returned error string 'kilo import failed exitCode=124' is misleading — it obscures the root cause (an inactivity or hard timeout). For the workspace-deadline use case the outer wrapper converts any abort to a 'Workspace preparation timed out' message, so no user-visible regression exists there. But the restoreSession CLI entrypoint and any direct callers relying on structured result.error will receive the opaque message.

Suggested fix:

const isTimeoutLike =
  importResult.terminationReason === 'timeout' ||
  importResult.terminationReason === 'inactivity_timeout' ||
  importResult.terminationReason === 'hard_timeout';
if (isTimeoutLike) {
  return fail(`kilo import timed out after ${importTimeoutMs}ms`, null, 'import');
}
Files Reviewed (5 files)
  • services/cloud-agent-next/src/execution/orchestrator.ts — comment update
  • services/cloud-agent-next/wrapper/src/restore-session.ts — signal propagation through download, validate, import, extract-diffs, apply-patch; applyPatch converted from sync to async
  • services/cloud-agent-next/wrapper/src/session-bootstrap.ts — bootstrap completion markers, workspace deadline, activity watchdogs, sanitized progress reporting, legacy workspace migration
  • services/cloud-agent-next/wrapper/src/utils.tsinactivityTimeoutMs, hardTimeoutMs, onOutput callback, bounded output capture
  • Test files — comprehensive coverage of new paths

Fix these issues in Kilo Cloud


Reviewed by claude-4.6-sonnet-20260217 · 3,331,921 tokens

Review guidance: REVIEW.md from base branch main

if (!(await exists(path.join(workspacePath, '.git')))) return false;
if (await exists(gitBootstrapMarkerPath(workspacePath))) return true;
if (await exists(bootstrapPendingMarkerPath(sessionHome))) return false;
if (!(await exists(sessionAuthFilePath(sessionHome)))) return false;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this require evidence written after restore/setup rather than only auth.json? The old flow writes auth before both potentially long steps, so a legacy attempt interrupted during restore or setup is migrated as complete and will skip recovery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants