fix(cloud-agent): harden workspace bootstrap#3937
Conversation
Code Review SummaryStatus: 1 Issue Found | Recommendation: Address before merge Executive Summary
Overview
Issue Details (click to expand)WARNING
Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments:
if (importResult.terminationReason === 'timeout') {
return fail(`kilo import timed out after ${importTimeoutMs}ms`, null, 'import');
}This PR introduces two new if (importResult.exitCode !== 0) {
return fail(`kilo import failed exitCode=${importResult.exitCode}`, null, 'import');
}The returned error string Suggested fix: const isTimeoutLike =
importResult.terminationReason === 'timeout' ||
importResult.terminationReason === 'inactivity_timeout' ||
importResult.terminationReason === 'hard_timeout';
if (isTimeoutLike) {
return fail(`kilo import timed out after ${importTimeoutMs}ms`, null, 'import');
}Files Reviewed (5 files)
Fix these issues in Kilo Cloud Reviewed by claude-4.6-sonnet-20260217 · 3,331,921 tokens Review guidance: REVIEW.md from base branch |
| if (!(await exists(path.join(workspacePath, '.git')))) return false; | ||
| if (await exists(gitBootstrapMarkerPath(workspacePath))) return true; | ||
| if (await exists(bootstrapPendingMarkerPath(sessionHome))) return false; | ||
| if (!(await exists(sessionAuthFilePath(sessionHome)))) return false; |
There was a problem hiding this comment.
Can this require evidence written after restore/setup rather than only auth.json? The old flow writes auth before both potentially long steps, so a legacy attempt interrupted during restore or setup is migrated as complete and will skip recovery.
Summary
Why
Large repositories can remain healthy while cloning or checking out for longer than the wrapper's previous two-minute wall-clock limit. When those operations were interrupted, the remaining
.gitdirectory could also make a later attempt reuse an incomplete workspace, turning one timeout into repeated setup failures.What was done
Starting Kilo...while the runtime starts.High-level architecture
sequenceDiagram participant Orchestrator participant Wrapper participant Workspace participant Kilo Orchestrator->>Wrapper: POST /session/ready (10-minute outer budget) Wrapper->>Workspace: Prepare repository and session (8-minute shared budget) alt Workspace is complete Workspace-->>Wrapper: Reuse and refresh credentials else Workspace is incomplete or cold Wrapper->>Workspace: Mark pending, remove stale state, clone, restore, and run setup Wrapper->>Workspace: Write completion marker end Wrapper->>Kilo: Start runtime Wrapper-->>Orchestrator: Session readyArchitecture decision
Decision: Keep bootstrap lifecycle policy in the wrapper and combine activity-aware command watchdogs with a shared workspace deadline and explicit completion markers.
Context: A single elapsed timeout could not distinguish a slow, active checkout from a stalled process, while
.gitexistence alone could not distinguish a usable workspace from an interrupted clone.Rationale: The wrapper owns the subprocesses and persisted workspace state, so it can observe output, cancel child processes, and write completion state at the point where bootstrap actually succeeds. Layered two-minute inactivity, five-minute command, eight-minute workspace, and ten-minute readiness limits keep each boundary finite without colliding with startup cleanup.
Alternatives considered:
.gitdirectory as warm. This avoids recloning but preserves the failure mode where partial checkouts are mistaken for complete workspaces.Consequences: Active long-running operations receive more time and interrupted new workspaces recover deterministically. Silent commands can still time out after two minutes, markerless legacy workspaces use a compatibility heuristic, and cleanup remains best-effort so the original setup failure is preserved.
Verification
Visual Changes
Reviewer Notes
Unauthorized: Invalid tokenclone failure is intentionally not addressed by this PR.