Skip to content

feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap)#2411

Draft
GarrettBeatty wants to merge 21 commits into
gcbeatty/flatfrom
gcbeatty/durable-payload-overflow
Draft

feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap)#2411
GarrettBeatty wants to merge 21 commits into
gcbeatty/flatfrom
gcbeatty/durable-payload-overflow

Conversation

@GarrettBeatty

@GarrettBeatty GarrettBeatty commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

#2216

Adds large-payload overflow handling to Amazon.Lambda.DurableExecution, bringing the .NET SDK to parity with the Python/Java/JS SDKs. Two scenarios:

  • Batch (Map/Parallel/ChildContext) results > 256 KB — via the ReplayChildren mechanism. When a FLAT concurrent operation's serialized BatchSummary (or a single child-context result) exceeds the 256 KB per-operation checkpoint threshold, the SDK strips the inline per-unit results, sets OperationUpdate.ContextOptions.ReplayChildren = true on the parent CONTEXT op, and keeps the full result in memory for the current invoke. On a later replay, the inbound ContextDetails.ReplayChildren flag routes the operation to re-execute its unit/child bodies (recovering the stripped values), while reading per-unit Status + CompletionReason from the frozen summary and never re-checkpointing the already-terminal parent.
  • CheckpointBatcher byte enforcement — the previously-unenforced MaxBatchBytes (~750 KB) service request limit is now enforced: the worker pre-flushes before an item would push the batch over the byte (or count) cap, and sends a lone oversized item by itself. This is the scenario that map/parallel fan-out actually fills.

Out of scope (deferred): final Lambda response > 6 MB. That path depends on the service accepting a function-emitted EXECUTION SUCCEED checkpoint, which needs confirmation with the service team; it will be a follow-up PR.

Reference-SDK parity (the key design decision)

On overflow replay the implementation re-executes only the units the frozen summary recorded as Succeeded/Failed, and skips units recorded Started (short-circuited, never dispatched) — their bodies are not re-run, so no spurious side effects. This matches Python (replay() re-runs only succeeded children), Java (stripMapResult keeps the status map + pins the reason via ExpectedCompletionStatus), and JS (skips incomplete items on replay). Per-unit status and completion reason are authoritative from the frozen summary; the unit-name drift check is preserved.

Changed source files

  • Internal/DurableConstants.cs (new) — MaxOperationCheckpointBytes = 256 KB (+ reserved MaxLambdaResponseBytes for the deferred scenario).
  • Operation.cs + Services/LambdaDurableServiceClient.cs — inbound ContextDetails.ReplayChildren field + mapping.
  • Internal/ConcurrentOperation.cs — overflow strip-on-checkpoint; ReplayChildrenAsync re-execution path; RunSingleUnitAsync extracted from RunUnitAsync (shared, no dispatch duplication).
  • Internal/ChildContextOperation.cs — single-child overflow; _suppressTerminalCheckpoint suppresses both SUCCEED and FAIL re-emit on overflow replay.
  • Internal/CheckpointBatcher.cs + CheckpointBatcherConfig.csEstimateUpdateBytes, PendingBatch, byte-cap pre-flush.

All new JSON serialization reuses the source-generated BatchJsonContext; no reflection-based serialization is introduced (project builds with TreatWarningsAsErrors + IL2026/IL2067/IL2075/IL3050 as errors).

Test Plan

  • Unit tests pass: 341/341 on net8.0 and net10.0 (Amazon.Lambda.DurableExecution.Tests). New coverage: emit-side stripping + flag on both overflow sites, replay re-execution recovering values, STARTED-unit skip, failed-unit error recovery, no-re-checkpoint assertions, inbound ReplayChildren mapping round-trip, batcher byte-cap split + lone-oversized-item.
  • Release build clean (0 warnings) with warnings-as-errors.
  • AOT publish: trim/AOT analysis passes with no IL2026/IL3050 warnings.
  • Integration test (ParallelFlatOverflowTest) — authored; deploys a Dockerized durable Lambda that overflows a FLAT parallel and forces a 3-invoke sequence exercising ReplayChildrenAsync end-to-end against the live durable-execution service. To be run live (requires AWS credentials + Docker).

🤖 Draft — opening for early review while the integration test is run live.

GarrettBeatty and others added 21 commits June 5, 2026 11:59
Adds parallel branch execution to the .NET Durable Execution SDK.
ParallelAsync runs N branches concurrently with configurable concurrency
limits and completion policies, returning an IBatchResult<T> with
per-branch status and error information.

Per-branch checkpoint payloads are serialized via the ILambdaSerializer
registered on ILambdaContext.Serializer (typically configured through
LambdaBootstrapBuilder.Create(handler, serializer)), matching the
StepAsync / RunInChildContextAsync pattern. There are no separate
reflection / AOT-safe overload pairs: the AOT story is determined
entirely by which serializer the user registers with the runtime.

Public surface:
- IDurableContext.ParallelAsync<T> (2 overloads: Func[] vs
  DurableBranch<T>[])
- DurableBranch<T> record (Name + Func)
- ParallelConfig (MaxConcurrency, CompletionConfig, NestingType)
- CompletionConfig with factories AllSuccessful() / FirstSuccessful() /
  AllCompleted(); ToleratedFailureCount / ToleratedFailurePercentage
  (validated 0.0-1.0)
- IBatchResult<T> with All / Succeeded / Failed / Started accessors,
  GetResults, GetErrors, ThrowIfError, HasFailure, CompletionReason,
  count properties
- IBatchItem<T> with Index, Name, Status, Result, Error
- BatchItemStatus { Succeeded, Failed, Started }
- CompletionReason { AllCompleted, MinSuccessfulReached,
  FailureToleranceExceeded }
- NestingType (Nested default; Flat throws NotSupportedException - reserved)
- ParallelException (carries IBatchResult; future-subclassable)

Internal:
- ParallelOperation<T> orchestrator dispatches branches with optional
  semaphore-bounded concurrency. Each branch runs as a
  ChildContextOperation<T> with deterministic ID via
  OperationIdGenerator.CreateChild.
- Branch failures aggregated as IBatchItem<T> entries; orchestrator
  throws ParallelException only when CompletionConfig signals
  FailureToleranceExceeded.
- Parent CONTEXT checkpoint records summary (CompletionReason +
  per-branch index/name/status); branch results live on per-branch
  CONTEXT checkpoints.
- ExecutionState now thread-safe (lock around reads/writes of
  _operations, _visitedOperations, _isReplaying). Required for
  concurrent branch replay; affects all operations but no regressions.
- ParallelOperation awaits Task.WhenAll(inFlight) before disposing
  the semaphore so cancellation/exception during dispatch lets
  in-flight branches settle cleanly.
- Reuses OperationSubTypes.Parallel / OperationSubTypes.ParallelBranch
  from Wave 0.

Adds 31 unit tests + 6 integration tests covering CompletionConfig
matrix, MaxConcurrency, FirstSuccessful short-circuit, replay
determinism, mixed-status replay, cancellation, and concurrency
stress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add range validation to CompletionConfig.MinSuccessful (>= 1) and
  ToleratedFailureCount (>= 0), matching the existing
  ToleratedFailurePercentage setter. Previously zero/negative values
  produced nonsensical immediate short-circuits.
- ReconstructFromCheckpoints now uses the branch Name persisted in the
  parallel summary instead of always reading the current branch name,
  and throws NonDeterministicExecutionException on name drift between
  deployments (the prior path silently ignored summaryEntry.Name).
- Correct XML docs for BatchItemStatus.Started / IBatchResult.Started /
  CompletionConfig.FirstSuccessful: Started means a branch was not
  dispatched before a completion short-circuit fired (or has no
  checkpoint on replay), not that it is still running.
Implements IDurableContext.MapAsync, processing a collection in parallel
with one child context per item. Mirrors the Python/JS/Java SDKs, where
Map is a sibling of Parallel sharing one concurrency engine.

- Extract ConcurrentOperation<T> base holding all orchestration, completion,
  checkpoint, and replay logic; ParallelOperation and MapOperation are thin
  subclasses supplying only the per-unit (name, func), sub-type labels, and
  failure-exception factory.
- MapConfig defaults CompletionConfig to AllCompleted() (permissive), matching
  Python/Java Map; intentionally differs from ParallelConfig's AllSuccessful().
  Adds ItemNamer; no ItemBatcher (not implemented in any reference SDK).
- New MapException so callers can distinguish Map from Parallel failures.
- Generalize ParallelSummary/ParallelJsonContext into shared BatchSummary/
  BatchJsonContext.
- Tests: 24 unit tests (MapOperationTests) + 6 integration functions/tests
  mirroring the Parallel set. Full suite 325/325 on net8.0 and net10.0.

@github-advanced-security github-advanced-security AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@GarrettBeatty GarrettBeatty changed the base branch from master to gcbeatty/durable-parallel June 8, 2026 17:32
@GarrettBeatty GarrettBeatty changed the base branch from gcbeatty/durable-parallel to gcbeatty/flat June 8, 2026 17:32

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants