feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap)#2411
Draft
GarrettBeatty wants to merge 21 commits into
Draft
feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap)#2411GarrettBeatty wants to merge 21 commits into
GarrettBeatty wants to merge 21 commits into
Conversation
Adds parallel branch execution to the .NET Durable Execution SDK.
ParallelAsync runs N branches concurrently with configurable concurrency
limits and completion policies, returning an IBatchResult<T> with
per-branch status and error information.
Per-branch checkpoint payloads are serialized via the ILambdaSerializer
registered on ILambdaContext.Serializer (typically configured through
LambdaBootstrapBuilder.Create(handler, serializer)), matching the
StepAsync / RunInChildContextAsync pattern. There are no separate
reflection / AOT-safe overload pairs: the AOT story is determined
entirely by which serializer the user registers with the runtime.
Public surface:
- IDurableContext.ParallelAsync<T> (2 overloads: Func[] vs
DurableBranch<T>[])
- DurableBranch<T> record (Name + Func)
- ParallelConfig (MaxConcurrency, CompletionConfig, NestingType)
- CompletionConfig with factories AllSuccessful() / FirstSuccessful() /
AllCompleted(); ToleratedFailureCount / ToleratedFailurePercentage
(validated 0.0-1.0)
- IBatchResult<T> with All / Succeeded / Failed / Started accessors,
GetResults, GetErrors, ThrowIfError, HasFailure, CompletionReason,
count properties
- IBatchItem<T> with Index, Name, Status, Result, Error
- BatchItemStatus { Succeeded, Failed, Started }
- CompletionReason { AllCompleted, MinSuccessfulReached,
FailureToleranceExceeded }
- NestingType (Nested default; Flat throws NotSupportedException - reserved)
- ParallelException (carries IBatchResult; future-subclassable)
Internal:
- ParallelOperation<T> orchestrator dispatches branches with optional
semaphore-bounded concurrency. Each branch runs as a
ChildContextOperation<T> with deterministic ID via
OperationIdGenerator.CreateChild.
- Branch failures aggregated as IBatchItem<T> entries; orchestrator
throws ParallelException only when CompletionConfig signals
FailureToleranceExceeded.
- Parent CONTEXT checkpoint records summary (CompletionReason +
per-branch index/name/status); branch results live on per-branch
CONTEXT checkpoints.
- ExecutionState now thread-safe (lock around reads/writes of
_operations, _visitedOperations, _isReplaying). Required for
concurrent branch replay; affects all operations but no regressions.
- ParallelOperation awaits Task.WhenAll(inFlight) before disposing
the semaphore so cancellation/exception during dispatch lets
in-flight branches settle cleanly.
- Reuses OperationSubTypes.Parallel / OperationSubTypes.ParallelBranch
from Wave 0.
Adds 31 unit tests + 6 integration tests covering CompletionConfig
matrix, MaxConcurrency, FirstSuccessful short-circuit, replay
determinism, mixed-status replay, cancellation, and concurrency
stress.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add range validation to CompletionConfig.MinSuccessful (>= 1) and ToleratedFailureCount (>= 0), matching the existing ToleratedFailurePercentage setter. Previously zero/negative values produced nonsensical immediate short-circuits. - ReconstructFromCheckpoints now uses the branch Name persisted in the parallel summary instead of always reading the current branch name, and throws NonDeterministicExecutionException on name drift between deployments (the prior path silently ignored summaryEntry.Name). - Correct XML docs for BatchItemStatus.Started / IBatchResult.Started / CompletionConfig.FirstSuccessful: Started means a branch was not dispatched before a completion short-circuit fired (or has no checkpoint on replay), not that it is still running.
Implements IDurableContext.MapAsync, processing a collection in parallel with one child context per item. Mirrors the Python/JS/Java SDKs, where Map is a sibling of Parallel sharing one concurrency engine. - Extract ConcurrentOperation<T> base holding all orchestration, completion, checkpoint, and replay logic; ParallelOperation and MapOperation are thin subclasses supplying only the per-unit (name, func), sub-type labels, and failure-exception factory. - MapConfig defaults CompletionConfig to AllCompleted() (permissive), matching Python/Java Map; intentionally differs from ParallelConfig's AllSuccessful(). Adds ItemNamer; no ItemBatcher (not implemented in any reference SDK). - New MapException so callers can distinguish Map from Parallel failures. - Generalize ParallelSummary/ParallelJsonContext into shared BatchSummary/ BatchJsonContext. - Tests: 24 unit tests (MapOperationTests) + 6 integration functions/tests mirroring the Parallel set. Full suite 325/325 on net8.0 and net10.0.
There was a problem hiding this comment.
Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
54a24e4 to
e1db9a7
Compare
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
|
|
||
| COPY bin/publish/ ${LAMBDA_TASK_ROOT} | ||
|
|
||
| ENTRYPOINT ["/var/task/bootstrap"] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
#2216
Adds large-payload overflow handling to
Amazon.Lambda.DurableExecution, bringing the .NET SDK to parity with the Python/Java/JS SDKs. Two scenarios:ReplayChildrenmechanism. When a FLAT concurrent operation's serializedBatchSummary(or a single child-context result) exceeds the 256 KB per-operation checkpoint threshold, the SDK strips the inline per-unit results, setsOperationUpdate.ContextOptions.ReplayChildren = trueon the parent CONTEXT op, and keeps the full result in memory for the current invoke. On a later replay, the inboundContextDetails.ReplayChildrenflag routes the operation to re-execute its unit/child bodies (recovering the stripped values), while reading per-unitStatus+CompletionReasonfrom the frozen summary and never re-checkpointing the already-terminal parent.MaxBatchBytes(~750 KB) service request limit is now enforced: the worker pre-flushes before an item would push the batch over the byte (or count) cap, and sends a lone oversized item by itself. This is the scenario that map/parallel fan-out actually fills.Out of scope (deferred): final Lambda response > 6 MB. That path depends on the service accepting a function-emitted
EXECUTION SUCCEEDcheckpoint, which needs confirmation with the service team; it will be a follow-up PR.Reference-SDK parity (the key design decision)
On overflow replay the implementation re-executes only the units the frozen summary recorded as
Succeeded/Failed, and skips units recordedStarted(short-circuited, never dispatched) — their bodies are not re-run, so no spurious side effects. This matches Python (replay()re-runs only succeeded children), Java (stripMapResultkeeps the status map + pins the reason viaExpectedCompletionStatus), and JS (skips incomplete items on replay). Per-unit status and completion reason are authoritative from the frozen summary; the unit-name drift check is preserved.Changed source files
Internal/DurableConstants.cs(new) —MaxOperationCheckpointBytes = 256 KB(+ reservedMaxLambdaResponseBytesfor the deferred scenario).Operation.cs+Services/LambdaDurableServiceClient.cs— inboundContextDetails.ReplayChildrenfield + mapping.Internal/ConcurrentOperation.cs— overflow strip-on-checkpoint;ReplayChildrenAsyncre-execution path;RunSingleUnitAsyncextracted fromRunUnitAsync(shared, no dispatch duplication).Internal/ChildContextOperation.cs— single-child overflow;_suppressTerminalCheckpointsuppresses both SUCCEED and FAIL re-emit on overflow replay.Internal/CheckpointBatcher.cs+CheckpointBatcherConfig.cs—EstimateUpdateBytes,PendingBatch, byte-cap pre-flush.All new JSON serialization reuses the source-generated
BatchJsonContext; no reflection-based serialization is introduced (project builds withTreatWarningsAsErrors+IL2026/IL2067/IL2075/IL3050as errors).Test Plan
Amazon.Lambda.DurableExecution.Tests). New coverage: emit-side stripping + flag on both overflow sites, replay re-execution recovering values, STARTED-unit skip, failed-unit error recovery, no-re-checkpoint assertions, inboundReplayChildrenmapping round-trip, batcher byte-cap split + lone-oversized-item.IL2026/IL3050warnings.ParallelFlatOverflowTest) — authored; deploys a Dockerized durable Lambda that overflows a FLAT parallel and forces a 3-invoke sequence exercisingReplayChildrenAsyncend-to-end against the live durable-execution service. To be run live (requires AWS credentials + Docker).🤖 Draft — opening for early review while the integration test is run live.