Skip to content

[improve](streaming-job) add from-to cdc WAL-search timeout and stale-reader release#64013

Open
JNSimba wants to merge 9 commits into
apache:masterfrom
JNSimba:feat/streaming-cdc-stuck-task-recovery
Open

[improve](streaming-job) add from-to cdc WAL-search timeout and stale-reader release#64013
JNSimba wants to merge 9 commits into
apache:masterfrom
JNSimba:feat/streaming-cdc-stuck-task-recovery

Conversation

@JNSimba
Copy link
Copy Markdown
Member

@JNSimba JNSimba commented Jun 2, 2026

Proposed changes

Three reliability/observability fixes for from-to (at-least-once) CDC streaming tasks.

  1. Startup timeout. A from-to binlog task whose upstream is idle could block
    indefinitely in the replication startup/locate phase (no first message arrives,
    so the poll loop never times out). This adds a setup-phase timeout — half of the
    FE task timeout, passed down via WriteRecordRequest.taskTimeoutMs — so the task
    exits and commits the current offset gracefully instead of hanging. Snapshot
    splits are explicitly excluded so an incomplete watermark is never committed.

  2. Release a stale reader on failure, guarded by task ownership. On task
    onFail/cancel, FE makes a best-effort request (/api/releaseReader/{taskId})
    asking the previous backend to stop its reader while keeping the replication slot,
    so a reschedule to another backend does not leave two readers competing for the
    same slot. The reader cache is keyed by job id and reused across tasks, so the
    release carries the failing task id and the backend releases only if that task
    still owns the reader; a stale/late RPC becomes a no-op and cannot interrupt a
    replacement task that reused the same reader. The RPC is fire-and-forget so it
    never blocks while the job lock is held.

  3. Surface the first rejected-row detail in the task error. When a stream load
    fails with a data-quality error, the cdc_client now parses the FirstErrorMsg
    field from the response and appends it to the task error reported to FE, so the
    job errorMsg shows the actual offending row instead of only an ErrorURL.

Known limitation: the release is best-effort, so a reschedule may briefly observe
"replication slot is active"; this self-heals via task retry or the source-side
sender timeout.

Further comments

Scoped to the from-to streaming path; snapshot and TVF paths are unaffected.

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba JNSimba marked this pull request as ready for review June 2, 2026 08:20
@JNSimba JNSimba force-pushed the feat/streaming-cdc-stuck-task-recovery branch from b647f18 to 139ec20 Compare June 2, 2026 08:29
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 2, 2026

/review

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 2, 2026

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves reliability of from-to (at-least-once) CDC streaming tasks by (1) adding a setup-phase timeout for WAL/binlog search during startup to prevent indefinite hangs when upstream is idle, and (2) introducing a best-effort mechanism for FE to ask the previously scheduled backend to release a stale reader on failure/cancel so reschedules don’t leave two readers competing for the same upstream state (e.g., a Postgres replication slot).

Changes:

  • Add taskTimeoutMs to WriteRecordRequest and enforce a setup-phase “WAL-search/idle” timeout in cdc_client polling (snapshot splits excluded).
  • Add a /api/releaseReader endpoint on cdc_client and an FE best-effort fire-and-forget RPC from StreamingMultiTblTask on fail/cancel to release the previous backend’s reader while keeping upstream slot state.
  • Add targeted tests to ensure release() remains the base implementation (to avoid dropping slots) and that Env.getReaderIfPresent() is a non-creating lookup.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
fs_brokers/cdc_client/src/test/java/org/apache/doris/cdcclient/source/reader/AbstractCdcSourceReaderTest.java Adds a reflection-based test ensuring release() is not overridden by specific readers (slot-preserving behavior).
fs_brokers/cdc_client/src/test/java/org/apache/doris/cdcclient/common/EnvTest.java Adds a test for the new non-creating reader lookup method.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/reader/SourceReader.java Extends the reader contract with release(JobBaseConfig) for slot-preserving shutdown.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/reader/AbstractCdcSourceReader.java Implements a default release() that stops reading without invoking close-time cleanup that could drop upstream artifacts.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/service/PipelineCoordinator.java Adds a setup-phase timeout (based on FE task timeout) to avoid indefinite startup hangs in streaming splits.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/controller/ClientController.java Adds /api/releaseReader endpoint to stop a reader engine while keeping upstream slot state.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/common/Env.java Adds getReaderIfPresent(jobId) to avoid accidentally creating readers during release attempts.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingMultiTblTask.java Passes taskTimeoutMs to BE; adds best-effort stale-reader release RPC on fail/cancel.
fe/fe-common/src/main/java/org/apache/doris/job/cdc/request/WriteRecordRequest.java Adds taskTimeoutMs field used by BE for setup timeout decisions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: request changes. I found one blocking lifecycle issue in the new stale-reader release path.

Critical checkpoint conclusions:

  • Goal/test: The PR targets CDC from-to startup timeout, stale reader release, and stream-load first error surfacing. Tests cover some helper behavior and JSON parsing, but the stale-release race is not covered.
  • Scope/focus: The change is mostly focused, but the new release endpoint lacks task/generation identity and can affect a different task instance.
  • Concurrency: This PR explicitly involves asynchronous FE failure/cancel and CDC-client async writer threads. The new release RPC can race with a newer writer for the same job on the same backend; see inline comment.
  • Lifecycle: Reader lifecycle is non-intuitive because Env stores readers by jobId while tasks are recreated. The release lifecycle is therefore not safe without task identity.
  • Config: No new dynamic config was added; taskTimeoutMs is derived from existing FE config.
  • Compatibility: Adding taskTimeoutMs to the JSON request is backward-compatible for default zero behavior.
  • Parallel paths: Snapshot splits are intentionally excluded from setup timeout. The release path affects both MySQL and PostgreSQL readers through the shared base implementation.
  • Conditional checks: The setup timeout condition is understandable; the release endpoint needs an additional stale-task guard.
  • Test coverage: Existing tests do not cover delayed release arriving after a newer task starts, which is the blocking scenario.
  • Observability: Logs exist for release attempts, but they do not include a task/generation id, which also contributes to diagnosing the race.
  • Transactions/persistence/data writes: Offset commit still relies on task id validation in FE, but the release race can interrupt the active reader/streamload before commit.
  • FE-BE variables: The release request does not pass task identity, which is the root of the issue.

User focus: No additional user-provided review focus was specified.

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 2, 2026

/review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking race in the task ownership guard. The PR goal is clear and mostly focused: add setup timeout, best-effort stale reader release, and better stream-load error detail. The FirstErrorMsg plumbing and timeout path are covered by tests, but the stale-release ownership protection still has a concurrency gap.

Critical checkpoint conclusions:

  • Goal/test coverage: The intended behavior is covered partially, but the replacement-reader race below is not covered and can still interrupt a valid replacement task.
  • Scope/focus: The change is reasonably small and scoped to streaming CDC.
  • Concurrency: The PR explicitly changes concurrent FE timeout/cancel RPCs and async cdc_client write tasks; the owner update is not atomic with reader acquisition, leaving a race.
  • Lifecycle: Reader and stream-load instances are shared per job and closed by releaseReader; this lifecycle must be protected by the same generation/ownership transition.
  • Config: No new config items. Existing dynamic timeout multiplier is still read live on FE; the per-request value is fixed for cdc_client execution.
  • Compatibility: Adding an optional request field is compatible with default 0 behavior; the new release API path is only used by the new FE.
  • Parallel paths: Snapshot path is intentionally excluded from the setup timeout; close path is unchanged.
  • Conditional checks: The owner check is necessary but currently not sufficient because owner assignment is separate from getReader().
  • Observability: Logs include job/task/backend identifiers for new release/timeout paths.
  • Data/transaction correctness: No Doris table transaction path changes, but CDC offset progress depends on not killing the replacement reader; the race can cause repeated task failure and slot contention.

User focus: No additional user-provided review focus was specified.

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 2, 2026

/review

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 2, 2026

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test: The PR improves CDC reader release on failure/cancel, avoids stale releases by task owner, adds a setup-phase timeout, and surfaces FirstErrorMsg. The stale-release owner checks now address the already-known jobId-only issue, and tests cover parsing and some Env no-op cases, but the core handoff behavior still has an uncovered lifecycle race.
  • Scope/focus: The change is mostly focused on streaming CDC task failure/cancel handling and error reporting.
  • Concurrency: There is a remaining concurrency/lifecycle issue: FE sends releaseReader asynchronously and immediately continues failure/cancel processing, so a restarted/resumed task can start before the old backend has actually stopped the CDC engine. This defeats the stated guarantee that reschedule never leaves two readers competing for the same source.
  • Lifecycle: Reader ownership is now tracked per task on the backend, but valid-owner release completion is not synchronized with FE task lifecycle.
  • Configuration: No new config item was added; existing timeout multiplier remains dynamically read.
  • Compatibility: The new request field is additive and defaults to 0 on older senders, disabling only the new backend setup timeout.
  • Parallel paths: Close/release and writeRecords paths were reviewed; the issue is in failure/cancel release ordering, not the already-commented stale release guard.
  • Tests: Added unit/regression coverage does not exercise the asynchronous release vs new task start race.
  • Observability: Logs were added and are useful, but they do not prevent the race.
  • Transaction/persistence/data writes: No direct storage-format or EditLog change was found in this PR.
  • Performance: No blocking hot-path performance issue found, but the non-blocking release is currently used where correctness depends on the release having completed.

User focus: No additional user-provided focus was specified.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29286 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 03b53d22808358ec2d00a2a89efa09610b7936a1, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17798	4125	4117	4117
q2	q3	10770	1345	861	861
q4	4687	475	347	347
q5	7543	886	602	602
q6	189	177	138	138
q7	764	839	644	644
q8	9387	1557	1612	1557
q9	5906	4579	4530	4530
q10	6758	1803	1511	1511
q11	441	277	250	250
q12	628	426	292	292
q13	18118	3345	2752	2752
q14	275	261	244	244
q15	q16	820	770	708	708
q17	969	964	890	890
q18	6942	5680	5539	5539
q19	1308	1257	1029	1029
q20	536	428	268	268
q21	6516	2789	2681	2681
q22	486	394	326	326
Total cold run time: 100841 ms
Total hot run time: 29286 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5193	4762	4928	4762
q2	q3	4910	5270	4668	4668
q4	2064	2206	1440	1440
q5	4821	4829	4676	4676
q6	234	176	128	128
q7	1901	1804	1582	1582
q8	2412	2206	2196	2196
q9	8070	7877	7388	7388
q10	4711	4682	4232	4232
q11	542	396	381	381
q12	725	753	533	533
q13	3029	3374	2816	2816
q14	269	281	261	261
q15	q16	678	706	610	610
q17	1304	1291	1278	1278
q18	7112	6788	6761	6761
q19	1147	1123	1089	1089
q20	2219	2226	1927	1927
q21	5311	4652	4607	4607
q22	516	463	427	427
Total cold run time: 57168 ms
Total hot run time: 51762 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169509 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 03b53d22808358ec2d00a2a89efa09610b7936a1, data reload: false

query5	4348	651	478	478
query6	446	198	186	186
query7	4836	591	312	312
query8	372	214	211	211
query9	8747	4099	4065	4065
query10	451	316	268	268
query11	5904	2386	2191	2191
query12	157	103	98	98
query13	1254	599	423	423
query14	6404	5430	5140	5140
query14_1	4455	4386	4425	4386
query15	209	195	174	174
query16	1024	449	437	437
query17	1104	702	599	599
query18	2514	453	342	342
query19	203	181	146	146
query20	119	116	105	105
query21	216	138	112	112
query22	13901	13704	13485	13485
query23	17220	16406	16190	16190
query23_1	16251	16388	16319	16319
query24	7545	1761	1302	1302
query24_1	1302	1326	1293	1293
query25	547	436	377	377
query26	1310	308	172	172
query27	2670	567	340	340
query28	4511	2073	2042	2042
query29	1093	641	499	499
query30	314	251	205	205
query31	1142	1095	959	959
query32	136	65	65	65
query33	537	356	263	263
query34	1190	1183	653	653
query35	764	804	695	695
query36	1409	1394	1196	1196
query37	160	110	97	97
query38	3230	3160	3065	3065
query39	962	915	883	883
query39_1	889	884	883	883
query40	232	128	110	110
query41	73	68	68	68
query42	95	100	103	100
query43	321	322	281	281
query44	
query45	203	192	182	182
query46	1087	1237	745	745
query47	2339	2382	2252	2252
query48	412	443	284	284
query49	649	475	383	383
query50	971	369	268	268
query51	4351	4313	4210	4210
query52	89	92	80	80
query53	251	275	192	192
query54	282	238	215	215
query55	80	79	72	72
query56	267	237	245	237
query57	1442	1390	1316	1316
query58	263	233	230	230
query59	1603	1636	1434	1434
query60	306	275	243	243
query61	189	183	180	180
query62	693	653	584	584
query63	235	187	192	187
query64	2619	795	620	620
query65	
query66	1770	455	355	355
query67	29744	29722	29572	29572
query68	
query69	430	318	263	263
query70	994	953	1015	953
query71	309	236	212	212
query72	2958	2695	2389	2389
query73	884	810	438	438
query74	5127	4953	4758	4758
query75	2668	2573	2225	2225
query76	2348	1171	762	762
query77	353	393	284	284
query78	12473	12431	11932	11932
query79	1450	1049	813	813
query80	908	471	386	386
query81	502	284	243	243
query82	775	157	125	125
query83	355	276	255	255
query84	260	141	112	112
query85	949	542	448	448
query86	429	320	290	290
query87	3403	3309	3232	3232
query88	3661	2776	2752	2752
query89	434	371	332	332
query90	1800	190	182	182
query91	178	169	138	138
query92	63	61	55	55
query93	1533	1521	840	840
query94	649	361	323	323
query95	678	392	440	392
query96	1006	835	325	325
query97	2671	2686	2556	2556
query98	215	205	204	204
query99	1152	1172	1033	1033
Total cold run time: 252435 ms
Total hot run time: 169509 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29222 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a3dd82f8b70540237a905718ed7371dd5bbc8625, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17613	4022	3999	3999
q2	q3	10766	1398	795	795
q4	4685	481	343	343
q5	7546	864	613	613
q6	182	177	140	140
q7	781	855	628	628
q8	9395	1533	1484	1484
q9	5960	4470	4488	4470
q10	6744	1808	1508	1508
q11	425	275	247	247
q12	638	432	291	291
q13	18126	3328	2742	2742
q14	272	256	245	245
q15	q16	818	772	714	714
q17	1002	911	925	911
q18	7034	5795	5658	5658
q19	1320	1212	1103	1103
q20	536	391	263	263
q21	6426	2845	2755	2755
q22	460	372	313	313
Total cold run time: 100729 ms
Total hot run time: 29222 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5142	4751	4827	4751
q2	q3	4855	5236	4737	4737
q4	2108	2188	1372	1372
q5	4744	4802	4595	4595
q6	231	182	139	139
q7	1863	1865	1509	1509
q8	2407	2072	2099	2072
q9	8035	7491	7454	7454
q10	4705	4683	4166	4166
q11	522	386	348	348
q12	725	736	524	524
q13	2934	3342	2803	2803
q14	274	279	267	267
q15	q16	674	692	616	616
q17	1293	1255	1249	1249
q18	7519	6865	6853	6853
q19	1086	1122	1079	1079
q20	2210	2196	1947	1947
q21	5246	4644	4368	4368
q22	517	459	410	410
Total cold run time: 57090 ms
Total hot run time: 51259 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169585 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a3dd82f8b70540237a905718ed7371dd5bbc8625, data reload: false

query5	4341	628	484	484
query6	450	205	188	188
query7	4868	539	303	303
query8	370	221	203	203
query9	8749	4083	4065	4065
query10	473	314	265	265
query11	5924	2344	2125	2125
query12	157	103	100	100
query13	1252	629	451	451
query14	6391	5450	5080	5080
query14_1	4417	4453	4406	4406
query15	212	198	176	176
query16	1059	448	463	448
query17	1128	737	601	601
query18	2560	482	349	349
query19	213	187	146	146
query20	116	112	107	107
query21	218	136	119	119
query22	13696	13611	13471	13471
query23	17355	16470	16166	16166
query23_1	16244	16280	16187	16187
query24	7514	1800	1309	1309
query24_1	1333	1323	1324	1323
query25	608	491	414	414
query26	1311	325	169	169
query27	2672	564	338	338
query28	4470	2037	2001	2001
query29	1089	624	489	489
query30	310	240	199	199
query31	1128	1086	946	946
query32	115	67	61	61
query33	547	325	281	281
query34	1172	1159	683	683
query35	752	772	678	678
query36	1338	1368	1232	1232
query37	153	105	83	83
query38	3228	3131	3063	3063
query39	932	924	900	900
query39_1	879	878	875	875
query40	217	123	101	101
query41	64	60	61	60
query42	99	93	94	93
query43	330	332	284	284
query44	
query45	197	187	178	178
query46	1060	1181	712	712
query47	2308	2447	2196	2196
query48	377	413	296	296
query49	634	472	352	352
query50	978	350	263	263
query51	4279	4274	4315	4274
query52	89	86	76	76
query53	244	275	195	195
query54	275	218	201	201
query55	77	74	69	69
query56	270	222	224	222
query57	1439	1371	1307	1307
query58	255	219	206	206
query59	1621	1666	1419	1419
query60	291	249	270	249
query61	158	146	154	146
query62	702	644	597	597
query63	233	184	184	184
query64	2577	778	619	619
query65	
query66	1782	468	359	359
query67	29736	29730	29598	29598
query68	
query69	422	308	263	263
query70	966	963	952	952
query71	299	218	215	215
query72	2942	2754	2376	2376
query73	835	796	438	438
query74	5162	4975	4800	4800
query75	2661	2554	2263	2263
query76	2305	1148	758	758
query77	358	379	280	280
query78	12490	12432	11857	11857
query79	1383	1052	779	779
query80	578	472	395	395
query81	454	274	240	240
query82	570	152	119	119
query83	355	270	248	248
query84	272	144	115	115
query85	914	524	434	434
query86	369	298	305	298
query87	3364	3385	3201	3201
query88	3624	2728	2720	2720
query89	433	378	326	326
query90	1988	186	190	186
query91	179	166	136	136
query92	66	62	59	59
query93	1520	1461	881	881
query94	550	359	322	322
query95	683	471	351	351
query96	1060	818	336	336
query97	2693	2696	2591	2591
query98	214	212	206	206
query99	1155	1169	1004	1004
Total cold run time: 251389 ms
Total hot run time: 169585 ms

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 3, 2026

/review

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 3, 2026

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional correctness issue distinct from the existing reader-ownership and fire-and-forget threads. The PR goal is clear, but the current implementation still does not fully guarantee safe handoff between CDC tasks because release cleanup can close a replacement task's job-level stream-load state.

Critical checkpoint conclusions:

  • Goal/test: The startup timeout and FirstErrorMsg surfacing are covered by targeted tests, but stale-reader release handoff remains insufficiently tested for concurrent release vs replacement task startup.
  • Scope/focus: The code is mostly focused on from-to CDC streaming, but release now affects shared per-job stream-load lifecycle.
  • Concurrency: This PR is concurrency-sensitive. Reader ownership is guarded, but batchStreamLoadMap remains keyed only by job id and is closed without task/generation validation. Existing inline threads already cover the fire-and-forget release and acquire/claim reader races; the new inline comment covers a separate stream-load race after reader detach.
  • Lifecycle: Reader release removes the reader context but leaves stream-load cleanup to a separate job-id-only map, creating a lifecycle mismatch between old and replacement tasks.
  • Compatibility/config: taskTimeoutMs is additive in the JSON request path; no storage-format compatibility concern found.
  • Parallel paths: Snapshot/TVF paths appear intentionally excluded from the new release timeout behavior.
  • Tests: Unit/regression coverage exists for error-message parsing and simple Env no-op cases, but there is no test for release racing a replacement task or for ownership of stream-load cleanup.
  • Observability: Logs include job/task identifiers for new release paths.
  • Transaction/persistence/data correctness: The stream-load cleanup race can interrupt active writes for the replacement task, so correctness is not yet acceptable.

User focus: No additional user-provided review focus was specified.

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 3, 2026

run buildall

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 3, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found no additional distinct inline comments beyond the existing review threads, but I cannot approve the PR while the previously raised release-handoff correctness issue remains unresolved. In the current head, StreamingMultiTblTask.releaseRemoteReader() still sends /api/releaseReader/{taskId} fire-and-forget and immediately continues onFail()/reschedule handling, so the next task can start on another backend before the old backend has actually stopped its CDC reader. The PR body calls this a known limitation, but for the from-to CDC path it still allows competing readers for the same source/replication slot until retry/self-heal, which is the failure mode this change is intended to prevent. Please resolve the existing inline thread at fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingMultiTblTask.java rather than adding a duplicate comment.

Review focus: no additional user-provided focus points.

Critical checkpoint conclusions:

  • Goal/test: The PR addresses setup timeout, stale-reader release, and surfacing FirstErrorMsg; tests cover response parsing and some Env no-op cases, but the release handoff correctness remains unproven and currently incomplete.
  • Scope: The code is mostly focused on the streaming CDC path.
  • Concurrency: This PR is concurrency-sensitive. The reader claim/detach race and stream-load cleanup race from prior threads appear addressed in the latest head, but the asynchronous FE-to-BE release before reschedule is still a correctness gap.
  • Lifecycle: Reader release intentionally keeps source-side slot state and detaches the backend context; normal /api/close remains responsible for final cleanup. No additional lifecycle issue found beyond the handoff timing.
  • Configuration: No new configuration items. taskTimeoutMs is derived from existing config.
  • Compatibility/protocol: FE sends a new taskTimeoutMs request field and a new cdc_client endpoint path; no persisted format change found.
  • Parallel paths: Scoped to StreamingMultiTblTask/from-to CDC as described; snapshot timeout is explicitly excluded.
  • Conditional checks: The setup-timeout condition is limited to binlog/stream split before first record; offset extraction uses the starting offset, so I did not find a data-loss issue there.
  • Test coverage/results: Unit/regression coverage was added for FirstErrorMsg and limited Env behavior, but there is no test/mechanism proving that reschedule cannot overlap with the previous reader.
  • Observability: Added logs are sufficient for the new paths.
  • Transactions/persistence/data writes: No direct transaction metadata change; late commit remains guarded by task id, but overlapping source readers are still possible until the release completes.
  • FE/BE variables: taskTimeoutMs is passed on the existing request object and used by cdc_client.
  • Performance: No new obvious hot-path performance issue found in the reviewed scope.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/27) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Jun 3, 2026

run nonConcurrent

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/149) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29391 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c0a605468c83a94eee03c91de071452ec66a7bee, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17627	4151	4074	4074
q2	q3	10764	1425	838	838
q4	4731	473	360	360
q5	7995	873	597	597
q6	267	181	142	142
q7	864	837	646	646
q8	10753	1751	1664	1664
q9	7201	4560	4619	4560
q10	6845	1841	1534	1534
q11	442	274	256	256
q12	652	435	312	312
q13	18215	3490	2798	2798
q14	269	262	234	234
q15	q16	826	789	705	705
q17	1014	940	993	940
q18	7068	5863	5582	5582
q19	1163	1325	1018	1018
q20	502	419	257	257
q21	5735	2629	2566	2566
q22	421	359	308	308
Total cold run time: 103354 ms
Total hot run time: 29391 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4467	4490	4405	4405
q2	q3	4574	4957	4318	4318
q4	2205	2238	1424	1424
q5	4502	4353	4670	4353
q6	261	213	160	160
q7	2028	1888	1679	1679
q8	2529	2303	2237	2237
q9	8075	8037	8041	8037
q10	4836	4906	4305	4305
q11	605	405	396	396
q12	778	753	539	539
q13	3437	3715	3030	3030
q14	300	317	290	290
q15	q16	732	782	657	657
q17	1376	1414	1374	1374
q18	8024	7150	6848	6848
q19	1140	1117	1156	1117
q20	2250	2242	1940	1940
q21	5371	4647	4585	4585
q22	557	469	415	415
Total cold run time: 58047 ms
Total hot run time: 52109 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170169 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c0a605468c83a94eee03c91de071452ec66a7bee, data reload: false

query5	4321	635	505	505
query6	446	215	192	192
query7	4857	579	315	315
query8	369	225	219	219
query9	8748	4099	4091	4091
query10	450	313	269	269
query11	5802	2355	2175	2175
query12	156	99	98	98
query13	1263	625	466	466
query14	6400	5449	5103	5103
query14_1	4521	4440	4475	4440
query15	212	200	177	177
query16	1014	465	479	465
query17	1149	750	622	622
query18	2628	513	367	367
query19	210	195	152	152
query20	117	114	123	114
query21	217	147	123	123
query22	13716	13695	13291	13291
query23	17382	16602	16189	16189
query23_1	16341	16396	16344	16344
query24	7520	1815	1302	1302
query24_1	1308	1326	1315	1315
query25	579	461	382	382
query26	1291	307	169	169
query27	2672	597	336	336
query28	4433	2042	2041	2041
query29	1065	615	484	484
query30	314	236	197	197
query31	1123	1084	958	958
query32	107	61	56	56
query33	518	319	254	254
query34	1195	1187	660	660
query35	770	797	682	682
query36	1385	1345	1281	1281
query37	169	109	94	94
query38	3222	3170	3055	3055
query39	936	925	893	893
query39_1	897	877	865	865
query40	228	126	102	102
query41	65	63	61	61
query42	96	94	92	92
query43	323	328	296	296
query44	
query45	197	185	179	179
query46	1103	1209	754	754
query47	2333	2415	2233	2233
query48	381	430	324	324
query49	635	484	365	365
query50	980	375	257	257
query51	4354	4279	4268	4268
query52	87	88	77	77
query53	242	273	200	200
query54	266	221	204	204
query55	80	75	70	70
query56	244	237	233	233
query57	1431	1397	1327	1327
query58	242	221	208	208
query59	1633	1681	1409	1409
query60	285	252	249	249
query61	168	158	164	158
query62	695	659	575	575
query63	234	191	190	190
query64	2554	811	668	668
query65	
query66	1796	466	355	355
query67	29786	29707	29676	29676
query68	
query69	429	313	277	277
query70	983	989	979	979
query71	301	226	211	211
query72	3077	2804	2351	2351
query73	861	754	424	424
query74	5179	4949	4803	4803
query75	2682	2587	2239	2239
query76	2308	1168	823	823
query77	370	386	293	293
query78	12460	12380	11871	11871
query79	1326	1106	792	792
query80	556	492	389	389
query81	456	294	260	260
query82	246	161	124	124
query83	276	285	256	256
query84	265	145	113	113
query85	930	608	533	533
query86	333	310	284	284
query87	3369	3370	3235	3235
query88	3713	2815	2750	2750
query89	420	387	336	336
query90	2165	190	190	190
query91	195	183	155	155
query92	69	65	58	58
query93	1469	1439	916	916
query94	540	376	313	313
query95	697	488	379	379
query96	1018	802	369	369
query97	2692	2695	2547	2547
query98	215	217	205	205
query99	1163	1176	1055	1055
Total cold run time: 251318 ms
Total hot run time: 170169 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29177 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2dbd325203608168255a200762170df5c1dac6a9, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17645	4064	4218	4064
q2	q3	10849	1440	800	800
q4	4696	474	350	350
q5	7614	866	615	615
q6	181	177	138	138
q7	777	833	639	639
q8	9374	1693	1664	1664
q9	5866	4429	4487	4429
q10	6780	1831	1521	1521
q11	429	269	253	253
q12	638	427	310	310
q13	18204	3360	2796	2796
q14	267	258	243	243
q15	q16	797	779	711	711
q17	973	928	984	928
q18	6910	5619	5469	5469
q19	1303	1228	1141	1141
q20	516	407	267	267
q21	6043	2659	2531	2531
q22	451	357	308	308
Total cold run time: 100313 ms
Total hot run time: 29177 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4455	4314	4388	4314
q2	q3	4524	4950	4340	4340
q4	2069	2218	1389	1389
q5	4455	4317	4316	4316
q6	230	174	131	131
q7	1741	1646	2078	1646
q8	2999	2230	2371	2230
q9	8210	8401	7917	7917
q10	4791	4785	4301	4301
q11	573	419	410	410
q12	766	765	555	555
q13	3364	3663	2991	2991
q14	303	301	290	290
q15	q16	709	757	675	675
q17	1376	1334	1346	1334
q18	7992	7367	7364	7364
q19	1194	1166	1095	1095
q20	2205	2220	1946	1946
q21	5311	4592	4456	4456
q22	540	469	414	414
Total cold run time: 57807 ms
Total hot run time: 52114 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169396 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2dbd325203608168255a200762170df5c1dac6a9, data reload: false

query5	4352	649	500	500
query6	456	199	176	176
query7	4956	589	290	290
query8	369	218	201	201
query9	8774	4079	4083	4079
query10	448	312	262	262
query11	5999	2347	2180	2180
query12	159	103	111	103
query13	1283	627	442	442
query14	6415	5415	5093	5093
query14_1	4447	4486	4431	4431
query15	212	201	178	178
query16	985	456	402	402
query17	958	712	593	593
query18	2449	469	354	354
query19	210	192	145	145
query20	110	107	107	107
query21	216	142	120	120
query22	13634	13660	13464	13464
query23	17468	16506	16145	16145
query23_1	16238	16343	16338	16338
query24	7599	1810	1355	1355
query24_1	1324	1337	1345	1337
query25	579	478	408	408
query26	1299	356	179	179
query27	2678	563	344	344
query28	4513	2032	2025	2025
query29	1104	637	511	511
query30	316	236	203	203
query31	1130	1073	966	966
query32	111	63	62	62
query33	533	331	269	269
query34	1215	1154	641	641
query35	761	782	686	686
query36	1376	1455	1237	1237
query37	159	101	94	94
query38	3198	3152	3054	3054
query39	927	921	891	891
query39_1	863	867	868	867
query40	235	127	103	103
query41	63	64	61	61
query42	95	94	97	94
query43	322	327	280	280
query44	
query45	193	187	180	180
query46	1113	1224	733	733
query47	2398	2338	2213	2213
query48	407	435	310	310
query49	621	484	352	352
query50	969	369	255	255
query51	4433	4265	4259	4259
query52	87	99	80	80
query53	251	272	195	195
query54	280	219	213	213
query55	80	81	74	74
query56	235	224	223	223
query57	1432	1396	1301	1301
query58	247	208	209	208
query59	1591	1637	1418	1418
query60	278	250	228	228
query61	203	153	156	153
query62	688	648	539	539
query63	235	194	183	183
query64	2560	803	635	635
query65	
query66	1824	459	342	342
query67	29791	29791	29562	29562
query68	
query69	422	308	271	271
query70	985	943	917	917
query71	306	231	213	213
query72	3040	2673	2427	2427
query73	886	787	434	434
query74	5146	4971	4780	4780
query75	2668	2602	2240	2240
query76	2338	1182	781	781
query77	382	380	274	274
query78	12288	12448	11877	11877
query79	1386	1109	793	793
query80	587	474	403	403
query81	458	279	242	242
query82	565	158	126	126
query83	358	291	252	252
query84	314	146	113	113
query85	872	545	432	432
query86	369	297	275	275
query87	3366	3335	3185	3185
query88	3734	2807	2800	2800
query89	442	388	330	330
query90	2019	184	184	184
query91	178	172	133	133
query92	62	65	60	60
query93	1516	1402	875	875
query94	529	354	315	315
query95	702	486	353	353
query96	1079	796	379	379
query97	2683	2677	2578	2578
query98	210	207	205	205
query99	1175	1164	1036	1036
Total cold run time: 251933 ms
Total hot run time: 169396 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/886) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants