Update ocdebug to ssh #31216
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
Walkthrough
The kernel-panic recovery test now uses hypervisor-mediated two-hop SSH for post-crash verification. It builds SSH known_hosts, queries survived-node pacemaker/etcd metadata via SSH, and performs SSH-based Eventually assertions on the target node's containers and pacemaker logs.
Changes
SSH-based post-crash verification
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Caution: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (2 errors, 1 warning)
✅ Passed checks (12 passed)
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: kasturinarra
The full list of commands accepted by this bot can be found here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 427-438: The test currently only checks that hypervisor config
text exists; add an explicit fast-fail validation after
exutil.GetHypervisorConfig() that verifies required SSH fields
(sshCfg.HypervisorIP, sshCfg.SSHUser, sshCfg.PrivateKeyPath) are non-empty and
that the PrivateKeyPath is readable (e.g., os.Stat or attempt to open) before
constructing core.SSHConfig and calling core.PrepareLocalKnownHostsFile; on
validation failure call g.Skip or o.Expect/fatal with a clear message so the
test fails fast instead of later inside Eventually loops.
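As a rough illustration of the fast-fail check described above, the sketch below validates the three fields before any SSH call is attempted. The `hypervisorSSHConfig` struct and the `validateSSHConfig` helper are hypothetical stand-ins, not the actual types in `exutil` or `core`; only the field names are taken from the comment.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hypervisorSSHConfig is a hypothetical stand-in for the value returned by
// exutil.GetHypervisorConfig(); only the field names come from the review.
type hypervisorSSHConfig struct {
	HypervisorIP   string
	SSHUser        string
	PrivateKeyPath string
}

// validateSSHConfig reports every missing or unusable field in one error so
// the caller can skip or fail before entering any Eventually loop.
func validateSSHConfig(cfg hypervisorSSHConfig) error {
	var problems []string
	if cfg.HypervisorIP == "" {
		problems = append(problems, "HypervisorIP is empty")
	}
	if cfg.SSHUser == "" {
		problems = append(problems, "SSHUser is empty")
	}
	if cfg.PrivateKeyPath == "" {
		problems = append(problems, "PrivateKeyPath is empty")
	} else if f, err := os.Open(cfg.PrivateKeyPath); err != nil {
		// Opening the file checks both existence and read permission.
		problems = append(problems, fmt.Sprintf("private key not readable: %v", err))
	} else {
		f.Close()
	}
	if len(problems) > 0 {
		return fmt.Errorf("invalid hypervisor SSH config: %s", strings.Join(problems, "; "))
	}
	return nil
}

func main() {
	// Only the IP is set, so two problems are reported.
	err := validateSSHConfig(hypervisorSSHConfig{HypervisorIP: "192.0.2.10"})
	fmt.Println(err)
}
```

In the test itself, the returned error would presumably feed `g.Skip` or an `o.Expect(err).NotTo(o.HaveOccurred())` assertion, depending on whether a missing config should skip or fail the spec.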
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 081ff095-daf5-4eea-9d2f-b98cd1acf06f
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_recovery.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/extended/edge_topologies/tnf_recovery.go (1)
573-585: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Missing `sudo` for `podman` command.
Consistent with the previous comment about line 560, this `podman ps` command likely needs `sudo` when running over SSH as the `core` user to access system containers.
🐛 Proposed fix to add sudo
  prevOutput, _, err := core.ExecuteRemoteSSHCommand(targetNodeIP,
- 	"podman ps -a --format '{{.Names}}' | grep -m1 etcd-previous",
+ 	"sudo podman ps -a --format '{{.Names}}' | grep -m1 etcd-previous",
  	&hypervisorConfig, localKH, targetRemoteKH)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/edge_topologies/tnf_recovery.go` around lines 573 - 585, The SSH command invoking podman in the Eventually block uses "podman ps -a ..." without sudo; update the command string passed to core.ExecuteRemoteSSHCommand (the call in the anonymous func used by o.Eventually) to prefix podman with sudo (e.g., "sudo podman ps -a --format ... | grep -m1 etcd-previous") so the core user can list system containers; keep the rest of the call and error handling (prevOutput, targetNodeIP, hypervisorConfig, localKH, targetRemoteKH, and the expectation message referencing targetNode.Name) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 517-519: The SSH reads of /var/lib/etcd in the tnf_recovery test
are missing sudo and can fail with permission denied; update the commands passed
to core.ExecuteRemoteSSHCommand (the grep that sets bumpAmountStr and the jq
that reads revision.json) to prefix them with sudo (and keep the existing sudo
on journalctl) so the reads run as root when executed on RHCOS; ensure you
update both command strings that reference /var/lib/etcd/config.yaml and
/var/lib/etcd/revision.json where survivedNodeIP is used.
---
Duplicate comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 573-585: The SSH command invoking podman in the Eventually block
uses "podman ps -a ..." without sudo; update the command string passed to
core.ExecuteRemoteSSHCommand (the call in the anonymous func used by
o.Eventually) to prefix podman with sudo (e.g., "sudo podman ps -a --format ...
| grep -m1 etcd-previous") so the core user can list system containers; keep the
rest of the call and error handling (prevOutput, targetNodeIP, hypervisorConfig,
localKH, targetRemoteKH, and the expectation message referencing
targetNode.Name) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ba461195-f6ea-478f-8ea5-06193e78151a
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_recovery.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 573-585: The grep used in the Eventually block can match
substrings causing flaky matches; update the command passed to
core.ExecuteRemoteSSHCommand to match the container name exactly (for example
use grep -x '^etcd-previous$' or an exact-name filter) so prevOutput will only
be "etcd-previous"; update the check around prevOutput and the error message in
that block (referencing core.ExecuteRemoteSSHCommand, targetNodeIP, prevOutput,
targetNode.Name) to rely on the exact-match output.
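One way to make the match exact without relying on `grep` flags is to compare whole lines on the Go side. The sketch below is a minimal illustration under that assumption; `hasExactContainer` is a hypothetical helper, not something that exists in the test today.

```go
package main

import (
	"fmt"
	"strings"
)

// hasExactContainer reports whether any line of output, with surrounding
// whitespace trimmed, equals name exactly. Substring matches are rejected.
func hasExactContainer(output, name string) bool {
	for _, line := range strings.Split(output, "\n") {
		if strings.TrimSpace(line) == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasExactContainer("etcd\netcd-previous\n", "etcd-previous")) // true
	fmt.Println(hasExactContainer("etcd-previous-old\n", "etcd-previous"))   // false
}
```

With a helper like this, the remote command could list all container names and leave the filtering to the test, which also keeps the failure message free to print the full list.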
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 310c0f00-29a6-4623-a28a-65ac1f84b85f
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_recovery.go
|
Scheduling required tests: Scheduling tests matching the |
|
/test e2e-metal-ovn-two-node-fencing-recovery e2e-aws-ovn-fips |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-ipv6-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/79bf7300-59a2-11f1-92ea-10dc971041d4-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/52711730-59b2-11f1-84dc-0186a3c34230-0 |
|
/test e2e-metal-ovn-two-node-fencing-recovery |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-ipv6-recovery |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6bbea980-59eb-11f1-9890-9e3acead88fd-0 |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/744ff5e0-59eb-11f1-86c5-e96352f08f90-0 |
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/extended/edge_topologies/tnf_recovery.go (1)
427-429: ⚡ Quick win
Mark this spec as hypervisor-gated.
This block adds the same hard prerequisite as the other cold-boot specs here, but the `g.It(...)` title still lacks `[Requires:HypervisorSSHConfig]`. Add the marker so filtering and skip reporting stay accurate.
♻️ Suggested tweak
- g.It("should compute etcd revision bump and preserve backup container after kernel panic recovery", func() {
+ g.It("should compute etcd revision bump and preserve backup container after kernel panic recovery [Requires:HypervisorSSHConfig]", func() {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/edge_topologies/tnf_recovery.go` around lines 427 - 429, The test adds a hypervisor SSH prerequisite using exutil.HasHypervisorConfig() and g.Skip, but the g.It(...) test title is missing the gating marker; update the g.It test description to include "[Requires:HypervisorSSHConfig]" (so the test title matches the runtime skip) — locate the g.It(...) block associated with this exutil.HasHypervisorConfig()/g.Skip check and append the marker to its description string.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 496-499: The SSH helpers (ExecuteRemoteSSHCommand and
ExecuteSSHCommand) currently spawn exec.Command("ssh", ...) with no per-attempt
deadline, which can hang the test; update these helpers to time-bound every
two-hop invocation by using exec.CommandContext with a context created via
context.WithTimeout for each call and/or append safe SSH options such as "-o
BatchMode=yes" and "-o ConnectTimeout=<seconds>" (and optionally "-o
ServerAliveInterval=<sec>") to the constructed ssh arguments so nested SSH calls
also time out; ensure every call site that builds a nested ssh command
(including ExecuteRemoteSSHCommand) uses the new context-aware invocation so a
hung node returns an error quickly and allows the Eventually retry loop to
proceed.
---
Nitpick comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 427-429: The test adds a hypervisor SSH prerequisite using
exutil.HasHypervisorConfig() and g.Skip, but the g.It(...) test title is missing
the gating marker; update the g.It test description to include
"[Requires:HypervisorSSHConfig]" (so the test title matches the runtime skip) —
locate the g.It(...) block associated with this
exutil.HasHypervisorConfig()/g.Skip check and append the marker to its
description string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7d3499b5-7b1a-466a-b7f6-fea0c9f1e262
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_recovery.go
journalOutput, _, err := core.ExecuteRemoteSSHCommand(survivedNodeIP,
	fmt.Sprintf("sudo journalctl -u pacemaker --since '%s' | grep 'bump-amount' | tail -1", crashTimestamp),
	&hypervisorConfig, localKH, survivedRemoteKH)
if err != nil {
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== SSH helper timeout-related code =="
rg -n -C4 'func ExecuteSSHCommand|func ExecuteRemoteSSHCommand|ConnectTimeout|BatchMode|ServerAliveInterval|ServerAliveCountMax|CommandContext|WithTimeout|context\.' test/extended/edge_topologies/utils/core/ssh.go
echo
echo "== SSH-based recovery call sites =="
sed -n '493,594p' test/extended/edge_topologies/tnf_recovery.go | awk '{print NR+492 ":" $0}'
Repository: openshift/origin
Length of output: 6907
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== ssh.go: ExecuteSSHCommand (key options) =="
sed -n '120,240p' test/extended/edge_topologies/utils/core/ssh.go
echo
echo "== ssh.go: ExecuteRemoteSSHCommand (nestedSSHCommand) =="
sed -n '180,260p' test/extended/edge_topologies/utils/core/ssh.go
echo
echo "== ssh.go: timeout/cancel related usage =="
rg -n 'ConnectTimeout|BatchMode|ServerAliveInterval|ServerAliveCountMax|CommandContext|WithTimeout|context\.' test/extended/edge_topologies/utils/core/ssh.go
Repository: openshift/origin
Length of output: 8508
Bound each two-hop SSH attempt.
test/extended/edge_topologies/utils/core/ssh.go’s ExecuteRemoteSSHCommand just builds a nested ssh and calls ExecuteSSHCommand, which runs exec.Command("ssh", ...) with no ConnectTimeout/BatchMode/keepalive options and no CommandContext/deadline. A hung (half-booted) node can therefore block the Eventually callback past the assertion’s retry window instead of returning and retrying.
Make each SSH invocation time-bounded (e.g., switch to exec.CommandContext with a per-attempt timeout, and/or add ssh options like -o BatchMode=yes and -o ConnectTimeout=<...> in the helper).
Also applies to the core.ExecuteRemoteSSHCommand call sites at: 517-519, 533-535, 558-570, 573-585, 589-591.
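The two mechanisms named above can be sketched roughly as follows. This is an illustration only, not the code in `ssh.go`: `sshTimeoutOptions` and `runWithTimeout` are hypothetical helpers, and `sleep` stands in for a hung `ssh` process so the example runs without a remote host.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"time"
)

// sshTimeoutOptions returns OpenSSH client options that make a connection
// attempt fail fast instead of prompting or waiting indefinitely.
func sshTimeoutOptions(connectTimeoutSeconds int) []string {
	return []string{
		"-o", "BatchMode=yes",
		"-o", "ConnectTimeout=" + strconv.Itoa(connectTimeoutSeconds),
	}
}

// runWithTimeout runs a command and kills it if it outlives perAttempt,
// returning whatever output was captured plus a descriptive error.
func runWithTimeout(perAttempt time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), perAttempt)
	defer cancel()
	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return string(out), fmt.Errorf("%s timed out after %s: %w", name, perAttempt, ctx.Err())
	}
	return string(out), err
}

func main() {
	// The command is killed after 200ms, so the call returns an error
	// long before the 5-second sleep would finish.
	_, err := runWithTimeout(200*time.Millisecond, "sleep", "5")
	fmt.Println("error:", err)
	fmt.Println(sshTimeoutOptions(10))
}
```

Inside an `Eventually` callback, a per-attempt bound like this lets a hung attempt return an error so the next poll can proceed. Note that `ConnectTimeout` only bounds connection setup, while the context deadline bounds the whole local process.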
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/edge_topologies/tnf_recovery.go` around lines 496 - 499, The
SSH helpers (ExecuteRemoteSSHCommand and ExecuteSSHCommand) currently spawn
exec.Command("ssh", ...) with no per-attempt deadline, which can hang the test;
update these helpers to time-bound every two-hop invocation by using
exec.CommandContext with a context created via context.WithTimeout for each call
and/or append safe SSH options such as "-o BatchMode=yes" and "-o
ConnectTimeout=<seconds>" (and optionally "-o ServerAliveInterval=<sec>") to the
constructed ssh arguments so nested SSH calls also time out; ensure every call
site that builds a nested ssh command (including ExecuteRemoteSSHCommand) uses
the new context-aware invocation so a hung node returns an error quickly and
allows the Eventually retry loop to proceed.
The SSH calls are already inside Eventually loops with 5-20 minute timeouts and 5-second poll intervals. If an SSH call hangs, the Eventually will time out and fail the test. Adding CommandContext or ConnectTimeout to the shared SSH helpers would change behavior for all callers across the test suite (cold-boot tests, node replacement tests, etc.); that's a much larger blast radius than the kernel panic fix we're making. The finding is valid in principle but it's not causing the current failures and shouldn't be mixed into this PR.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Resolve the CodeRabbit review comment on this PR
gh api \
--method PUT \
-H "Accept: application/vnd.github+json" \
"repos/openshift/origin/pulls/31216/reviews" 2>/dev/null || true
# Find and resolve the specific review comment (ID from the URL: r3316838694)
gh api \
--method PATCH \
-H "Accept: application/vnd.github+json" \
"repos/openshift/origin/pulls/comments/3316838694/replies" 2>/dev/null || true
# List review threads to find the right one to resolve
gh api \
--method GET \
-H "Accept: application/vnd.github.v3+json" \
"repos/openshift/origin/pulls/31216/comments" \
--jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[0:80]}'
Length of output: 1595
🏁 Script executed:
#!/bin/bash
# Use GraphQL to find the review thread and resolve it
# First, get the pull request node ID and review threads
gh api graphql -f query='
{
repository(owner: "openshift", name: "origin") {
pullRequest(number: 31216) {
id
reviewThreads(first: 50) {
nodes {
id
isResolved
comments(first: 1) {
nodes {
databaseId
author { login }
}
}
}
}
}
}
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3316838694) | {threadId: .id, isResolved: .isResolved}'
Length of output: 747
🏁 Script executed:
#!/bin/bash
# Resolve the review thread
gh api graphql -f query='
mutation {
resolveReviewThread(input: {threadId: "PRRT_kwDOAVZyrM6FWyxq"}) {
thread {
id
isResolved
}
}
}'
Length of output: 705
@kasturinarra, understood — the reasoning is solid and I agree this belongs in a separate PR. I wasn't able to auto-resolve the thread due to integration permissions; you can resolve it manually.
Would you like me to open a follow-up GitHub issue to track adding per-attempt ConnectTimeout/CommandContext to the shared SSH helpers (ExecuteSSHCommand / ExecuteRemoteSSHCommand) so it doesn't get lost?
🧠 Learnings used
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When reviewing Go code that uses OpenShift configv1 (`github.com/openshift/api/config/v1`), note that `ImageSpec.RegistrySources` is declared as a non-pointer value (`RegistrySources RegistrySources`). It is always safely zero-initialized by Go, so it can never be `nil`. Therefore, do not add nil-guards for `imageSpec.RegistrySources` before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`).
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When using the OpenShift `configv1` API (`github.com/openshift/api/config/v1`), treat `ImageSpec.RegistrySources` as a non-nil Go struct value (`RegistrySources`, not `*RegistrySources`). Because it can never be nil (it’s always zero-initialized), don’t add nil-guards before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`). You may still need to handle zero-value contents, but a nil check on `RegistrySources` itself is unnecessary.
|
Scheduling required tests: Scheduling tests matching the |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-ipv6-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f27451f0-5a7c-11f1-95c8-48a54c1d4628-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/07adf990-5a7d-11f1-9a85-aa70fad8c496-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-ipv6-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c388cf0-5a8c-11f1-82ae-b08103e38e90-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/14bf2af0-5a8c-11f1-870f-4c4c78eb719a-0 |
|
/test pull-ci-openshift-origin-release-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
/retest |
|
@kasturinarra: The following test failed, say
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery |
|
@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/04059ae0-5b55-11f1-9f3d-9e6c7cdbfdb9-0 |
Summary by CodeRabbit