CORENET-6410: Add EgressIP disruption monitor test for upgrade validation #31011

bpickard22 wants to merge 1 commit into openshift:main
Conversation
Walkthrough

Adds a new EgressIP availability monitor test: registers it, implements the monitor that allocates an EgressIP, deploys target and poller workloads to detect SNAT disruptions, and provides a dedicated test namespace manifest.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Controller as EgressIP Monitor
    participant K8sAPI as Kubernetes API
    participant Node as Worker Node
    participant Target as Target Pod (host-network)
    participant Poller as Poller Pod
    participant Logs as Pod Logs

    Controller->>K8sAPI: Create Namespace
    Controller->>K8sAPI: Label selected Node (egress-assignable)
    Controller->>K8sAPI: Create EgressIP CR
    activate K8sAPI
    K8sAPI->>Node: Assign EgressIP to node
    deactivate K8sAPI
    Controller->>K8sAPI: Deploy Target (host-network) on Node
    Controller->>K8sAPI: Deploy Poller
    activate Target
    activate Poller
    loop Continuous monitoring
        Poller->>Target: HTTP request (check source IP)
        Target-->>Poller: Response (source IP observed)
        Poller->>Logs: Emit JSON line {ok: true|false, ts, msg}
    end
    deactivate Target
    deactivate Poller
    Controller->>Logs: Read Poller logs
    Controller->>Controller: Parse JSON lines -> construct disruption intervals
    Controller->>K8sAPI: Delete EgressIP CR
    Controller->>K8sAPI: Remove Node label
    Controller->>K8sAPI: Delete Namespace
```
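Each poller log line in the diagram carries `{ok, ts, msg}`. A minimal Go sketch of parsing one such line — the struct fields and their types here are assumptions based on the diagram, not the monitor's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// probeResult mirrors the JSON lines the poller emits. The exact
// field names are an assumption taken from the sequence diagram.
type probeResult struct {
	OK  bool      `json:"ok"`
	TS  time.Time `json:"ts"` // RFC 3339 timestamps decode directly into time.Time
	Msg string    `json:"msg"`
}

// parseProbeLine decodes a single poller log line, rejecting malformed input.
func parseProbeLine(line string) (probeResult, error) {
	var r probeResult
	if err := json.Unmarshal([]byte(line), &r); err != nil {
		return probeResult{}, fmt.Errorf("malformed probe line %q: %w", line, err)
	}
	return r, nil
}

func main() {
	r, err := parseProbeLine(`{"ok":false,"ts":"2024-01-01T00:00:05Z","msg":"unexpected source IP"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.OK, r.Msg) // → false unexpected source IP
}
```

Lines that fail to decode would surface as errors rather than silently dropping a sample, which matters when reconstructing disruption intervals from the log stream.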
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (7 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bpickard22. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
We don't currently have any test coverage for EgressIP traffic continuity during cluster upgrades. When nodes roll, OVN controllers restart, and leader elections happen, there are several windows where EgressIP SNAT can briefly stop working -- but nothing catches that today.

This adds a new MonitorTest that continuously validates EgressIP SNAT throughout the upgrade process. It sets up a target pod on the host network, creates an EgressIP CR for a test namespace, and runs a poller that checks the source IP seen by the target matches the expected EgressIP. Any SNAT mismatches or connection failures are recorded as disruption intervals.

The test skips gracefully on non-OVN clusters, single-node topologies, MicroShift, and clusters without enough annotated worker nodes. It uses Deployments (not bare Pods) so the poller survives node drains, patches node labels to avoid update races, and includes CloudPrivateIPConfig reservations when allocating IPs to prevent cloud-level conflicts.

For now, we set a generous 120s disruption threshold to catch catastrophic regressions while we collect baseline data from CI runs.

Assisted by Claude Opus 4.6

Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
@bpickard22: This pull request references CORENET-6410 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/monitortests/network/disruptionegressip/monitortest.go`:
- Around line 493-495: The test currently pins the target pod with NodeSelector
{"kubernetes.io/hostname": targetNode.Name} and the poller assumes a single
hardcoded node IP, which causes false failures during node drains; remove the
hard NodeSelector pin (stop using targetNode.Name for scheduling) and instead
deploy multiple hostNetwork backends behind a Service (or use a movable/stable
endpoint) so backends can float across nodes, then change the poller logic (the
code that currently hardcodes the node IP in the poller block) to query the
Service/endpoint DNS or iterate over multiple backend endpoints and validate the
SNAT/source IP from whichever backend answers rather than expecting a single
node IP.
- Around line 384-399: The test currently only reads stdout from still-running
poller pods (LabelSelector "app=egressip-disruption-poller" and function
w.collectPodLogs), which loses evidence from pods deleted during restarts;
change the design so probe results are persisted outside pod lifetime and read
from that persistent store instead of relying on live pod logs. Concretely:
modify the poller to write its intervals/results to durable storage (e.g., a
shared PVC/log file, a ConfigMap/CRD, or an external aggregator/metrics
endpoint) and update the monitor code that currently calls w.collectPodLogs and
lists pods to read from that persistent interface (or Kubernetes events/metrics)
so deleted pods’ data are preserved; keep the LabelSelector and pod-listing
logic only as a fallback, and ensure functions like collectPodLogs are adapted
or replaced to query the new persistent source.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b0859dac-896f-4eb6-9bd1-6205581a28ed
📒 Files selected for processing (3)
- pkg/defaultmonitortests/types.go
- pkg/monitortests/network/disruptionegressip/monitortest.go
- pkg/monitortests/network/disruptionegressip/namespace.yaml
```go
pods, err := w.kubeClient.CoreV1().Pods(w.namespaceName).List(ctx, metav1.ListOptions{
	LabelSelector: "app=egressip-disruption-poller",
})
if err != nil {
	return nil, nil, []error{fmt.Errorf("failed to list poller pods: %w", err)}
}
if len(pods.Items) == 0 {
	return nil, nil, []error{fmt.Errorf("no poller pods found")}
}

var allIntervals monitorapi.Intervals
var allErrors []error
for _, pod := range pods.Items {
	intervals, errs := w.collectPodLogs(ctx, pod.Name)
	allIntervals = append(allIntervals, intervals...)
	allErrors = append(allErrors, errs...)
```
Only scraping surviving poller pods loses earlier disruption evidence.
Lines 384-399 list the poller pods that still exist at collection time, and Lines 415-416 stream only the current container logs from each. Any poller pod deleted during a drain/restart takes its stdout history with it, so outages that happened before the final replica came up are silently dropped. Persist probe results outside pod lifetime, or emit intervals continuously instead of reconstructing them only from final pod logs.
Also applies to: 415-416
```go
NodeSelector: map[string]string{
	"kubernetes.io/hostname": targetNode.Name,
},
```
Pinning the target to one worker turns node drains into false egress-ip failures.
Lines 493-495 force the target onto a single node, and Lines 523-542 hardcode that node IP into the poller. If that worker is drained or rebooted during the upgrade, the backend disappears until the same node returns, so the monitor measures target-node availability rather than SNAT availability. Use a stable endpoint that can move with the target, or keep multiple host-network backends behind a Service and validate the source IP from whichever backend answers.
Also applies to: 523-542
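The reviewer's multi-backend variant could look roughly like the manifest below: two hostNetwork replicas spread across workers behind a Service, so the poller targets a stable endpoint instead of one node IP. All names, the namespace, the image, and the port here are hypothetical, not the PR's actual manifests:

```yaml
# Illustrative sketch only; names, namespace, image, and port are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: egressip-target
  namespace: e2e-egressip-disruption
spec:
  selector:
    app: egressip-target
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: egressip-target
  namespace: e2e-egressip-disruption
spec:
  replicas: 2
  selector:
    matchLabels:
      app: egressip-target
  template:
    metadata:
      labels:
        app: egressip-target
    spec:
      hostNetwork: true
      # Anti-affinity spreads replicas across nodes, so draining one
      # worker still leaves a backend answering behind the Service.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: egressip-target
              topologyKey: kubernetes.io/hostname
      containers:
        - name: target
          image: registry.example.com/agnhost:latest  # hypothetical echo server image
          args: ["netexec", "--http-port=8080"]
```

With this shape the poller would validate the SNAT source IP from whichever backend answers, rather than conflating a single node's availability with EgressIP availability.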
@bpickard22: The following tests failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.