Indirect agent connection improvements#13345
Conversation
…ements. - Enhances the Host connecting logic to avoid connecting storm (where Agent opens multiple sockets against Management Server). - Implements HostConnectProcess task where Host upon connection checks whether lock is available, traces Host connecting progress, status and timeout. - Introduces AgentConnectStatusCommand, where Host checks whether lock for the Host is available (i.e. "previous" connect process is finished). - Implementes logic to check whether Management Server has lock against Host (exposed MySQL DB lock presence via API) - Removes synchronization on Host disconnect process, double-disconnect logic in clustered Management Server environment, added early removal from ping map (in case of combination ping timeout delay + synchronized disconnect process the Agent Manager submits more disconnect requests) - Introduces parameterized connection and status check timeouts - Implements backoff algorithm abstraction - can be used either constant backoff timeout or exponential with jitter to wait between connection Host attempts to Management Server - Implements ServerAttache to be used on the Agent side of communication (similar to Attache on Management Server side) - Enhances/Adds logs significantly to Host Agent and Agent Manager logic to trace Host connecting and disconnecting process, including ids, names, context UUIDs and timings (how much time took overall initialization/deinitialization) - Adds logs to communication between Management Servers (PDU requests) - Adds DB indexes to improve search performance, uses IDEMPOTENT_ADD_INDEX for safer DB schema updates
- Bug 1 fix (AgentManagerImpl.java) — GlobalLock.isLockAvailable() now only runs when the host status is not alive. This eliminates one IS_FREE_LOCK DB query per ping per healthy host, which is the direct cause of listHosts/listNetworks degradation. - Bug 2 fix (HostConnectProcess.java) — shutdown() → shutdownNow(). Old thread pools from prior connect cycles are now interrupted immediately instead of draining their queued tasks, preventing thread accumulation during reconnect storms. - Bug 3 fix (AgentManagerImpl.java) — Lock timeout in handleDisconnectWithoutInvestigation now logs a warn instead of silently discarding the disconnect event. - Bug 4 fix (ServerAttache.java) — Alarm ScheduledFuture handles are now tracked in _alarmFutures and cancelled when the corresponding listener is unregistered or all commands are cancelled on disconnect. - Bug 5 fix (AgentAttache.java) - Fix Alarm ScheduledFuture handles in AgentAttache as well
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #13345 +/- ##
============================================
+ Coverage 18.10% 18.13% +0.02%
- Complexity 16750 16801 +51
============================================
Files 6037 6047 +10
Lines 542798 544317 +1519
Branches 66457 66605 +148
============================================
+ Hits 98298 98703 +405
- Misses 433453 434516 +1063
- Partials 11047 11098 +51
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR is a broad refactor/enhancement of indirect agent connection handling across both Management Server and Agent sides. It introduces a host-side connect/status-check process, adds configurable backoff and host-status heuristics, and expands logging/lock-check mechanisms to reduce reconnect storms and improve observability.
Changes:
- Adds agent-side connection orchestration (status polling + startup submission) and a server-side
AgentConnectStatus*command/answer pair to coordinate connect progress. - Introduces configurable backoff (factory + exponential-with-jitter implementation) and propagates configuration from Management Server to Agent during startup.
- Refactors NIO connection lifecycle and adds supporting utilities/logging, plus DB lock availability checks and DB indexes for performance.
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| utils/src/test/java/com/cloud/utils/testcase/NioTest.java | Test logging cleanup (exception logging). |
| utils/src/test/java/com/cloud/utils/backoff/impl/ConstantTimeBackoffTest.java | Updates config key name for constant backoff. |
| utils/src/main/java/com/cloud/utils/nio/NioServer.java | NIO server init tweaks and docstring. |
| utils/src/main/java/com/cloud/utils/nio/NioConnection.java | Connection lifecycle refactor, selector loop/logging, reject logic. |
| utils/src/main/java/com/cloud/utils/nio/NioClient.java | Client connect/handshake logging and expanded cleanup. |
| utils/src/main/java/com/cloud/utils/nio/Link.java | Adds local-port tracking, richer toString, termination helpers/logging changes. |
| utils/src/main/java/com/cloud/utils/nio/HandlerFactory.java | Changes new-connection registration API to InetSocketAddress. |
| utils/src/main/java/com/cloud/utils/net/NetUtils.java | Null guard in hostname-to-IP resolution helper. |
| utils/src/main/java/com/cloud/utils/LogUtils.java | Adds host logging helper with optional reverse lookup. |
| utils/src/main/java/com/cloud/utils/exception/CSExceptionErrorCode.java | Adds error-code mapping for ConnectionException. |
| utils/src/main/java/com/cloud/utils/DateUtil.java | Adds formatMillis duration formatter. |
| utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoffMBean.java | New MBean interface for exponential-jitter backoff. |
| utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoff.java | New exponential-with-jitter backoff implementation. |
| utils/src/main/java/com/cloud/utils/backoff/impl/ConstantTimeBackoff.java | Migrates to namespaced config keys + exposes configuration. |
| utils/src/main/java/com/cloud/utils/backoff/BackoffFactory.java | New backoff factory for selecting/configuring algorithms. |
| utils/src/main/java/com/cloud/utils/backoff/BackoffAlgorithm.java | Adds getConfiguration() API for propagating settings. |
| server/src/main/java/org/apache/cloudstack/agent/lb/IndirectAgentLBServiceImpl.java | Adds notes/timeout tweaks around agent migration command dispatch. |
| framework/db/src/main/java/com/cloud/utils/db/GlobalLock.java | Refactors lock bookkeeping/logging and adds lock-availability query. |
| framework/db/src/main/java/com/cloud/utils/db/DbUtil.java | Adds IS_FREE_LOCK helper and improves logging around lock DB connections. |
| framework/config/src/main/java/org/apache/cloudstack/framework/config/ConfigKeyUtil.java | New utility to parse key=value;... configuration strings. |
| framework/cluster/src/main/java/com/cloud/cluster/dao/ManagementServerHostDaoImpl.java | Adds DAO method to list all MS hosts including removed. |
| framework/cluster/src/main/java/com/cloud/cluster/dao/ManagementServerHostDao.java | Adds DAO API for “including removed” listing. |
| framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletImpl.java | Adds request logging and more detailed RemoteException messages. |
| framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletHttpHandler.java | Adds debug logging of inbound request line/body. |
| engine/schema/src/main/resources/META-INF/db/schema-42210to42300.sql | Adds indexes via idempotent add-index procedure calls. |
| engine/schema/src/main/resources/META-INF/db/procedures/cloud.idempotent_add_index.sql | Adds/defines IDEMPOTENT_ADD_INDEX procedure. |
| engine/orchestration/src/test/java/com/cloud/agent/manager/ClusteredAgentManagerImplTest.java | Expands tests for disconnect broadcast behavior and lock usage. |
| engine/orchestration/src/test/java/com/cloud/agent/manager/AgentManagerImplTest.java | Adds tests for config keys and deregister/disconnect behaviors. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/ClusteredAgentManagerImpl.java | Refactors attache creation/removal and event handling behavior. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java | Major connect/disconnect refactor, new config keys, status checks, backoff propagation. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java | Cancels listener alarm futures on unregister/cancel paths. |
| core/src/main/java/org/apache/cloudstack/threadcontext/ThreadContextUtil.java | New helper for propagating Log4j ThreadContext across threads. |
| core/src/main/java/com/cloud/resource/ServerResource.java | Changes default isExitOnFailures() behavior. |
| core/src/main/java/com/cloud/agent/transport/Request.java | Improves JSON deserialization error logging. |
| core/src/main/java/com/cloud/agent/api/StartupAnswer.java | Adds params + agent-side status-check delay transport fields. |
| core/src/main/java/com/cloud/agent/api/AgentConnectStatusCommand.java | New command for host connect-status checks. |
| core/src/main/java/com/cloud/agent/api/AgentConnectStatusAnswer.java | New answer carrying lock availability and host status. |
| agent/src/test/java/com/cloud/agent/HostConnectProcessTest.java | New test around scheduling the agent connect process. |
| agent/src/test/java/com/cloud/agent/AgentTest.java | Updates tests for new link/logging and reconnect helpers. |
| agent/src/main/java/com/cloud/agent/SynchronousListener.java | New blocking listener for synchronous waits on answers. |
| agent/src/main/java/com/cloud/agent/ServerListener.java | New listener interface for agent-side ServerAttache callbacks. |
| agent/src/main/java/com/cloud/agent/ServerAttache.java | New agent-side counterpart to MS Attache for command/answer flow. |
| agent/src/main/java/com/cloud/agent/properties/AgentProperties.java | Adds new agent properties for async timeouts and status-check delay. |
| agent/src/main/java/com/cloud/agent/IAgentShell.java | Adds setter for dynamically updated backoff algorithm. |
| agent/src/main/java/com/cloud/agent/HostConnectProcess.java | New connect/status-check orchestration loop on the agent side. |
| agent/src/main/java/com/cloud/agent/AgentShell.java | Uses backoff factory + persists configuration; relaxes version handling. |
| agent/src/main/java/com/cloud/agent/Agent.java | Major reconnect/startup refactor; integrates HostConnectProcess and ServerAttache. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18146 |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
| Map<String, String> backoffConfiguration = ConfigKeyUtil.toMap(BackoffConfiguration.value()); | ||
| StartupAnswer[] answers = new StartupAnswer[cmds.length]; | ||
| Command cmd; | ||
| for (int i = 0; i < cmds.length; i++) { | ||
| cmd = cmds[i]; | ||
| if (cmd instanceof StartupRoutingCommand || cmd instanceof StartupProxyCommand || cmd instanceof StartupSecondaryStorageCommand || | ||
| cmd instanceof StartupStorageCommand) { | ||
| answers[i] = new StartupAnswer((StartupCommand) cmds[i], 0, "", "", mgmtServiceConf.getPingInterval()); | ||
| break; | ||
| if (cmd instanceof StartupRoutingCommand || cmd instanceof StartupProxyCommand || cmd instanceof StartupSecondaryStorageCommand | ||
| || cmd instanceof StartupStorageCommand) { | ||
| StartupAnswer answer = new StartupAnswer((StartupCommand) cmds[i], 0, "", "", mgmtServiceConf.getPingInterval()); | ||
| answer.setParams(backoffConfiguration); | ||
| answer.setAgentHostStatusCheckDelaySec(AgentHostStatusCheckDelay.value()); | ||
| answers[i] = answer; | ||
| } | ||
| } | ||
| Response response; | ||
| response = new Response(request, answers[0], _nodeId, -1); | ||
| Response response = new Response(request, answers, _nodeId, -1); | ||
| try { |
| private AgentConnectStatusAnswer getConnectStatusAnswer(HostVO hostVo, AgentConnectStatusCommand cmd) { | ||
| long hostId = hostVo.getId(); | ||
| String hostName = hostVo.getName(); | ||
| String lockName = getHostJoinLockName(hostId); | ||
| Status status = hostVo.getStatus(); | ||
| try { | ||
| boolean lockAvailable = GlobalLock.isLockAvailable(lockName); | ||
| String details = String.format("Global lock %s is%s present for %s", lockName, lockAvailable ? "" : " not", | ||
| hostName); | ||
| logger.debug(details); | ||
| return getAgentConnectStatusAnswer(cmd, lockName, hostName, lockAvailable, status, details); | ||
| } catch (RuntimeException e) { | ||
| String msg = String.format("Failed to check global lock %s presence for %s: %s", lockName, hostName, | ||
| e.getMessage()); | ||
| logger.warn(msg, e); | ||
| return new AgentConnectStatusAnswer(cmd, false, msg); | ||
| } |
| public static boolean isLockAvailable(String name) { | ||
| if (logger.isDebugEnabled()) { | ||
| logger.debug("Checking lock present for {}", name); | ||
| } | ||
| boolean result = false; | ||
| try { | ||
| result = DbUtil.isFreeLock(name); | ||
| } finally { | ||
| if (logger.isDebugEnabled()) { | ||
| logger.debug("Result of checking lock present for {}: {}", name, result); | ||
| } | ||
| } | ||
| return result; |
| boolean reconnectForCurrentLink = link == this.link; | ||
| boolean currentLinkTerminated = this.link != null && this.link.isTerminated(); | ||
| boolean reconnectForNewHost = this.hostname != null && this.hostname.equals(preferredHost); | ||
| // if none of the above is true | ||
| boolean stormDetected = ! (reconnectForCurrentLink || currentLinkTerminated || reconnectForNewHost); |
| @Test | ||
| public void testScheduleConnectProcess() throws InterruptedException, CloudException { | ||
|
|
||
| hostConnectProcess.scheduleConnectProcess(link, connectionTransfer); | ||
| Assert.assertTrue(hostConnectProcess.isInProgress()); | ||
| } | ||
| } |
| ServerResource serverResource = _agent.getResource(); | ||
| StartupCommand[] startup = serverResource.initialize(); | ||
| if (ArrayUtils.isEmpty(startup)) { |
| public static String getHostLog(String address, Integer port) { | ||
| try { | ||
| InetAddress inetAddress = InetAddress.getByName(address); | ||
| String hostName = inetAddress.getHostName(); | ||
| String ipAddress = inetAddress.getHostAddress(); | ||
| if (port == null) { | ||
| return String.format("%s/%s", ipAddress, hostName); | ||
| } | ||
| return String.format("%s/%s:%s", ipAddress, hostName, port); | ||
| } catch (UnknownHostException e) { | ||
| LOGGER.warn("Failed to resolve name for address {}", address, e); | ||
| } | ||
| if (port == null) { | ||
| return address; | ||
| } | ||
| return String.format("%s:%s", address, port); | ||
| } |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18149 |
Description
Continued from #13028
This PR improves the Indirect agent connection handling, has the following improvements.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Checked the indirect agents connections (and re-connections) with KVM hosts, SSVM & CPVM.
How did you try to break this feature and the system with this change?