Skip to content

Indirect agent connection improvements#13345

Open
sureshanaparti wants to merge 8 commits into
apache:mainfrom
shapeblue:indirect-agent-connection-improvements
Open

Indirect agent connection improvements#13345
sureshanaparti wants to merge 8 commits into
apache:mainfrom
shapeblue:indirect-agent-connection-improvements

Conversation

@sureshanaparti
Copy link
Copy Markdown
Contributor

@sureshanaparti sureshanaparti commented Jun 4, 2026

Description

Continued from #13028

This PR improves the Indirect agent connection handling, has the following improvements.

  • Enhances the Host connecting logic to avoid connecting storm (where Agent opens multiple sockets against Management Server).
  • Implements HostConnectProcess task where Host upon connection checks whether lock is available, traces Host connecting progress, status and timeout.
  • Introduces AgentConnectStatusCommand, where Host checks whether lock for the Host is available (i.e. "previous" connect process is finished).
  • Implementes logic to check whether Management Server has lock against Host (exposed MySQL DB lock presence via API)
  • Removes synchronization on Host disconnect process, double-disconnect logic in clustered Management Server environment, added early removal from ping map (in case of combination ping timeout delay + synchronized disconnect process the Agent Manager submits more disconnect requests)
  • Introduces parameterized connection and status check timeouts
  • Implements backoff algorithm abstraction - can be used either constant backoff timeout or exponential with jitter to wait between connection Host attempts to Management Server
  • Implements ServerAttache to be used on the Agent side of communication (similar to Attache on Management Server side)
  • Enhances/Adds logs significantly to Host Agent and Agent Manager logic to trace Host connecting and disconnecting process, including ids, names, context UUIDs and timings (how much time took overall initialization/deinitialization)
  • Adds logs to communication between Management Servers (PDU requests)
  • Adds DB indexes to improve search performance, uses IDEMPOTENT_ADD_INDEX for safer DB schema updates

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Checked the indirect agents connections (and re-connections) with KVM hosts, SSVM & CPVM.

How did you try to break this feature and the system with this change?

mprokopchuk and others added 7 commits June 4, 2026 11:04
…ements.

- Enhances the Host connecting logic to avoid connecting storm (where Agent opens multiple sockets against Management Server).
- Implements HostConnectProcess task where Host upon connection checks whether lock is available, traces Host connecting progress, status and timeout.
- Introduces AgentConnectStatusCommand, where Host checks whether lock for the Host is available (i.e. "previous" connect process is finished).
- Implementes logic to check whether Management Server has lock against Host (exposed MySQL DB lock presence via API)
- Removes synchronization on Host disconnect process, double-disconnect logic in clustered Management Server environment, added early removal from ping map (in case of combination ping timeout delay + synchronized disconnect process the Agent Manager submits more disconnect requests)
- Introduces parameterized connection and status check timeouts
- Implements backoff algorithm abstraction - can be used either constant backoff timeout or exponential with jitter to wait between connection Host attempts to Management Server
- Implements ServerAttache to be used on the Agent side of communication (similar to Attache on Management Server side)
- Enhances/Adds logs significantly to Host Agent and Agent Manager logic to trace Host connecting and disconnecting process, including ids, names, context UUIDs and timings (how much time took overall initialization/deinitialization)
- Adds logs to communication between Management Servers (PDU requests)
- Adds DB indexes to improve search performance, uses IDEMPOTENT_ADD_INDEX for safer DB schema updates
- Bug 1 fix (AgentManagerImpl.java) — GlobalLock.isLockAvailable() now only runs when the host status is not alive. This eliminates one IS_FREE_LOCK DB query per ping per healthy host, which is the direct cause of listHosts/listNetworks degradation.
- Bug 2 fix (HostConnectProcess.java) — shutdown() → shutdownNow(). Old thread pools from prior connect cycles are now interrupted immediately instead of draining their queued tasks, preventing thread accumulation during reconnect storms.
- Bug 3 fix (AgentManagerImpl.java) — Lock timeout in handleDisconnectWithoutInvestigation now logs a warn instead of silently discarding the disconnect event.
- Bug 4 fix (ServerAttache.java) — Alarm ScheduledFuture handles are now tracked in _alarmFutures and cancelled when the corresponding listener is unregistered or all commands are cancelled on disconnect.
- Bug 5 fix (AgentAttache.java) - Fix Alarm ScheduledFuture handles in AgentAttache as well
@sureshanaparti
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@sureshanaparti sureshanaparti added this to the 4.23.0 milestone Jun 4, 2026
@sureshanaparti sureshanaparti requested a review from nvazquez June 4, 2026 06:29
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 19.81132% with 1700 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.13%. Comparing base (be51948) to head (bb91ef2).

Files with missing lines Patch % Lines
agent/src/main/java/com/cloud/agent/Agent.java 9.83% 320 Missing and 1 partial ⚠️
...java/com/cloud/agent/manager/AgentManagerImpl.java 38.97% 264 Missing and 21 partials ⚠️
...t/src/main/java/com/cloud/agent/ServerAttache.java 8.25% 275 Missing and 3 partials ⚠️
.../main/java/com/cloud/agent/HostConnectProcess.java 22.83% 168 Missing and 1 partial ⚠️
...c/main/java/com/cloud/utils/nio/NioConnection.java 25.92% 87 Missing and 13 partials ⚠️
...s/src/main/java/com/cloud/utils/nio/NioClient.java 18.36% 79 Missing and 1 partial ⚠️
...ils/backoff/impl/ExponentialWithJitterBackoff.java 0.00% 72 Missing ⚠️
...b/src/main/java/com/cloud/utils/db/GlobalLock.java 15.66% 51 Missing and 19 partials ⚠️
...main/java/com/cloud/agent/SynchronousListener.java 0.00% 47 Missing ⚠️
...rk/db/src/main/java/com/cloud/utils/db/DbUtil.java 20.51% 31 Missing ⚠️
... and 21 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13345      +/-   ##
============================================
+ Coverage     18.10%   18.13%   +0.02%     
- Complexity    16750    16801      +51     
============================================
  Files          6037     6047      +10     
  Lines        542798   544317    +1519     
  Branches      66457    66605     +148     
============================================
+ Hits          98298    98703     +405     
- Misses       433453   434516    +1063     
- Partials      11047    11098      +51     
Flag Coverage Δ
uitests 3.51% <ø> (ø)
unittests 19.29% <19.81%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a broad refactor/enhancement of indirect agent connection handling across both Management Server and Agent sides. It introduces a host-side connect/status-check process, adds configurable backoff and host-status heuristics, and expands logging/lock-check mechanisms to reduce reconnect storms and improve observability.

Changes:

  • Adds agent-side connection orchestration (status polling + startup submission) and a server-side AgentConnectStatus* command/answer pair to coordinate connect progress.
  • Introduces configurable backoff (factory + exponential-with-jitter implementation) and propagates configuration from Management Server to Agent during startup.
  • Refactors NIO connection lifecycle and adds supporting utilities/logging, plus DB lock availability checks and DB indexes for performance.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
utils/src/test/java/com/cloud/utils/testcase/NioTest.java Test logging cleanup (exception logging).
utils/src/test/java/com/cloud/utils/backoff/impl/ConstantTimeBackoffTest.java Updates config key name for constant backoff.
utils/src/main/java/com/cloud/utils/nio/NioServer.java NIO server init tweaks and docstring.
utils/src/main/java/com/cloud/utils/nio/NioConnection.java Connection lifecycle refactor, selector loop/logging, reject logic.
utils/src/main/java/com/cloud/utils/nio/NioClient.java Client connect/handshake logging and expanded cleanup.
utils/src/main/java/com/cloud/utils/nio/Link.java Adds local-port tracking, richer toString, termination helpers/logging changes.
utils/src/main/java/com/cloud/utils/nio/HandlerFactory.java Changes new-connection registration API to InetSocketAddress.
utils/src/main/java/com/cloud/utils/net/NetUtils.java Null guard in hostname-to-IP resolution helper.
utils/src/main/java/com/cloud/utils/LogUtils.java Adds host logging helper with optional reverse lookup.
utils/src/main/java/com/cloud/utils/exception/CSExceptionErrorCode.java Adds error-code mapping for ConnectionException.
utils/src/main/java/com/cloud/utils/DateUtil.java Adds formatMillis duration formatter.
utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoffMBean.java New MBean interface for exponential-jitter backoff.
utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoff.java New exponential-with-jitter backoff implementation.
utils/src/main/java/com/cloud/utils/backoff/impl/ConstantTimeBackoff.java Migrates to namespaced config keys + exposes configuration.
utils/src/main/java/com/cloud/utils/backoff/BackoffFactory.java New backoff factory for selecting/configuring algorithms.
utils/src/main/java/com/cloud/utils/backoff/BackoffAlgorithm.java Adds getConfiguration() API for propagating settings.
server/src/main/java/org/apache/cloudstack/agent/lb/IndirectAgentLBServiceImpl.java Adds notes/timeout tweaks around agent migration command dispatch.
framework/db/src/main/java/com/cloud/utils/db/GlobalLock.java Refactors lock bookkeeping/logging and adds lock-availability query.
framework/db/src/main/java/com/cloud/utils/db/DbUtil.java Adds IS_FREE_LOCK helper and improves logging around lock DB connections.
framework/config/src/main/java/org/apache/cloudstack/framework/config/ConfigKeyUtil.java New utility to parse key=value;... configuration strings.
framework/cluster/src/main/java/com/cloud/cluster/dao/ManagementServerHostDaoImpl.java Adds DAO method to list all MS hosts including removed.
framework/cluster/src/main/java/com/cloud/cluster/dao/ManagementServerHostDao.java Adds DAO API for “including removed” listing.
framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletImpl.java Adds request logging and more detailed RemoteException messages.
framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletHttpHandler.java Adds debug logging of inbound request line/body.
engine/schema/src/main/resources/META-INF/db/schema-42210to42300.sql Adds indexes via idempotent add-index procedure calls.
engine/schema/src/main/resources/META-INF/db/procedures/cloud.idempotent_add_index.sql Adds/defines IDEMPOTENT_ADD_INDEX procedure.
engine/orchestration/src/test/java/com/cloud/agent/manager/ClusteredAgentManagerImplTest.java Expands tests for disconnect broadcast behavior and lock usage.
engine/orchestration/src/test/java/com/cloud/agent/manager/AgentManagerImplTest.java Adds tests for config keys and deregister/disconnect behaviors.
engine/orchestration/src/main/java/com/cloud/agent/manager/ClusteredAgentManagerImpl.java Refactors attache creation/removal and event handling behavior.
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Major connect/disconnect refactor, new config keys, status checks, backoff propagation.
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java Cancels listener alarm futures on unregister/cancel paths.
core/src/main/java/org/apache/cloudstack/threadcontext/ThreadContextUtil.java New helper for propagating Log4j ThreadContext across threads.
core/src/main/java/com/cloud/resource/ServerResource.java Changes default isExitOnFailures() behavior.
core/src/main/java/com/cloud/agent/transport/Request.java Improves JSON deserialization error logging.
core/src/main/java/com/cloud/agent/api/StartupAnswer.java Adds params + agent-side status-check delay transport fields.
core/src/main/java/com/cloud/agent/api/AgentConnectStatusCommand.java New command for host connect-status checks.
core/src/main/java/com/cloud/agent/api/AgentConnectStatusAnswer.java New answer carrying lock availability and host status.
agent/src/test/java/com/cloud/agent/HostConnectProcessTest.java New test around scheduling the agent connect process.
agent/src/test/java/com/cloud/agent/AgentTest.java Updates tests for new link/logging and reconnect helpers.
agent/src/main/java/com/cloud/agent/SynchronousListener.java New blocking listener for synchronous waits on answers.
agent/src/main/java/com/cloud/agent/ServerListener.java New listener interface for agent-side ServerAttache callbacks.
agent/src/main/java/com/cloud/agent/ServerAttache.java New agent-side counterpart to MS Attache for command/answer flow.
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java Adds new agent properties for async timeouts and status-check delay.
agent/src/main/java/com/cloud/agent/IAgentShell.java Adds setter for dynamically updated backoff algorithm.
agent/src/main/java/com/cloud/agent/HostConnectProcess.java New connect/status-check orchestration loop on the agent side.
agent/src/main/java/com/cloud/agent/AgentShell.java Uses backoff factory + persists configuration; relaxes version handling.
agent/src/main/java/com/cloud/agent/Agent.java Major reconnect/startup refactor; integrates HostConnectProcess and ServerAttache.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread utils/src/main/java/com/cloud/utils/nio/Link.java
Comment thread utils/src/main/java/com/cloud/utils/nio/NioConnection.java
Comment thread core/src/main/java/com/cloud/resource/ServerResource.java
@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18146

@sureshanaparti
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 7 comments.

Comment on lines +2014 to 2028
Map<String, String> backoffConfiguration = ConfigKeyUtil.toMap(BackoffConfiguration.value());
StartupAnswer[] answers = new StartupAnswer[cmds.length];
Command cmd;
for (int i = 0; i < cmds.length; i++) {
cmd = cmds[i];
if (cmd instanceof StartupRoutingCommand || cmd instanceof StartupProxyCommand || cmd instanceof StartupSecondaryStorageCommand ||
cmd instanceof StartupStorageCommand) {
answers[i] = new StartupAnswer((StartupCommand) cmds[i], 0, "", "", mgmtServiceConf.getPingInterval());
break;
if (cmd instanceof StartupRoutingCommand || cmd instanceof StartupProxyCommand || cmd instanceof StartupSecondaryStorageCommand
|| cmd instanceof StartupStorageCommand) {
StartupAnswer answer = new StartupAnswer((StartupCommand) cmds[i], 0, "", "", mgmtServiceConf.getPingInterval());
answer.setParams(backoffConfiguration);
answer.setAgentHostStatusCheckDelaySec(AgentHostStatusCheckDelay.value());
answers[i] = answer;
}
}
Response response;
response = new Response(request, answers[0], _nodeId, -1);
Response response = new Response(request, answers, _nodeId, -1);
try {
Comment on lines +588 to +604
private AgentConnectStatusAnswer getConnectStatusAnswer(HostVO hostVo, AgentConnectStatusCommand cmd) {
long hostId = hostVo.getId();
String hostName = hostVo.getName();
String lockName = getHostJoinLockName(hostId);
Status status = hostVo.getStatus();
try {
boolean lockAvailable = GlobalLock.isLockAvailable(lockName);
String details = String.format("Global lock %s is%s present for %s", lockName, lockAvailable ? "" : " not",
hostName);
logger.debug(details);
return getAgentConnectStatusAnswer(cmd, lockName, hostName, lockAvailable, status, details);
} catch (RuntimeException e) {
String msg = String.format("Failed to check global lock %s presence for %s: %s", lockName, hostName,
e.getMessage());
logger.warn(msg, e);
return new AgentConnectStatusAnswer(cmd, false, msg);
}
Comment on lines +139 to +151
public static boolean isLockAvailable(String name) {
if (logger.isDebugEnabled()) {
logger.debug("Checking lock present for {}", name);
}
boolean result = false;
try {
result = DbUtil.isFreeLock(name);
} finally {
if (logger.isDebugEnabled()) {
logger.debug("Result of checking lock present for {}: {}", name, result);
}
}
return result;
Comment on lines +839 to +843
boolean reconnectForCurrentLink = link == this.link;
boolean currentLinkTerminated = this.link != null && this.link.isTerminated();
boolean reconnectForNewHost = this.hostname != null && this.hostname.equals(preferredHost);
// if none of the above is true
boolean stormDetected = ! (reconnectForCurrentLink || currentLinkTerminated || reconnectForNewHost);
Comment on lines +53 to +59
@Test
public void testScheduleConnectProcess() throws InterruptedException, CloudException {

hostConnectProcess.scheduleConnectProcess(link, connectionTransfer);
Assert.assertTrue(hostConnectProcess.isInProgress());
}
}
Comment on lines +226 to +228
ServerResource serverResource = _agent.getResource();
StartupCommand[] startup = serverResource.initialize();
if (ArrayUtils.isEmpty(startup)) {
Comment on lines +119 to +135
public static String getHostLog(String address, Integer port) {
try {
InetAddress inetAddress = InetAddress.getByName(address);
String hostName = inetAddress.getHostName();
String ipAddress = inetAddress.getHostAddress();
if (port == null) {
return String.format("%s/%s", ipAddress, hostName);
}
return String.format("%s/%s:%s", ipAddress, hostName, port);
} catch (UnknownHostException e) {
LOGGER.warn("Failed to resolve name for address {}", address, e);
}
if (port == null) {
return address;
}
return String.format("%s:%s", address, port);
}
@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants