initiating connection migration with tmb? #98
We have continued to see this message semi-frequently for several days, so I do not think we have an underlying network problem (we run on an enterprise-class DC network where any disturbances are resolved quickly), and I can't see any excessive GC pauses in our logs. Is this a message we need to act on, or is it not critical?
It seems to be triggered in the class com.oracle.coherence.common.internal.net.socketbus.BufferedSocketBus, method checkHealth(long), but it is still not clear to me exactly which ack is timing out, or what the code is trying to "do about it", i.e. the "migration".
Hi, these messages can be quite common depending on system size, load, and context (application busy, GC, ...), and as you can see they are harmless. The heartbeat (health check) is exchanged regularly by cluster members to ensure the cluster is whole. Everyone looks after each other, so that generates a fair amount of ancillary activity, and once in a while we detect issues with it. At that point we decide to "migrate" the connection, which is merely opening a new connection and ditching the old one. In some cases a "fresh" connection may resolve OS or JVM congestion issues. In most cases it is just a precaution that has no consequences.

These heartbeat messages are purely for housekeeping, and they are timed relatively aggressively to weed out potential issues. Application/data traffic is not impacted. If you do see them frequently, however, they are an indication that some tuning may be necessary: heap and GC pressure, network health, or native memory (OS/hardware), for example. Running network performance tests (https://docs.oracle.com/en/middleware/standalone/coherence/14.1.1.0/administer/performing-network-performance-test.html) can show you whether your network settings are OK; they can be eye-opening. Let us know how it goes.
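The ack-timeout check and "migration" described above can be sketched roughly like this. This is a minimal illustration of the idea, not Coherence's actual internals: the class and field names are invented, and real migration re-establishes the socket and replays unacknowledged traffic.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch of an ack-timeout health check that triggers connection
// migration. Names (Connection, checkHealth, migrate) are hypothetical, not
// the real BufferedSocketBus API.
public class AckHealthCheck {
    static final long ACK_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(15);

    static class Connection {
        long oldestUnackedSendNanos; // when the oldest unacknowledged message was sent
        boolean hasPendingAcks;      // is the peer expected to ack something?
        int migrations;              // running counter, as seen in the log ("migrations=17")

        // Periodic health check: if an expected ack is overdue, migrate.
        boolean checkHealth(long nowNanos) {
            if (hasPendingAcks && nowNanos - oldestUnackedSendNanos > ACK_TIMEOUT_NANOS) {
                migrate();
                return true;
            }
            return false;
        }

        // "Migration": open a fresh connection to the same peer and ditch the
        // old one; pending traffic would be re-sent on the new connection.
        void migrate() {
            migrations++;
            hasPendingAcks = false;
        }
    }

    public static void main(String[] args) {
        Connection c = new Connection();
        c.hasPendingAcks = true;
        long now = System.nanoTime();
        c.oldestUnackedSendNanos = now - TimeUnit.SECONDS.toNanos(16); // ack 16s overdue
        boolean migrated = c.checkHealth(now);
        System.out.println(migrated + " migrations=" + c.migrations);
    }
}
```

This matches the shape of the log message: a 15s ack timeout is detected during a periodic health check, and the remedy is simply a new connection plus an incremented migrations counter.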
Thanks a lot for the info - will continue investigating!
We are getting quite frequent warnings in our Coherence log on storage-enabled nodes mentioning "initiating connection migration with tmb", and I would like some information about what this means at a more technical level, and perhaps suggestions on the most common causes.
I am guessing that problems on the physical network, as well as long GC pauses, could result in more or less any network-related warning in Coherence, but are there other possible reasons as well, and are there any tips on how to further debug the problem?
2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.25:33316.40573 after 15s ack timeout health(read=false, write=true), receiptWait=Message "PartialValueResponse"
{
FromMember=Member(Id=17, Timestamp=2023-03-12 17:53:14.974, Address=138.106.96.41:9001, MachineId=46694, Location=site:sss.se.xxxxyyyy.com,machine:l4041p.sss.se.xxxxyyyy.com,process:391,member:l4041p-2, Role=storagenode)
FromMessageId=0
MessagePartCount=0
PendingCount=0
BufferCounter=1
MessageType=70
ToPollId=19827300
Poll=null
Packets
{
}
Service=PartitionedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, OldestMemberId=1, LocalStorage=enabled, PartitionCount=601, BackupCount=1, AssignedPartitions=18, BackupPartitions=19, CoordinatorId=1}
ToMemberSet=MemberSet(Size=1
Member(Id=179, Timestamp=2023-03-12 18:43:36.847, Address=138.106.96.25:33316, MachineId=48549, Location=site:sss.se.xxxxyyyy.com,machine:l4025p,process:3473,member:l4025p_11990, Role=scex)
)
NotifyDelivery=false
}: peer=tmb://138.106.96.25:33316.40573, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.25,port=38114,localport=9001]}, migrations=17, bytes(in=104371345, out=101784244), flushlock false, bufferedOut=0B, unflushed=0B, delivered(in=203177, out=197772), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=95922, out=99203/99206)
2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.32:41070.40752 after 15s ack timeout health(read=true, write=false), receiptWait=null: peer=tmb://138.106.96.32:41070.40752, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.32,port=41070,localport=36388]}, migrations=5, bytes(in=95752773, out=99458811), flushlock false, bufferedOut=1.54KB, unflushed=0B, delivered(in=192506, out=187239), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=90667, out=93950/93953)
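To gauge how often each peer is affected, warning lines like the two above can be scanned for the peer endpoint and the running migrations= counter. A rough sketch (the class name and regex here are illustrative, written against the log format shown above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pulls the peer endpoint and the running migration count out of a
// "initiating connection migration" warning line.
public class MigrationLogScan {
    // Matches "initiating connection migration with tmb://HOST:PORT.ID" and
    // the "migrations=N" counter that appears later on the same line.
    static final Pattern P = Pattern.compile(
        "initiating connection migration with (tmb://[\\d.]+:\\d+\\.\\d+)"
        + ".*?migrations=(\\d+)");

    // Returns {peer, migrations} or null if the line is not a migration warning.
    public static String[] scan(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String line = "... initiating connection migration with "
            + "tmb://138.106.96.25:33316.40573 after 15s ack timeout "
            + "health(read=false, write=true) ... migrations=17, bytes(...)";
        String[] hit = scan(line);
        System.out.println(hit[0] + " migrations=" + hit[1]);
        // prints: tmb://138.106.96.25:33316.40573 migrations=17
    }
}
```

Grouping these counts per peer over a day or two shows whether migrations cluster around specific members (pointing at one machine's GC or NIC) or are spread evenly (pointing at shared network or load issues).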