Quick Description
Restarting a running server in a server cluster causes subsequent connection attempts to that server from other groups to fail, leaving them stuck in an endless reconnection loop.
Explanation
This was experienced using:
OS: Windows 10
DarkRift Version: 2.10.1 (Pro)
Cluster.config file used:
<?xml version="1.0" encoding="utf-8" ?>
<cluster>
  <groups>
    <group name="SubServer" visibility="external">
      <connectsTo name="MainServer" />
    </group>
    <group name="MainServer" visibility="internal" />
  </groups>
</cluster>
Steps to reproduce:
(Assuming Consul is running)
- Start the MainServer
- Start the SubServer
So far the MainServer correctly picks up the SubServer, and the SubServer performs a connection attempt that results in "Connected to server 0 on 127.0.0.1:4000" as expected. All the events on the MainServer are fired as they should be. If I restart the SubServer, it keeps working as expected.
Now, to replicate the issue (I've tried it a few times now and this works every time):
- Stop the MainServer
- Start the MainServer again
- Start the SubServer
This is where it goes wrong: the SubServer ends up doing what looks like infinite reconnection attempts (notice that the logs say "Attempt 1" after every attempt). The only fix I have found is restarting the machine. After that, the first two steps (starting the MainServer, then the SubServer) work again, and the three steps above (stopping the MainServer, starting it again, then starting the SubServer) break it again.
Logs
When the issue occurs, the following traces are continuously spammed on the MainServer:
[Trace] DefaultNetworkListener Accepted TCP connection from 127.0.0.1:50099.
[Trace] DefaultNetworkListener Accepted UDP connection from 127.0.0.1:62471.
[Trace] RemoteServerManager New server connected, awaiting identification [127.0.0.1:50099|127.0.0.1:62471].
[Trace] RemoteServerManager Server at [127.0.0.1:50099|127.0.0.1:62471] has identified as server 4.
[Trace] RemoteServerManager Server at [127.0.0.1:50099|127.0.0.1:62471 connected and identified itself as server 4 however the registry has not yet propgated information about that server. The connection has been dropped.
And the following traces are continuously spammed on the SubServer:
[Trace] UpstreamServerGroup Lost connection to server 3 on 127.0.0.1:4000.
[Trace] UpstreamServerGroup Reconnecting to server 3 on 127.0.0.1:4000. Attempt 1.
[Info] UpstreamServerGroup Reconnected to server 3 on 127.0.0.1:4000.
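One possible reading of the two logs together (not confirmed against the DarkRift source) is that the transport-level connection succeeds each time, the attempt counter is reset on that success, and the MainServer then drops the connection once identification fails. That would explain why the SubServer always reports "Attempt 1". The sketch below only simulates that assumed behaviour; the function name, parameters, and log strings are hypothetical and are not DarkRift APIs.

```python
def simulate_reconnect_loop(registry_knows_server, max_cycles=3):
    """Return the log lines a client would emit under the assumed behaviour.

    registry_knows_server: hypothetical flag standing in for "the registry
    has propagated information about the connecting server".
    max_cycles: cap on simulated cycles so the loop terminates.
    """
    logs = []
    attempt = 0
    for _ in range(max_cycles):
        attempt += 1
        logs.append(f"Reconnecting. Attempt {attempt}.")
        # Assumption: the TCP/UDP connect itself succeeds, so the client
        # treats it as a successful reconnect and resets its counter.
        logs.append("Reconnected.")
        attempt = 0
        if registry_knows_server:
            break  # identification accepted, the connection stays up
        # Assumption: the remote server drops the connection after the
        # identification step is rejected.
        logs.append("Lost connection.")
    return logs


if __name__ == "__main__":
    for line in simulate_reconnect_loop(registry_knows_server=False):
        print(line)
```

If this reading is right, any backoff or retry limit keyed on the attempt number would never trigger, because the counter never gets past 1.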