Failover is delayed by waiting for a topology with more than one instance #1046

danielbaniel · 2024-06-26T18:40:00Z

Describe the bug

The bug lies in the code here:

aws-advanced-jdbc-wrapper/wrapper/src/main/java/software/amazon/jdbc/plugin/failover/ClusterAwareWriterFailoverHandler.java

Line 408 in d9a563b

if (topology.size() == 1) {

I don't know the history of this check, but it's problematic in a few situations.

Take a two-instance cluster with instance Foo and instance Bar. Let's say Foo is the writer. Foo crashes and Bar gets promoted to writer. When Bar becomes available, the driver gets stuck in this loop until Foo comes back up as a reader and brings the topology size to two, which may never happen in a bounded time depending on other problems. However, as soon as the driver is connected to Bar it has a writer connection and can complete the failover, so all the additional downtime is unnecessary.

Expected Behavior

I expect the driver to restore availability to clients looking for a writer as soon as it connects to a new writer, regardless of how many readers the rest of the topology has or how healthy they are.

What plugins are used? What other connection properties were set?

aurora-mysql

Current Behavior

When connecting to a two-instance Aurora MySQL cluster and calling the failover-db-cluster API, the driver's failover won't complete until both instances restart (the reader gets promoted and restarts as a writer, and the old writer restarts as a reader). It should complete as soon as the new writer is up.

Reproduction Steps

Create a two-instance MySQL cluster. Connect and send queries with the driver. Trigger failover with the API. Wait for the FailoverSuccessSQLException. Note that it arrives later than the time the new writer comes up, which you can get from the DB CloudWatch logs, for example.
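To put a number on the delay, the wait for the exception can be timed on the client side. The following is only a sketch: `FailoverTimer` and `QueryRunner` are hypothetical names that are not part of the driver, and it assumes that FailoverSuccessSQLException carries SQLState 08S02, which is my reading of the driver's failover documentation. Start the timer right after triggering the failover and compare the result with the writer start time from the logs.

```java
import java.sql.SQLException;
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper, not part of the driver.
final class FailoverTimer {

  /** Hypothetical stand-in for "run one query through the wrapped connection". */
  @FunctionalInterface
  interface QueryRunner {
    void runOnce() throws SQLException;
  }

  // Assumption: the driver reports a successful failover with this SQLState.
  static final String FAILOVER_SUCCESS_SQLSTATE = "08S02";

  /**
   * Runs the query repeatedly and returns the time elapsed until an exception
   * with the failover-success SQLState is seen. Returns empty if no such
   * exception arrives within maxAttempts or the thread is interrupted.
   */
  static Optional<Duration> timeUntilFailoverSuccess(
      QueryRunner runner, long pollMs, int maxAttempts) {
    final Instant start = Instant.now();
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        runner.runOnce();
      } catch (SQLException e) {
        if (FAILOVER_SUCCESS_SQLSTATE.equals(e.getSQLState())) {
          return Optional.of(Duration.between(start, Instant.now()));
        }
        // Any other error: keep polling until the attempt budget runs out.
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return Optional.empty();
      }
    }
    return Optional.empty();
  }
}
```

In a real run, the `QueryRunner` would execute a statement on the driver's connection, for example a `SELECT 1`.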

Possible Solution

No response

Additional Information/Context

No response

The AWS Advanced JDBC Driver version used

latest

JDK version used

11

Operating System and version

osx

ucjonathan · 2024-07-07T11:53:01Z

@danielbaniel I don't use MySQL, but since you pointed out the exact line of problematic code, I believe that statement should be changed to:

if (topology.size() == 1 && getWriter(topology) == null) {

If we have a topology of 1 and there is no writer, then log that message; otherwise, connect to that writer.
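To make the difference between the two checks concrete, here is a small self-contained sketch. It is not the driver's code: `Host`, `TopologyGuardSketch`, and this version of `getWriter` are simplified, hypothetical stand-ins, and only the single-host condition is modeled.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model of the check; none of these types come from the driver.
final class TopologyGuardSketch {

  /** Simplified stand-in for a host entry in the reported topology. */
  static final class Host {
    final String name;
    final boolean writer;

    Host(String name, boolean writer) {
      this.name = name;
      this.writer = writer;
    }
  }

  /** Returns the first host flagged as writer, if any. */
  static Optional<Host> getWriter(List<Host> topology) {
    return topology.stream().filter(h -> h.writer).findFirst();
  }

  /** The check quoted in the issue: keep waiting whenever only one host is reported. */
  static boolean shouldWaitCurrent(List<Host> topology) {
    return topology.size() == 1;
  }

  /** The suggested check: keep waiting only if that single host is not a writer. */
  static boolean shouldWaitSuggested(List<Host> topology) {
    return topology.size() == 1 && !getWriter(topology).isPresent();
  }
}
```

With a topology that contains only Bar as writer, `shouldWaitCurrent` returns true and `shouldWaitSuggested` returns false, which is the Foo/Bar case from the description. With a single reader and no writer, both return true.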

danielbaniel · 2024-07-08T15:25:31Z

Hey @ucjonathan, this issue isn't MySQL-specific and applies to PG too. I filled in the issue incorrectly: I only specified the aurora-mysql plugin in the issue description, but it affects both.

Either way, your suggested fix seems appropriate. As soon as the driver is connected to a writer it should go ahead and serve requests; there is no reason to wait for other instances.

I expect it will apply to MAZ clusters too, not just Aurora. Whatever the context, as soon as you have a writer there's no need to wait for another instance to be up if you're looking for a writer endpoint.

sergiyvamz · 2024-09-19T17:24:02Z

Hi, @danielbaniel @ucjonathan

A new version of the failover plugin was merged recently. It is a reworked and re-architected plugin for cluster failover. In general, the new failover2 plugin shows better stability, and we hope it may solve the issue you reported.

The new plugin is available in the latest snapshot build. Could you kindly check out our snapshot build and let us know if the issue still persists with the new failover2 plugin?

https://github.com/aws/aws-advanced-jdbc-wrapper/blob/main/docs/using-the-jdbc-driver/using-plugins/UsingTheFailover2Plugin.md

https://github.com/aws/aws-advanced-jdbc-wrapper/blob/main/docs/using-the-jdbc-driver/UsingTheJdbcDriver.md#using-a-snapshot-of-the-driver
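For anyone trying this, a minimal configuration sketch follows. It assumes, based on my reading of the linked documentation, that the plugin is enabled by putting the plugin code `failover2` in the `wrapperPlugins` connection property and that the wrapper URL prefix is `jdbc:aws-wrapper:mysql://`. The class name, endpoint, port, and database arguments are placeholders; please check the linked pages for the authoritative settings.

```java
import java.util.Properties;

// Hypothetical helper for building connection settings; not part of the driver.
final class Failover2ConfigSketch {

  /** Builds connection properties that request only the failover2 plugin. */
  static Properties failover2Properties(String user, String password) {
    final Properties props = new Properties();
    props.setProperty("user", user);
    props.setProperty("password", password);
    // Assumption: "wrapperPlugins" takes a comma-separated list of plugin codes.
    props.setProperty("wrapperPlugins", "failover2");
    return props;
  }

  /** Builds a wrapper URL for a MySQL cluster endpoint (placeholder values). */
  static String wrapperUrl(String clusterEndpoint, int port, String database) {
    return "jdbc:aws-wrapper:mysql://" + clusterEndpoint + ":" + port + "/" + database;
  }
}
```

With the snapshot driver on the classpath, the result would be passed to `DriverManager.getConnection(url, props)`. Setting `wrapperPlugins` explicitly presumably replaces the default plugin list, so any other plugins you rely on need to be listed as well.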

Thank you!

danielbaniel added the bug label Jun 26, 2024

crystall-bitquill assigned sergiyvamz Aug 7, 2024
