Contributor

@ggivo ggivo commented Nov 19, 2025

PR Description

Summary

Implements a sliding time window metrics tracker for automatic failover by porting Resilience4j's LockFreeSlidingTimeWindowMetrics.

Problem

The initial POC implementation didn't pre-aggregate metrics, causing performance degradation as the time window increased.
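
To illustrate what pre-aggregation buys, here is a minimal single-threaded sketch. It is not the code from this PR, and the class and method names are hypothetical. It keeps a ring of buckets plus running totals, so reading the totals does not require summing every bucket, and the work done per call depends only on how many buckets have elapsed since the previous call. It assumes non-negative, non-decreasing timestamps.

```java
/** Hypothetical sketch: ring of buckets with running totals (not thread-safe). */
final class PreAggregatedWindow {
    private final long[] successes;
    private final long[] failures;
    private final long bucketMillis;
    private long headEpoch;        // bucket number (time / bucketMillis) of the newest bucket seen
    private long totalSuccesses;   // running total over all live buckets
    private long totalFailures;

    PreAggregatedWindow(int bucketCount, long bucketMillis) {
        this.successes = new long[bucketCount];
        this.failures = new long[bucketCount];
        this.bucketMillis = bucketMillis;
    }

    void record(long nowMillis, boolean success) {
        int slot = advance(nowMillis);
        if (success) {
            successes[slot]++;
            totalSuccesses++;
        } else {
            failures[slot]++;
            totalFailures++;
        }
    }

    long successCount(long nowMillis) {
        advance(nowMillis);
        return totalSuccesses;
    }

    long failureCount(long nowMillis) {
        advance(nowMillis);
        return totalFailures;
    }

    /** Expires buckets that left the window and returns the slot for the current bucket. */
    private int advance(long nowMillis) {
        int n = successes.length;
        long epoch = nowMillis / bucketMillis;
        // At most n slots ever need clearing, however much time has passed.
        long steps = Math.min(epoch - headEpoch, n);
        for (long i = 1; i <= steps; i++) {
            int slot = (int) ((headEpoch + i) % n);
            totalSuccesses -= successes[slot];
            totalFailures -= failures[slot];
            successes[slot] = 0;
            failures[slot] = 0;
        }
        if (epoch > headEpoch) {
            headEpoch = epoch;
        }
        return (int) (epoch % n);
    }
}
```

With three one-second buckets, an event recorded at t=0 ms still counts at t=2999 ms and drops out of the totals at t=3000 ms.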

Solution

  • Ported LockFreeSlidingTimeWindowMetrics from Resilience4j with modifications:
    • Java 8 compatibility: Replaced VarHandle with AtomicReference
    • Stripped down to track only success/failure counts (removed duration and slow call tracking)
    • Lock-free implementation using CAS for thread-safe concurrent access
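
As a rough illustration of the CAS approach on Java 8, the sketch below updates an immutable pair of counters through AtomicReference.compareAndSet. It is a simplified stand-in, not the ported class: the names Counts and CasCounter are hypothetical, and there is no time window here.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Immutable success/failure pair; every update produces a new instance. */
final class Counts {
    final long successes;
    final long failures;

    Counts(long successes, long failures) {
        this.successes = successes;
        this.failures = failures;
    }

    Counts plus(boolean success) {
        return success ? new Counts(successes + 1, failures) : new Counts(successes, failures + 1);
    }
}

/** Hypothetical lock-free counter: retries the update until the CAS wins. */
final class CasCounter {
    private final AtomicReference<Counts> ref = new AtomicReference<>(new Counts(0, 0));

    void record(boolean success) {
        Counts current;
        Counts next;
        do {
            current = ref.get();
            next = current.plus(success);
            // compareAndSet fails if another thread replaced the value in between; then we retry.
        } while (!ref.compareAndSet(current, next));
    }

    /** Both counters come from the same immutable object, so the pair is consistent. */
    Counts snapshot() {
        return ref.get();
    }
}
```

Because both counters live in one immutable object, a reader never sees a success count from one update paired with a failure count from another.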

Changes

  • Replaced LockFreeSlidingWindowMetrics implementation with lock-free Resilience4j port
  • Updated CircuitBreakerMetricsImpl to use new metrics implementation

Bug Fixes:
- Fix: ensure snapshot metrics remain accurate after a full window rotation
- Fix: events recorded exactly at bucket boundaries were miscounted
- Enforce window size % bucket size == 0
- Move LockFreeSlidingWindowMetricsUnitTests to the correct package (io.lettuce.core.failover.metrics)
- Remove snapshotTime: unused and not correctly calculated
- Remove reset metrics: unused as of now
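
The divisibility rule and the boundary handling above can be sketched as follows. This is an assumption about one reasonable way to do it, not the PR's code; WindowGeometry and its members are hypothetical names. Buckets are treated as half-open intervals [start, start + bucketMillis), so an event landing exactly on a boundary belongs to the bucket that starts there.

```java
/** Hypothetical helper: validates the window/bucket sizes and maps timestamps to buckets. */
final class WindowGeometry {
    final long windowMillis;
    final long bucketMillis;
    final int bucketCount;

    WindowGeometry(long windowMillis, long bucketMillis) {
        if (windowMillis <= 0 || bucketMillis <= 0) {
            throw new IllegalArgumentException("window and bucket sizes must be positive");
        }
        if (windowMillis % bucketMillis != 0) {
            // A partial trailing bucket would make the effective window length ambiguous.
            throw new IllegalArgumentException("window size must be a multiple of bucket size");
        }
        this.windowMillis = windowMillis;
        this.bucketMillis = bucketMillis;
        this.bucketCount = (int) (windowMillis / bucketMillis);
    }

    /** Start of the half-open bucket [start, start + bucketMillis) that contains the timestamp. */
    long bucketStart(long timestampMillis) {
        return Math.floorDiv(timestampMillis, bucketMillis) * bucketMillis;
    }
}
```

For a 2000 ms window with 1000 ms buckets, a timestamp of 999 maps to the bucket starting at 0 and a timestamp of 1000 maps to the bucket starting at 1000.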
@ggivo ggivo changed the title from "[automatic failover][lettuce] Implement sliding time window metrics tracker" to "[automatic failover] Implement sliding time window metrics tracker" Nov 19, 2025
Contributor Author

ggivo commented Nov 19, 2025

I believe the failover metrics-related classes should not be part of the publicly supported API.
Should we move them into io.lettuce.core.failover.internal.metrics, or mark them as internal in some other way?
@tishun @atakavci

Copilot finished reviewing on behalf of ggivo November 19, 2025 15:00
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a lock-free sliding time window metrics tracker for automatic failover by porting Resilience4j's LockFreeSlidingTimeWindowMetrics. The implementation replaces the previous POC version that didn't pre-aggregate metrics, which caused performance issues as the time window increased.

Key changes:

  • Ported LockFreeSlidingTimeWindowMetrics from Resilience4j with Java 8 compatibility (using AtomicReferenceFieldUpdater instead of VarHandle)
  • Removed the old LockFreeSlidingWindowMetrics and TimeWindowBucket classes
  • Simplified the API by removing the reset() method from metrics interfaces
  • Converted MetricsSnapshot from a concrete class to an interface with MetricsSnapshotImpl as implementation

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| src/main/java/io/lettuce/core/failover/metrics/LockFreeSlidingTimeWindowMetrics.java | New lock-free sliding window implementation ported from Resilience4j using linked list of time slices |
| src/main/java/io/lettuce/core/failover/metrics/PackedAggregation.java | Cache-friendly measurement implementation for tracking call counts and outcomes |
| src/main/java/io/lettuce/core/failover/metrics/Outcome.java | Simple enum for SUCCESS/FAILURE outcomes |
| src/main/java/io/lettuce/core/failover/metrics/CumulativeMeasurement.java | Interface for measurements that accumulate call outcomes |
| src/main/java/io/lettuce/core/failover/metrics/MeasurementData.java | Interface for accessing measurement data |
| src/main/java/io/lettuce/core/failover/metrics/Clock.java | Clock abstraction for testable time-dependent code |
| src/main/java/io/lettuce/core/failover/metrics/MetricsSnapshot.java | Converted from concrete class to interface |
| src/main/java/io/lettuce/core/failover/metrics/MetricsSnapshotImpl.java | New implementation of MetricsSnapshot interface |
| src/main/java/io/lettuce/core/failover/metrics/CircuitBreakerMetricsImpl.java | Updated to use new LockFreeSlidingTimeWindowMetrics with seconds-based window configuration |
| src/main/java/io/lettuce/core/failover/metrics/CircuitBreakerMetrics.java | Removed reset() method from interface |
| src/main/java/io/lettuce/core/failover/metrics/SlidingWindowMetrics.java | Removed reset() method from interface |
| src/main/java/io/lettuce/core/failover/metrics/LockFreeSlidingWindowMetrics.java | Removed old implementation |
| src/main/java/io/lettuce/core/failover/metrics/TimeWindowBucket.java | Removed old bucket implementation |
| src/test/java/io/lettuce/core/failover/metrics/TestClock.java | New controllable clock implementation for testing |
| src/test/java/io/lettuce/core/failover/metrics/SlidingWindowMetricsUnitTests.java | New comprehensive unit tests for the sliding window implementation |
| src/test/java/io/lettuce/core/failover/metrics/SlidingWindowMetricsPerformanceTests.java | Updated performance tests to use new implementation |
| src/test/java/io/lettuce/core/failover/LockFreeSlidingWindowMetricsUnitTests.java | Removed old unit tests |
| src/test/jmh/io/lettuce/core/failover/metrics/FailoverMetricsBenchmark.java | New JMH benchmark for metrics performance testing |
| src/test/jmh/io/lettuce/core/failover/metrics/JmhMain.java | New JMH test launcher for manual benchmark execution |
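
The clock abstraction listed in the file summary follows a common pattern for testing time-dependent code. The sketch below shows the general idea only. The type names Clock and TestClock come from the file list, but the method names nanoTime() and advance(...) and the SYSTEM constant are assumptions, not the PR's actual signatures.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Time source that production code reads instead of calling System.nanoTime() directly. */
interface Clock {
    long nanoTime(); // assumed method name

    /** Default implementation backed by the JVM's monotonic timer. */
    Clock SYSTEM = System::nanoTime;
}

/** Controllable clock for tests: time only moves when the test advances it. */
final class TestClock implements Clock {
    private final AtomicLong nanos = new AtomicLong();

    @Override
    public long nanoTime() {
        return nanos.get();
    }

    void advance(long amount, TimeUnit unit) {
        nanos.addAndGet(unit.toNanos(amount));
    }
}
```

A test can then step the clock across a bucket boundary or a full window rotation deterministically, without sleeping.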

@atakavci
Collaborator

I believe failover metrics-related classes should not be part of the publicly supported API. Should we move them into io.lettuce.core.failover.internal.metrics or mark them internal in any other way? @tishun @atakavci

Let's make MetricsSnapshot public; the rest can be package-private.
To do that, everything needs to move to the failover package, which is fine with me.
OR
We can leave them as public types and expose only interfaces, rather than concrete instances, from our API.

TBH, I don't like that internals territory at all.

- Fix incorrect Javadoc
- Fix failing benchmark
- CircuitBreakerMetrics, MetricsSnapshot: public
- Metrics implementation details stay inside io.lettuce.core.failover.metrics
- Update CircuitBreaker to obtain its metrics via CircuitBreakerMetricsFactory.createLockFree()
Contributor Author

ggivo commented Nov 20, 2025

@atakavci

Let's make MetricsSnapshot public; the rest can be package-private.

To avoid pulling the metrics classes into the parent failover package, I kept the implementation classes package-private under io.lettuce.core.failover.metrics.
I kept MetricsSnapshot and CircuitBreakerMetrics as public interfaces and introduced a factory for creating the concrete implementations.
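
The layout described above (public interfaces, package-private implementations, a factory as the only entry point) could look roughly like this. The type names CircuitBreakerMetrics, MetricsSnapshot, and CircuitBreakerMetricsFactory.createLockFree() come from this thread. Every method on the interfaces and the PlaceholderMetrics class are assumptions made for illustration, and the placeholder just counts calls instead of using a sliding window.

```java
import java.util.concurrent.atomic.AtomicLong;

// In the described layout these two interfaces would be declared public.
interface MetricsSnapshot {
    long getSuccessCount(); // assumed accessor
    long getFailureCount(); // assumed accessor
}

interface CircuitBreakerMetrics {
    void recordSuccess();   // assumed method
    void recordFailure();   // assumed method
    MetricsSnapshot getSnapshot(); // assumed method
}

/** The only way for callers to obtain an implementation. */
final class CircuitBreakerMetricsFactory {
    private CircuitBreakerMetricsFactory() {
    }

    static CircuitBreakerMetrics createLockFree() {
        return new PlaceholderMetrics();
    }
}

/** Package-private stand-in; the real implementation would be the sliding-window class. */
final class PlaceholderMetrics implements CircuitBreakerMetrics {
    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();

    @Override
    public void recordSuccess() {
        successes.incrementAndGet();
    }

    @Override
    public void recordFailure() {
        failures.incrementAndGet();
    }

    @Override
    public MetricsSnapshot getSnapshot() {
        // The two reads are not atomic as a pair; good enough for a placeholder.
        final long s = successes.get();
        final long f = failures.get();
        return new MetricsSnapshot() {
            @Override
            public long getSuccessCount() {
                return s;
            }

            @Override
            public long getFailureCount() {
                return f;
            }
        };
    }
}
```

Callers depend only on the interfaces, so the implementation can change without touching the supported API.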

@ggivo ggivo requested a review from Copilot November 20, 2025 13:56
Copilot finished reviewing on behalf of ggivo November 20, 2025 14:00
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.


Collaborator

@atakavci atakavci left a comment

Thanks for investing time in performance, @ggivo; this is as good as it gets in the time frame.
I just left a couple of comments. The only interesting one is the one about reset.

/**
 * Reset all metrics to zero.
 */
void reset();
Collaborator

There might be cases where the CB transitions between the OPEN and CLOSED states in quick succession, within the configured window size. Although we will have grace periods and health check cycles, reset() might still be useful when multiple transitions happen close enough together to put the CB behaviour at risk.

Contributor Author

@ggivo ggivo Nov 21, 2025

@atakavci
Yes, I came to the same conclusion while reviewing other PRs.
I was thinking of recreating the Metrics inside CircuitBreaker, but it makes more sense to delegate the reset to the Metrics implementations.
I will work on adding it back!

Contributor Author

@atakavci
Looking at the code, I think it will be easier and safer to implement the reset at the CircuitBreaker level.
E.g., CircuitBreakerMetrics are immutable, and when we call reset inside CircuitBreaker, we create a new instance.
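
A minimal sketch of that idea, assuming the breaker holds its metrics behind an AtomicReference and swaps in a fresh instance on reset. WindowCounters and ResettableBreaker are hypothetical names, and the counters here are plain totals rather than a sliding window.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

/** Stand-in for a metrics object that has no reset() of its own. */
final class WindowCounters {
    final AtomicLong successes = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
}

/** Hypothetical breaker fragment: "reset" means replacing the metrics instance. */
final class ResettableBreaker {
    private final AtomicReference<WindowCounters> metrics =
            new AtomicReference<>(new WindowCounters());

    void recordSuccess() {
        metrics.get().successes.incrementAndGet();
    }

    void recordFailure() {
        metrics.get().failures.incrementAndGet();
    }

    long failureCount() {
        return metrics.get().failures.get();
    }

    /**
     * Swaps in a fresh instance. A recorder that read the old reference just before
     * the swap writes into the discarded instance, so that event is lost.
     */
    void resetMetrics() {
        metrics.set(new WindowCounters());
    }
}
```

The trade-off is that events racing with the swap can be dropped, which may be acceptable if a reset is meant to discard history anyway.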

I am also considering opening a dedicated PR for this, since this one is becoming hard to review.

At the same time, I have started to question whether we need to reset the metrics when transitioning between the OPEN and CLOSED states, or whether we should preserve them, since it is a moving window and calls made during that period are still valid.

Let's discuss.

- Remove CircuitBreakerMetrics, CircuitBreakerMetricsImpl
- Rename SlidingWindowMetrics -> CircuitBreakerMetrics
@ggivo ggivo requested review from Copilot and tishun November 21, 2025 13:15
Copilot finished reviewing on behalf of ggivo November 21, 2025 13:20
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 13 comments.



3 participants