-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[automatic failover] Implement sliding time window metrics tracker #3521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/automatic-failover-1
Are you sure you want to change the base?
[automatic failover] Implement sliding time window metrics tracker #3521
Conversation
Bug Fixes: - Fix: Ensure snapshot metrics remain accurate after a full window rotation - Fix: events recorded exactly at bucket boundaries were miscounted - Enforce window size % bucket size == 0 - Move LockFreeSlidingWindowMetricsUnitTests to correct package (io.lettuce.core.failover.metrics)
- remove snapshotTime - not used & not correctly calcualted - remove reset metrics - unused as of now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a lock-free sliding time window metrics tracker for automatic failover by porting Resilience4j's LockFreeSlidingTimeWindowMetrics. The implementation replaces the previous POC version that didn't pre-aggregate metrics, which caused performance issues as the time window increased.
Key changes:
- Ported
LockFreeSlidingTimeWindowMetricsfrom Resilience4j with Java 8 compatibility (usingAtomicReferenceFieldUpdaterinstead ofVarHandle) - Removed the old
LockFreeSlidingWindowMetricsandTimeWindowBucketclasses - Simplified the API by removing the
reset()method from metrics interfaces - Converted
MetricsSnapshotfrom a concrete class to an interface withMetricsSnapshotImplas implementation
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| src/main/java/io/lettuce/core/failover/metrics/LockFreeSlidingTimeWindowMetrics.java | New lock-free sliding window implementation ported from Resilience4j using linked list of time slices |
| src/main/java/io/lettuce/core/failover/metrics/PackedAggregation.java | Cache-friendly measurement implementation for tracking call counts and outcomes |
| src/main/java/io/lettuce/core/failover/metrics/Outcome.java | Simple enum for SUCCESS/FAILURE outcomes |
| src/main/java/io/lettuce/core/failover/metrics/CumulativeMeasurement.java | Interface for measurements that accumulate call outcomes |
| src/main/java/io/lettuce/core/failover/metrics/MeasurementData.java | Interface for accessing measurement data |
| src/main/java/io/lettuce/core/failover/metrics/Clock.java | Clock abstraction for testable time-dependent code |
| src/main/java/io/lettuce/core/failover/metrics/MetricsSnapshot.java | Converted from concrete class to interface |
| src/main/java/io/lettuce/core/failover/metrics/MetricsSnapshotImpl.java | New implementation of MetricsSnapshot interface |
| src/main/java/io/lettuce/core/failover/metrics/CircuitBreakerMetricsImpl.java | Updated to use new LockFreeSlidingTimeWindowMetrics with seconds-based window configuration |
| src/main/java/io/lettuce/core/failover/metrics/CircuitBreakerMetrics.java | Removed reset() method from interface |
| src/main/java/io/lettuce/core/failover/metrics/SlidingWindowMetrics.java | Removed reset() method from interface |
| src/main/java/io/lettuce/core/failover/metrics/LockFreeSlidingWindowMetrics.java | Removed old implementation |
| src/main/java/io/lettuce/core/failover/metrics/TimeWindowBucket.java | Removed old bucket implementation |
| src/test/java/io/lettuce/core/failover/metrics/TestClock.java | New controllable clock implementation for testing |
| src/test/java/io/lettuce/core/failover/metrics/SlidingWindowMetricsUnitTests.java | New comprehensive unit tests for the sliding window implementation |
| src/test/java/io/lettuce/core/failover/metrics/SlidingWindowMetricsPerformanceTests.java | Updated performance tests to use new implementation |
| src/test/java/io/lettuce/core/failover/LockFreeSlidingWindowMetricsUnitTests.java | Removed old unit tests |
| src/test/jmh/io/lettuce/core/failover/metrics/FailoverMetricsBenchmark.java | New JMH benchmark for metrics performance testing |
| src/test/jmh/io/lettuce/core/failover/metrics/JmhMain.java | New JMH test launcher for manual benchmark execution |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/test/jmh/io/lettuce/core/failover/metrics/FailoverMetricsBenchmark.java
Outdated
Show resolved
Hide resolved
src/test/jmh/io/lettuce/core/failover/metrics/FailoverMetricsBenchmark.java
Outdated
Show resolved
Hide resolved
src/main/java/io/lettuce/core/failover/metrics/CircuitBreakerMetricsImpl.java
Outdated
Show resolved
Hide resolved
src/test/jmh/io/lettuce/core/failover/metrics/FailoverMetricsBenchmark.java
Outdated
Show resolved
Hide resolved
src/main/java/io/lettuce/core/failover/metrics/LockFreeSlidingTimeWindowMetrics.java
Outdated
Show resolved
Hide resolved
Lets make TBH, i dont like that |
- CircuitBreakerMetrics, MetricsSnapshot - public - metrics implementation details stay inside io.lettuce.core.failover.metrics - Update CircuitBreaker to obtain its metrics via CircuitBreakerMetricsFactory.createLockFree()
To avoid including metrics in the parent failover package, I kept the classes package private under **failover. metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/main/java/io/lettuce/core/failover/metrics/PackedAggregation.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <[email protected]>
atakavci
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for investing time on performance @ggivo , this is as good as it gets in the time frame.
Just left a couple of comments.. The only interesting one would be the one with reset.
| /** | ||
| * Reset all metrics to zero. | ||
| */ | ||
| void reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be cases where the CB transitions into OPEN and CLOSED states in a quick fashion which can fall into the configured window size. Though we will have grace periods and health check cycles, this might be useful to make sure when multiple transitions happen close enough to risk the CB behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atakavci
Yes, came to the same conclusion reviewing other PR's,
I was thinking of recreating the Metrics inside CircuiteBreaker, but it makes more sense to delegate to Metrics implementations to handle the reset.
Will work on implementing it back!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atakavci
Looking at the code, I think it will be easier and safer to implement the reset at the CircuitBreaker level.
e.g, CircuitBreakerMetrics are immutable, and when we call reset inside CircuitBreaker, we create a new instance.
Also considering opening a dedicated PR for this, since this one is starting to become hard for review
At the same time started to question if we need to reset the metrics when transitioning between OPEN, and CLOSED states, or if we should preserve them since it is a moving window, and calls made during that period are valid.
Let's discuss.
src/main/java/io/lettuce/core/failover/metrics/CircuitBreakerMetricsFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/io/lettuce/core/failover/metrics/SlidingWindowMetrics.java
Outdated
Show resolved
Hide resolved
- remove CircuitBreakerMetrics, CircuitBreakerMetricsImpl
- rename SlidingWindowMetrics -> CircuitBreakerMetrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
PR Description
Summary
Implements sliding time window metrics tracker for automatic failover by porting Resilience4j's
LockFreeSlidingTimeWindowMetrics.Problem
The initial POC implementation didn't pre-aggregate metrics, causing performance degradation as the time window increased.
Solution
LockFreeSlidingTimeWindowMetricsfrom Resilience4j with modifications:VarHandlewithAtomicReferenceChanges
LockFreeSlidingWindowMetricsimplementation with lock-free reselience4j portCircuitBreakerMetricsImplto use new metrics implementation