-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat: Add memory-bounded percentile calculation support to runtime metrics #26704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's GuideIntroduces optional, memory-bounded percentile (p90/p95/p99) tracking for RuntimeMetric using a fixed-width histogram, wires it into RuntimeStats and enables it for a specific latency metric, along with extensive unit tests and JSON/Thrift serialization support. Sequence diagram for percentile-enabled runtime metric recordingsequenceDiagram
actor Client
participant StageExecutionStateMachine
participant RuntimeStats
participant RuntimeMetric
Client->>StageExecutionStateMachine: recordStartWaitForEventLoop(long nanos)
StageExecutionStateMachine->>RuntimeStats: addMetricValue(TASK_START_WAIT_FOR_EVENT_LOOP, NANO, max(nanos, 0), true)
alt metric does not exist yet
RuntimeStats->>RuntimeStats: computeIfAbsent(name, k -> new RuntimeMetric(name, unit, true))
RuntimeStats->>RuntimeMetric: RuntimeMetric(name, unit, true)
RuntimeMetric->>RuntimeMetric: determineBucketWidth(unit)
RuntimeMetric->>RuntimeMetric: allocate histogramBuckets[numBuckets]
end
RuntimeStats->>RuntimeMetric: addValue(value)
RuntimeMetric->>RuntimeMetric: sum.addAndGet(value)
RuntimeMetric->>RuntimeMetric: count.incrementAndGet()
RuntimeMetric->>RuntimeMetric: max.accumulateAndGet(value)
RuntimeMetric->>RuntimeMetric: min.accumulateAndGet(value)
alt percentileTrackingEnabled
RuntimeMetric->>RuntimeMetric: getBucketIndex(value)
RuntimeMetric->>RuntimeMetric: histogramBuckets.incrementAndGet(bucketIndex)
end
note over Client,RuntimeMetric: Later, when stats are read/serialized
Client->>RuntimeMetric: getP95()
alt percentileTrackingEnabled
RuntimeMetric->>RuntimeMetric: computePercentile(0.95) if p95 is null
RuntimeMetric->>Client: p95 value (approximate)
else tracking disabled
RuntimeMetric-->>Client: null
end
Class diagram for updated RuntimeMetric and RuntimeStats percentile supportclassDiagram
class RuntimeMetric {
<<ThriftStruct>>
- static int DEFAULT_NUM_BUCKETS
- int numBuckets
- long bucketWidth
- String name
- RuntimeUnit unit
- AtomicLong sum
- AtomicLong count
- AtomicLong max
- AtomicLong min
- volatile boolean percentileTrackingEnabled
- volatile AtomicLongArray histogramBuckets
- volatile Long p90
- volatile Long p95
- volatile Long p99
+ RuntimeMetric(String name, RuntimeUnit unit)
+ RuntimeMetric(String name, RuntimeUnit unit, boolean trackPercentiles)
+ RuntimeMetric(String name, RuntimeUnit unit, boolean trackPercentiles, long bucketWidth)
+ RuntimeMetric(String name, RuntimeUnit unit, boolean trackPercentiles, long bucketWidth, int numBuckets)
+ RuntimeMetric(String name, RuntimeUnit unit, long sum, long count, long max, long min)
+ static RuntimeMetric copyOf(RuntimeMetric metric)
+ void set(long sum, long count, long max, long min)
+ void set(RuntimeMetric metric)
+ boolean isPercentileTrackingEnabled()
+ String getName()
+ void addValue(long value)
+ void mergeWith(RuntimeMetric metric)
+ long getSum()
+ long getCount()
+ long getMax()
+ long getMin()
+ RuntimeUnit getUnit()
+ int getNumBuckets()
+ long getBucketWidth()
+ Long getP90()
+ Long getP95()
+ Long getP99()
- static long determineBucketWidth(RuntimeUnit unit)
- int getBucketIndex(long value)
- long computePercentile(double percentile)
- static void checkState(boolean condition, String message)
}
class RuntimeStats {
- Map~String, RuntimeMetric~ metrics
+ void addMetricValue(String name, RuntimeUnit unit, long value)
+ void addMetricValue(String name, RuntimeUnit unit, long value, boolean trackPercentiles)
+ void addMetricValueIgnoreZero(String name, RuntimeUnit unit, long value)
+ Map~String, RuntimeMetric~ getMetrics()
}
class StageExecutionStateMachine {
- RuntimeStats runtimeStats
+ void recordTaskUpdateDeliveredTime(long nanos)
+ void recordStartWaitForEventLoop(long nanos)
+ void recordDeliveredUpdates(int updates)
}
RuntimeStats "1" --> "*" RuntimeMetric : manages
StageExecutionStateMachine "1" --> "1" RuntimeStats : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and found some issues that need to be addressed.
- After JSON deserialization,
percentileTrackingEnabledcan be true whilehistogramBucketsremains null, so subsequentaddValuecalls won’t update percentile state andgetP90/P95/P99will keep returning stale cached values; consider either reconstructing a histogram, disabling percentile tracking on deserialized instances, or clearly separating "cached-only" from "live-tracking" metrics. - The new
addMetricValue(String, RuntimeUnit, long, boolean trackPercentiles)usescomputeIfAbsent, so the first call wins and later calls with a differenttrackPercentilesvalue are silently ignored; if this is not intentional, you may want to enforce consistent configuration or make the flag part of the key/config instead of a per-call parameter. - There is an inconsistency between
set(RuntimeMetric)(which checksnumBucketsbut notbucketWidth) andmergeWith(which checks bothnumBucketsandbucketWidth); consider aligning these preconditions or explicitly documenting whysetallows differing bucket widths whilemergeWithdoesn’t.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- After JSON deserialization, `percentileTrackingEnabled` can be true while `histogramBuckets` remains null, so subsequent `addValue` calls won’t update percentile state and `getP90/P95/P99` will keep returning stale cached values; consider either reconstructing a histogram, disabling percentile tracking on deserialized instances, or clearly separating "cached-only" from "live-tracking" metrics.
- The new `addMetricValue(String, RuntimeUnit, long, boolean trackPercentiles)` uses `computeIfAbsent`, so the first call wins and later calls with a different `trackPercentiles` value are silently ignored; if this is not intentional, you may want to enforce consistent configuration or make the flag part of the key/config instead of a per-call parameter.
- There is an inconsistency between `set(RuntimeMetric)` (which checks `numBuckets` but not `bucketWidth`) and `mergeWith` (which checks both `numBuckets` and `bucketWidth`); consider aligning these preconditions or explicitly documenting why `set` allows differing bucket widths while `mergeWith` doesn’t.
## Individual Comments
### Comment 1
<location> `presto-common/src/main/java/com/facebook/presto/common/RuntimeMetric.java:310-311` </location>
<code_context>
{
requireNonNull(metric, "metric is null");
checkState(unit == metric.getUnit(), "The metric must have the same unit type as the current one.");
+ checkState(numBuckets == metric.numBuckets, "The metric must have the same number of buckets as the current one.");
set(metric.getSum(), metric.getCount(), metric.getMax(), metric.getMin());
+
+ // Copy percentile tracking state
</code_context>
<issue_to_address>
**issue (bug_risk):** Also validate bucketWidth when copying from another metric.
In `set(RuntimeMetric metric)` you only verify `numBuckets`, so it’s possible to copy state from a metric with a different `bucketWidth`, which will corrupt percentile calculations. Add a `checkState(bucketWidth == metric.bucketWidth, ...)` (as in `mergeWith`) before copying the histogram/percentile fields.
</issue_to_address>
### Comment 2
<location> `presto-common/src/main/java/com/facebook/presto/common/RuntimeMetric.java:290-295` </location>
<code_context>
+ // Use provided bucketWidth or auto-configured value if null (for backward compatibility)
+ this.bucketWidth = (bucketWidth != null && bucketWidth > 0) ? bucketWidth : determineBucketWidth(this.unit);
set(sum, count, max, min);
+ // Store cached percentile values from JSON (no histogram needed for deserialization)
+ this.p90 = p90;
+ this.p95 = p95;
+ this.p99 = p99;
+ // If any percentile values present, mark as enabled (even though we don't have histogram)
+ this.percentileTrackingEnabled = (p90 != null || p95 != null || p99 != null);
}
</code_context>
<issue_to_address>
**question (bug_risk):** Merging/deserialized metrics without histograms can permanently lose percentile values.
In the JSON ctor you enable `percentileTrackingEnabled` when any of `p90/p95/p99` are present but don’t reconstruct `histogramBuckets`. After `mergeWith`, the getters clear the cached percentiles and call `computePercentile`, which returns `-1` because `histogramBuckets == null`, so the getters then cache `null` and the percentile data is lost. To avoid this, either keep `percentileTrackingEnabled = false` when no histogram is present and always return the cached deserialized values, or ensure that whenever `percentileTrackingEnabled` is true a histogram is also reconstructed (and encode that in serialization/deserialization).
</issue_to_address>
### Comment 3
<location> `presto-common/src/main/java/com/facebook/presto/common/RuntimeStats.java:98-100` </location>
<code_context>
metrics.computeIfAbsent(name, k -> new RuntimeMetric(name, unit)).addValue(value);
}
+ public void addMetricValue(String name, RuntimeUnit unit, long value, boolean trackPercentiles)
+ {
+ metrics.computeIfAbsent(name, k -> new RuntimeMetric(name, unit, trackPercentiles)).addValue(value);
+ }
+
</code_context>
<issue_to_address>
**question (bug_risk):** Clarify behavior when `trackPercentiles` differs across calls for the same metric.
`computeIfAbsent` only applies `trackPercentiles` on first creation for a given `name`; later calls with a different value reuse the existing metric unchanged. If callers might pass inconsistent values, consider either documenting that the first call “wins” or enforcing consistency (e.g., by checking `isPercentileTrackingEnabled()` and rejecting conflicting usage).
</issue_to_address>
### Comment 4
<location> `presto-common/src/test/java/com/facebook/presto/common/TestRuntimeMetric.java:749-758` </location>
<code_context>
+ }
+
+ @Test
+ public void testSmallCountPercentiles()
+ {
+ RuntimeMetric metric = new RuntimeMetric("latency", NANO, true);
+
+ // Add only 2 values
+ metric.addValue(1_000_000); // 1ms
+ metric.addValue(9_000_000); // 9ms
+
+ assertEquals(metric.getCount(), 2);
+
+ // Percentiles should still work but may not be meaningful
+ Long p90 = metric.getP90();
+ Long p95 = metric.getP95();
+ Long p99 = metric.getP99();
+
+ assertTrue(p90 != null);
+ assertTrue(p95 != null);
+ assertTrue(p99 != null);
+
+ // With 2 values, ceil(0.90 * 2) = 2, so p90 should be the 2nd value
+ // ceil(0.95 * 2) = 2, so p95 should also be the 2nd value
+ // ceil(0.99 * 2) = 2, so p99 should also be the 2nd value
+ }
+
</code_context>
<issue_to_address>
**issue (testing):** Strengthen `testSmallCountPercentiles` by asserting the expected percentile values, not just non-nullity
The test currently only checks that p90/p95/p99 are non-null, while the comments describe specific expected values (each should effectively be the second value). To align the test with those expectations, add assertions that the percentiles are close to the 9 ms value, e.g.:
```java
long expected = 9_000_000L; // 9ms
assertTrue(Math.abs(p90 - expected) <= 1_000_000, "p90 was " + p90);
assertTrue(Math.abs(p95 - expected) <= 1_000_000, "p95 was " + p95);
assertTrue(Math.abs(p99 - expected) <= 1_000_000, "p99 was " + p99);
```
(or similar bounds based on the bucketing). This way the test actually verifies the small-sample percentile behavior.
</issue_to_address>
### Comment 5
<location> `presto-common/src/test/java/com/facebook/presto/common/TestRuntimeMetric.java:144-153` </location>
<code_context>
+ assertEquals(p90 >= 88_000_000 && p90 <= 92_000_000, true);
+ }
+
+ @Test
+ public void testPercentileMergeWith()
+ {
+ RuntimeMetric metric1 = new RuntimeMetric(TEST_METRIC_NAME, NANO, true);
+ RuntimeMetric metric2 = new RuntimeMetric(TEST_METRIC_NAME, NANO, true);
+
+ // Add values with percentile tracking (in nanoseconds)
+ for (int i = 0; i < 50; i++) {
+ metric1.addValue(i * 1_000_000);
+ }
+ for (int i = 50; i < 100; i++) {
+ metric2.addValue(i * 1_000_000);
+ }
+
+ metric1.mergeWith(metric2);
+
+ // Verify merged count
+ assertEquals(metric1.getCount(), 100);
+ assertEquals(metric1.isPercentileTrackingEnabled(), true);
+
+ // Verify percentiles are available after merge
+ assertEquals(metric1.getP90() != null, true);
+ assertEquals(metric1.getP95() != null, true);
+ assertEquals(metric1.getP99() != null, true);
+ }
+
+ @Test
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for JSON backward compatibility when percentile-related fields are absent
The current test only covers round-tripping with percentile fields present. Please also add a test that deserializes JSON in the old schema (name/unit/sum/count/max/min only, no numBuckets/bucketWidth/percentiles) into `RuntimeMetric`, and verifies that:
- those core fields are correctly populated, and
- percentile tracking is effectively disabled (e.g., `getP90()` returns `null`, no exceptions).
This will protect the `@JsonCreator` path and ensure we remain compatible with older serialized data.
Suggested implementation:
```java
import static org.testng.Assert.assertEquals;
import static org.testng.Assert.assertFalse;
import static org.testng.Assert.assertTrue;
import com.fasterxml.jackson.databind.ObjectMapper;
```
I only see a partial view of `TestRuntimeMetric`, so I can’t safely anchor a search/replace around the end of the class to insert the new test method. Please add the following test method near the existing percentile-related tests (for example, close to `testPercentilesDisabledByDefault` and `testPercentileMergeWith`):
```java
@Test
public void testJsonBackwardCompatibilityWithoutPercentiles()
throws Exception
{
// Old schema JSON: no percentile-related fields (numBuckets, bucketWidth, percentiles, etc.)
String json = "{"
+ "\"name\":\"" + TEST_METRIC_NAME + "\","
+ "\"unit\":\"" + NANO + "\","
+ "\"sum\":600,"
+ "\"count\":3,"
+ "\"max\":300,"
+ "\"min\":100"
+ "}";
ObjectMapper mapper = new ObjectMapper();
RuntimeMetric metric = mapper.readValue(json, RuntimeMetric.class);
// Core fields should be correctly populated
assertEquals(metric.getName(), TEST_METRIC_NAME);
assertEquals(metric.getUnit(), NANO);
assertEquals(metric.getSum(), 600.0);
assertEquals(metric.getCount(), 3L);
assertEquals(metric.getMax(), 300.0);
assertEquals(metric.getMin(), 100.0);
// Percentile tracking should effectively be disabled
assertFalse(metric.isPercentileTrackingEnabled());
assertEquals(metric.getP90(), null);
assertEquals(metric.getP95(), null);
assertEquals(metric.getP99(), null);
}
```
If `sum`, `max`, or `min` in `RuntimeMetric` are typed as `long` instead of `double`, change the corresponding expected values from `600.0`, `300.0`, and `100.0` to `600L`, `300L`, and `100L`. Place this test in the same class (`TestRuntimeMetric`) alongside the other `@Test` methods so it exercises the `@JsonCreator` deserialization path with the old JSON schema.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
presto-common/src/test/java/com/facebook/presto/common/TestRuntimeMetric.java
Show resolved
Hide resolved
presto-common/src/main/java/com/facebook/presto/common/RuntimeMetric.java
Show resolved
Hide resolved
4d6f8ae to
8764f08
Compare
Description
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.