
Conversation

@kamalcph kamalcph commented Oct 7, 2025

The broker heap fills up and an OutOfMemoryError is thrown when remote reads
are triggered for multiple partitions within a single FETCH request.

Steps to reproduce:

  1. Start a one-node broker and configure LocalTieredStorage as the remote storage.
  2. Create a topic with 5 partitions.
  3. Produce messages and ensure that a few segments are uploaded to remote storage.
  4. Start a consumer to read from those 5 partitions. Seek the offset to the beginning for 4 partitions
     and to the end for 1 partition, so that the FETCH request reads from both the remote log and the
     local log (see the consumer sketch after this list).
  5. The broker crashes with an OOM error.
  6. The DelayedRemoteFetch / RemoteLogReadResult references are held by the purgatory, which is what
     causes the crash.
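
For reference, here is a minimal consumer sketch for step 4, assuming a 5-partition tiered topic; the topic name (tiered-topic), group id, and bootstrap address are illustrative, not taken from the PR:

    import java.time.Duration
    import java.util.Properties
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.ByteArrayDeserializer

    object RemoteFetchOomRepro {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "remote-fetch-oom-repro")
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)

        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
        val partitions = (0 until 5).map(p => new TopicPartition("tiered-topic", p))
        consumer.assign(partitions.asJava)

        // Seek 4 partitions to the beginning (remote reads) and 1 to the end (local read),
        // so a single FETCH request mixes remote-log and local-log reads.
        consumer.seekToBeginning(partitions.take(4).asJava)
        consumer.seekToEnd(partitions.takeRight(1).asJava)

        while (true) {
          val records = consumer.poll(Duration.ofMillis(500))
          println(s"fetched ${records.count()} records")
        }
      }
    }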

Reviewers: Luke Chen [email protected], Satish Duggana
[email protected]

@github-actions github-actions bot added the core (Kafka Broker), triage (PRs from the community), and small (Small PRs) labels Oct 7, 2025
@kamalcph kamalcph requested a review from showuon October 7, 2025 15:14

@kamalcph kamalcph commented Oct 7, 2025

@satishd @showuon

PTAL, I will cover the patch with a unit test. Thanks!

Shall we also reduce the purgeInterval for the DelayedRemoteFetch purgatory from 1000 to 100? This is to clean up any left-over completed delayed operations from the watchers.

@kamalcph kamalcph requested review from chia7712 and satishd October 7, 2025 15:21
@github-actions github-actions bot removed the small (Small PRs) label Oct 8, 2025

  // create a list of (topic, partition) pairs to use as keys for this delayed fetch operation
- val delayedFetchKeys = remoteFetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }.toList
+ val delayedFetchKeys = remoteFetchTasks.asScala.map { case (tp, _) => new TopicPartitionOperationKey(tp) }.toList

Member

Sorry, I don't get the fix. I'm really surprised this fixes the memory leak. Could you explain why it leaks memory here? I thought we only use the tp in remoteFetchPartitionStatus to create the delayedFetchKeys list, where is the leak? We don't hold the FetchPartitionStatus at all, right?

Contributor Author

Thanks for the review!

  1. While writing the unit test, I realized that this fix does not solve the problem fully.
  2. The remoteFetchPartitionStatus might contain partitions that are local reads, so those keys never complete at all.
  3. The fix reduces the severity of the issue by maintaining keys only for the remote requests actually issued.
  4. The DelayedRemoteFetch operation completes only when all of its keys complete. So, if we are watching 4 keys
     (tp0, tp1, tp2, and tp3), the reference is removed only from the watcher of the key whose completion triggers
     the operation to complete; the watchers for the other keys retain it.
  5. The next FETCH request for the same set of partitions clears the DelayedRemoteFetch references from the
     previous FETCH request.
  6. There would be a leak of one DelayedRemoteFetch object per partition when it transitions from the remote log
     to the local log. We can reduce the purgeInterval (entries) from 1000 to 10 so that the reaper thread can clear
     those completed delayed-operation references (see the simplified sketch after this list).
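
To make points 4–6 concrete, the following is a deliberately simplified toy model of the retention behavior, not the actual Kafka DelayedOperationPurgatory code; the class names and the purge rule are illustrative only:

    // Toy model of the watcher retention described above; NOT the actual Kafka purgatory code.
    import scala.collection.mutable

    final class ToyDelayedOp(val payload: Array[Byte]) {
      @volatile var completed = false
    }

    final class ToyPurgatory(purgeIntervalEntries: Int) {
      // One watcher list per key; a completed operation is dropped only from the
      // watcher list of the key that observed its completion.
      private val watchers = mutable.Map.empty[String, mutable.ListBuffer[ToyDelayedOp]]

      def watch(op: ToyDelayedOp, keys: Seq[String]): Unit =
        keys.foreach(k => watchers.getOrElseUpdate(k, mutable.ListBuffer.empty[ToyDelayedOp]) += op)

      def checkAndComplete(key: String): Unit = {
        watchers.get(key).foreach { ops =>
          ops.foreach(_.completed = true)
          ops.clear() // only this key's watcher list lets go of the operation
        }
        maybePurge()
      }

      private def totalWatched: Int = watchers.values.map(_.size).sum

      // Completed operations lingering in the other watcher lists are dropped only
      // once the number of watched entries exceeds the purge interval.
      private def maybePurge(): Unit =
        if (totalWatched > purgeIntervalEntries)
          watchers.values.foreach(_.filterInPlace(op => !op.completed))
    }

    object LeakIllustration {
      def main(args: Array[String]): Unit = {
        val purgatory = new ToyPurgatory(purgeIntervalEntries = 1000)
        val op = new ToyDelayedOp(new Array[Byte](1024 * 1024)) // ~1 MiB of fetched data
        purgatory.watch(op, Seq("tp0", "tp1", "tp2", "tp3"))
        purgatory.checkAndComplete("tp3") // op is done, but tp0..tp2 still reference it
      }
    }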

Member

Thanks for the explanation! Yes, you're right. If we have purgeInterval = 1000, and each operation has one partition that transitions from remote to local log, each holding a ~1 MiB buffer reference, that means we could leak ~1 GB of memory. I agree that reducing the purgeInterval is a way to resolve the issue. But since each operation could contain more than one key (partition), I want to see if we can have a better solution for that. Let me think about it.

@showuon showuon (Member) Oct 9, 2025

If we go with the solution of reducing the purgeInterval from 1000 to 10, then in the worst case we will keep up to 10 remote fetch results in memory, and each fetch's total record size is bounded by fetch.max.bytes (default 50 MiB), so 500 MiB in total. If we want to reduce the purgeInterval, I'd suggest we set it to 0, so that we make sure all completed operations are removed. The trade-off is that if the same key is added again soon (e.g. the same remote fetch partition comes in), an entry needs to be re-created in the concurrent map under a lock.

Another solution I came up with is to add a lastModifiedMs to the WatcherList (we currently have 512 WatcherLists). When we advanceClock, we would not only check the purgeInterval but also purge when a WatcherList has had no update after a timeout (e.g. 1 second?). This keeps the active watchers in the cache and avoids the need to recreate an entry in the concurrent map.

WDYT?
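
For illustration only, a hypothetical sketch of the lastModifiedMs idea; this is not the actual DelayedOperationPurgatory code, and the names, signatures, and thresholds are assumptions:

    // Hypothetical sketch only; NOT the actual DelayedOperationPurgatory code.
    object LastModifiedPurgeSketch {
      final class WatcherListSketch {
        @volatile var lastModifiedMs: Long = System.currentTimeMillis()
        def touch(now: Long): Unit = lastModifiedMs = now
      }

      // An advanceClock-style check would purge either when enough completed operations
      // have piled up (the existing purgeInterval check) or when a watcher list has been
      // idle longer than some timeout.
      def shouldPurge(list: WatcherListSketch,
                      estimatedPurgeableOps: Int,
                      purgeInterval: Int,
                      idleTimeoutMs: Long,
                      now: Long): Boolean =
        estimatedPurgeableOps > purgeInterval || (now - list.lastModifiedMs) > idleTimeoutMs
    }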

@kamalcph kamalcph (Contributor, Author) Oct 9, 2025

> is that we added a lastModifiedMs in WatcherList

We can add lastModifiedMs to the WatcherList. But multiple keys can map to the same WatcherList, and the lastModifiedMs gets updated on each invocation, so it may not fully address the problem. If the timeout is kept low, like 1 second, then the chance of a leak goes down drastically.

The new lastModifiedMs attribute can be introduced in a separate PR if required, since DelayedOperationPurgatory is used by multiple components (to make a revert easier if something goes wrong).

> it needs to re-create an entry in a concurrentMap lock.

The purgeInterval can be reduced to 0. watchForOperation already takes a lock, and the trade-off of re-creating an entry looks fine to me since remote operations are not expected to be too frequent. If this is fine, then I'll reduce the purgeInterval to 0.

Member

If we agree to reduce the purgeInterval to 0, then the lastModifiedMs is not needed anymore. @satishd @chia7712, do you think reducing the purgeInterval to 0 will cause any unexpected results?

Member

The reaper thread is invoked every 200 ms, so with a purge interval of 0 the check for purgeable entries runs every 200 ms: it goes through all the registered entries and removes the completed operations. We can go ahead with this change for now and handle any required follow-ups in a follow-up PR if needed.

@github-actions github-actions bot removed the triage (PRs from the community) label Oct 9, 2025

@danish-ali danish-ali left a comment

The switch from remoteFetchPartitionStatus → remoteFetchTasks.asScala.map { tp … } makes sense since it limits keys to actual remote requests.
Could we (1) add a targeted test that forces a partition to transition remote → local mid-fetch and asserts that the purgatory entry is removed (or reaped) after completion, and (2) document in code that we only guarantee eventual cleanup (via next FETCH or reaper)?
A short comment where delayedFetchKeys is built would help future readers.

@kamalcph

Addressed the review comments. PTAL.

@showuon Tried setting remoteFetchReaperEnabled to true in testRemoteLogReaderMetrics; the test fails to assert that RemoteLogReaderFetchRateAndTimeMs is 5. Seems to be a metrics issue. Could you take a look? Thanks!

Tried changing remote.fetch.max.wait.ms to 5s in testRemoteLogReaderMetrics; it didn't help:

  props.put(RemoteLogManagerConfig.REMOTE_FETCH_MAX_WAIT_MS_PROP, 5000.toString)
  ...
  val replicaManager = setupReplicaManagerWithMockedPurgatories(new MockTimer(time), aliveBrokerIds = Seq(0, 1, 2),
    propsModifier = props => props.put(RemoteLogManagerConfig.REMOTE_FETCH_MAX_WAIT_MS_PROP, 5000.toString),
    enableRemoteStorage = true, shouldMockLog = true, remoteLogManager = Some(spyRLM), remoteFetchReaperEnabled = true)

@kamalcph

Fixed the testRemoteLogReaderMetrics. PTAL. Thanks!

@showuon showuon (Member) left a comment

Thanks for adding tests.

@kamalcph

@satishd @chia7712

Call for review. PTAL. Thanks!

@showuon showuon (Member) left a comment

LGTM! Thanks for finding and fixing the bug!

@satishd satishd (Member) left a comment

Thanks @kamalcph for identifying the issue, and for our offline discussion on fixing it. Overall LGTM.

      readResult.info.delayedRemoteStorageFetch.get.fetchMaxBytes else recordBatchSize
    // Once we read from a non-empty partition, we stop ignoring request and partition level size limits
    if (estimatedRecordBatchSize > 0)
      minOneMessage = false

Member

Let us keep this PR only for the memory leak issue in remote reads. Please open another PR for the minOneMessage scenario.

Contributor Author

Opened #20706 to fix the minOneMessage case.

@showuon
One additional change in #20706 is that remoteFetchInfos is changed from a HashMap to a LinkedHashMap to preserve the order of partitions when sending the remote fetches. PTAL.

Thanks for the review!
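
As a side note on the HashMap → LinkedHashMap change mentioned above, here is a small standalone illustration (not the PR code) of the ordering difference:

    // Illustrative only; not the PR code. HashMap iteration order is unspecified,
    // while LinkedHashMap preserves insertion order, so remote fetches can be issued
    // in the order the partitions were added to remoteFetchInfos.
    import java.util.{HashMap => JHashMap, LinkedHashMap => JLinkedHashMap}

    object FetchOrderIllustration {
      def main(args: Array[String]): Unit = {
        val unordered = new JHashMap[String, Int]()
        val ordered = new JLinkedHashMap[String, Int]()
        for ((tp, i) <- Seq("topic-3", "topic-0", "topic-2", "topic-1").zipWithIndex) {
          unordered.put(tp, i)
          ordered.put(tp, i)
        }
        println(s"HashMap order:       ${unordered.keySet()}")  // unspecified order
        println(s"LinkedHashMap order: ${ordered.keySet()}")    // insertion order preserved
      }
    }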

@satishd satishd (Member) left a comment

Thanks @kamalcph for addressing the review comments. LGTM.

@kamalcph kamalcph merged commit c58cf1d into apache:trunk Oct 15, 2025
25 checks passed
@kamalcph kamalcph deleted the KAFKA-19763 branch October 15, 2025 16:52