[server] Add negative cache for non-existent partition IDs in metadata requests #3511

Open
swuferhong wants to merge 1 commit into apache:main from swuferhong:partition-not-exists-cache

@swuferhong

Purpose

Linked issue: close #3510

During hourly partition rotation, clients holding stale partition IDs repeatedly trigger
ZooKeeper lookups that always return "not found". This adds unnecessary pressure on ZK.
The negative cache eliminates these redundant queries by remembering the "not found" result
for a configurable TTL period.

Brief change log

Tests

API and Format

Documentation

@swuferhong swuferhong requested a review from loserwang1024 June 23, 2026 06:14

@loserwang1024 loserwang1024 left a comment

I have left some comments.

*/
private static final long CLEANUP_INTERVAL_MS = 60_000;

private final ConcurrentHashMap<Long, Long> cache;

Why not use Guava Cache instead of implementing it yourself?

  1. Unreliable expiration cleanup — maybeCleanup() is only called inside markNonExistent(). If no new partitions are marked as non-existent for a long time (in practice, markNonExistent is only triggered after a ZK query confirms non-existence), expired entries will linger in memory indefinitely. While isKnownNonExistent() lazily removes individual expired entries, it never performs a full sweep.
  2. The project already has shaded Guava, so you can use it directly:
import java.util.concurrent.TimeUnit;

import org.apache.fluss.shaded.guava32.com.google.common.cache.Cache;
import org.apache.fluss.shaded.guava32.com.google.common.cache.CacheBuilder;

private final Cache<Long, Boolean> negativeCache = CacheBuilder.newBuilder()
    .expireAfterAccess(10, TimeUnit.MINUTES)
    .maximumSize(10000)  // prevent OOM in extreme cases
    .build();
  3. Advantages of Guava Cache:
  • Automatic eviction: access-time-based TTL without manual maintenance
  • Bounded size: maximumSize prevents unbounded growth (the current implementation has no size limit)
  • Battle-tested thread safety: no need to roll your own CAS logic

@wuchong, WDYT?
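The cleanup concern in point 1 can be illustrated with a small sketch. This is a hypothetical reconstruction based only on the behavior described in the comment (a full sweep triggered from `markNonExistent()`, lazy per-entry removal in `isKnownNonExistent()`); it is not the code under review. The class name, the constructor parameters, the `size()` helper, and the injectable clock are assumptions added so that expiry can be simulated.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical reconstruction for illustration only; not the PR's implementation. */
final class WriteTriggeredNegativeCache {
    // partitionId -> absolute expiry timestamp in milliseconds
    private final ConcurrentHashMap<Long, Long> expiryByPartitionId = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final long cleanupIntervalMs;
    private final LongSupplier clockMs; // assumption: injectable clock so tests can advance time
    private long lastCleanupMs;

    WriteTriggeredNegativeCache(long ttlMs, long cleanupIntervalMs, LongSupplier clockMs) {
        this.ttlMs = ttlMs;
        this.cleanupIntervalMs = cleanupIntervalMs;
        this.clockMs = clockMs;
        this.lastCleanupMs = clockMs.getAsLong();
    }

    /** Records a "not found" result; this is the only place a full sweep can start. */
    void markNonExistent(long partitionId) {
        long now = clockMs.getAsLong();
        expiryByPartitionId.put(partitionId, now + ttlMs);
        maybeCleanup(now);
    }

    /** Returns true while the entry is live; lazily removes only the entry that was asked about. */
    boolean isKnownNonExistent(long partitionId) {
        Long expiresAt = expiryByPartitionId.get(partitionId);
        if (expiresAt == null) {
            return false;
        }
        if (clockMs.getAsLong() >= expiresAt) {
            expiryByPartitionId.remove(partitionId, expiresAt);
            return false;
        }
        return true;
    }

    /** Number of entries currently held, including expired ones that were never swept. */
    int size() {
        return expiryByPartitionId.size();
    }

    private synchronized void maybeCleanup(long now) {
        if (now - lastCleanupMs < cleanupIntervalMs) {
            return;
        }
        lastCleanupMs = now;
        expiryByPartitionId.entrySet().removeIf(e -> now >= e.getValue());
    }
}
```

With this shape, entries that are never queried again stay in the map after their TTL has passed until the next `markNonExistent()` call happens to trigger a sweep. Separately, if Guava is adopted, the choice between `expireAfterAccess` and `expireAfterWrite` is worth a deliberate decision: with access-based expiry each read resets the timer, so an ID that clients keep requesting would not expire.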

long[] partitionIds = request.getPartitionsIds();
List<Long> partitionIdsNotExistsInCache = new ArrayList<>();
for (long partitionId : partitionIds) {
    // Fast-path: throw immediately for partition IDs known to not exist,

I think the negative cache check should NOT be placed before the metadata cache lookup.

Problem scenario — a race condition exists:

  1. A new partition is being created; its ID has been assigned.
  2. A client sends a metadata request, but the server's metadata cache hasn't synced yet, and ZK may not have the entry written yet either.
  3. Server queries ZK, finds nothing → calls markNonExistent(partitionId).
  4. Partition creation completes; metadata cache is updated via ZK watch.
  5. Subsequent requests are blocked by the negative cache and can never retrieve the now-existing partition.

Suggested fix: Move the negative cache check to after the metadata cache lookup but before the ZK query:
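The ordering described above can be sketched roughly as follows. Everything here is illustrative: `PartitionLookupSketch`, `resolve`, the `String` stand-in for partition metadata, the `LongFunction` stand-in for the ZooKeeper query, and the use of `IllegalStateException` in place of the project's own exception are all assumptions, and TTL handling is left out to keep the ordering visible.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

/** Hypothetical sketch of the suggested lookup order; all names are illustrative. */
final class PartitionLookupSketch {
    private final Map<Long, String> metadataCache;             // stand-in for the server metadata cache
    private final Set<Long> negativeCache = ConcurrentHashMap.newKeySet(); // TTL omitted for brevity
    private final LongFunction<Optional<String>> zkLookup;     // stand-in for the ZooKeeper query

    PartitionLookupSketch(Map<Long, String> metadataCache,
                          LongFunction<Optional<String>> zkLookup) {
        this.metadataCache = metadataCache;
        this.zkLookup = zkLookup;
    }

    String resolve(long partitionId) {
        // 1. Metadata cache first: a partition that now exists always wins.
        String cached = metadataCache.get(partitionId);
        if (cached != null) {
            negativeCache.remove(partitionId); // drop any stale "not found" entry
            return cached;
        }
        // 2. Negative cache second: skip ZooKeeper for IDs recently confirmed missing.
        if (negativeCache.contains(partitionId)) {
            throw new IllegalStateException("Partition " + partitionId + " does not exist");
        }
        // 3. ZooKeeper last: remember a miss so the next request stops at step 2.
        Optional<String> fromZk = zkLookup.apply(partitionId);
        if (fromZk.isPresent()) {
            return fromZk.get();
        }
        negativeCache.add(partitionId);
        throw new IllegalStateException("Partition " + partitionId + " does not exist");
    }
}
```

In this arrangement a stale negative entry can no longer hide a partition once the metadata cache knows about it. A request can still be rejected in the window between the partition appearing in ZooKeeper and the metadata cache catching up, so the negative entry's TTL bounds how long that window can last.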

Development

Successfully merging this pull request may close these issues.

[server] Repeated metadata requests for deleted partitions cause unnecessary ZooKeeper pressure
