
Improve performance of segment metadata cache on Overlord #17785

Merged

merged 16 commits into apache:master from mark_unused_via_cache on Mar 26, 2025

Conversation

@kfaraz (Contributor) commented Mar 8, 2025

Description

#17653 introduced a cache for segment metadata on the Overlord.
This patch is a follow-up that makes the cache more robust, performant, and debug-friendly.

Changes

  • Do not cache unused segments.
    This significantly reduces sync time in clusters that have a lot of unused segments.
    Unused segments are needed only during segment allocation, to ensure that a duplicate ID is not allocated.
    That is a rare DB query, backed by adequate indexes, and thus need not be served from the cache at the moment.
  • Update the cache directly when segments are marked as unused, to avoid race conditions with the DB sync.
  • Fix an NPE when using the segment metadata cache with concurrent locks.
  • Atomically update segment IDs and pending segments in a HeapMemoryDatasourceSegmentCache
    using the methods syncSegmentIds() and syncPendingSegments(), rather than updating entries one by one.
    This ensures that locks are held for a shorter period and that each cache update is atomic (see the sketch after this list).
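To make the last point concrete, the batch-update pattern might look roughly like the sketch below. Only syncSegmentIds() is a method name taken from this PR; the record type, the fields, and the staleness check are illustrative assumptions rather than the actual Druid implementation.

import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the batch-update idea; only syncSegmentIds() is a real method
// name from this PR, everything else is a hypothetical stand-in.
class DatasourceSegmentCacheSketch
{
  static class SegmentRecord
  {
    final Instant updatedTime;

    SegmentRecord(Instant updatedTime)
    {
      this.updatedTime = updatedTime;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, SegmentRecord> idToRecord = new HashMap<>();

  /**
   * Reconciles the cached segment IDs with the persisted ones in a single
   * critical section and returns the IDs whose payloads must be refreshed.
   */
  Set<String> syncSegmentIds(Map<String, SegmentRecord> persistedRecords)
  {
    lock.writeLock().lock();
    try {
      // Evict entries that are no longer present in the metadata store
      idToRecord.keySet().retainAll(persistedRecords.keySet());

      // Collect IDs that are new or have been updated since they were cached
      Set<String> idsToRefresh = new HashSet<>();
      persistedRecords.forEach((id, record) -> {
        SegmentRecord cached = idToRecord.get(id);
        if (cached == null || record.updatedTime.isAfter(cached.updatedTime)) {
          idsToRefresh.add(id);
        }
      });
      return idsToRefresh;
    }
    finally {
      lock.writeLock().unlock();
    }
  }
}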

Classes to review

  • IndexerMetadataStorageCoordinator
  • OverlordDataSourcesResource
  • HeapMemorySegmentMetadataCache
  • HeapMemoryDatasourceSegmentCache

Cleaner cache sync

In every sync, the following steps are performed for each datasource (sketched in code after this list):

  • Retrieve ALL used segment IDs from the metadata store.
  • Atomically update the segment IDs in the cache and determine the list of segment IDs that need to be refreshed.
  • Fetch the payloads of the segments that need to be refreshed.
  • Atomically update the fetched payloads in the cache.
  • Fetch ALL pending segments.
  • Atomically update the pending segments in the cache.
  • Clean up empty intervals from the datasource caches.
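A minimal sketch of one such sync pass follows, assuming hypothetical store and cache interfaces. Only syncSegmentIds() and syncPendingSegments() are method names from this PR; everything else is illustrative.

import java.util.List;
import java.util.Map;
import java.util.Set;

// All names below are hypothetical stand-ins for the Overlord internals,
// except syncSegmentIds() and syncPendingSegments(), which this PR adds.
interface MetadataStoreSketch
{
  Map<String, String> retrieveUsedSegmentIds(String dataSource);   // ID -> last updated time
  List<String> retrieveSegmentPayloads(Set<String> segmentIds);
  List<String> retrieveAllPendingSegments(String dataSource);
}

interface DatasourceCacheSketch
{
  Set<String> syncSegmentIds(Map<String, String> persistedIds);    // returns IDs to refresh
  void updateSegmentPayloads(List<String> payloads);
  void syncPendingSegments(List<String> pendingSegments);
  void removeEmptyIntervals();
}

class SyncPassSketch
{
  void syncDatasource(MetadataStoreSketch store, DatasourceCacheSketch cache, String dataSource)
  {
    // Steps 1-2: reconcile the full set of used segment IDs in one critical section
    Set<String> idsToRefresh = cache.syncSegmentIds(store.retrieveUsedSegmentIds(dataSource));

    // Steps 3-4: fetch and install payloads only for new or updated segments
    cache.updateSegmentPayloads(store.retrieveSegmentPayloads(idsToRefresh));

    // Steps 5-6: replace the pending segments atomically
    cache.syncPendingSegments(store.retrieveAllPendingSegments(dataSource));

    // Step 7: drop intervals that no longer contain any segment
    cache.removeEmptyIntervals();
  }
}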

Testing

Verified the changes in this patch on the following clusters:

                     Test cluster               Staging cluster
Metadata store       MySQL                      MySQL
Master node          m5.large (2 vCPUs, 8 GB)   m5.large (2 vCPUs, 8 GB)
Used segments        600k                       120k
Unused segments      180k                       1.5M
Pending segments     15k                        1k
Old sync time        2500s                      50-150s
New full sync time   850s                       25s
New delta sync time  150s                       5s

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


@kfaraz kfaraz changed the title Use IndexerMetadataStorageCoordinator on Overlord to mark segments as unused Update cache on Overlord directly when marking segments as unused Mar 9, 2025
@kgyrtkirk kgyrtkirk closed this Mar 13, 2025
@kgyrtkirk kgyrtkirk reopened this Mar 13, 2025
kfaraz added 5 commits March 14, 2025 17:56
@kfaraz kfaraz changed the title Update cache on Overlord directly when marking segments as unused Improve performance of segment metadata cache on Overlord Mar 17, 2025
kfaraz added 4 commits March 24, 2025 15:31
@@ -66,19 +67,22 @@ public class OverlordDataSourcesResource
private static final Logger log = new Logger(OverlordDataSourcesResource.class);

private final SegmentsMetadataManager segmentsMetadataManager;
private final IndexerMetadataStorageCoordinator metadataStorageCoordinator;
@gianm (Contributor) commented:

I just realized that SegmentsMetadataManager is the Coordinator version of a metadata cache. Do you think we're moving towards getting rid of it completely on the Overlord side of things, in favor of using only IndexerMetadataStorageCoordinator?

It looks like after this patch, SegmentsMetadataManager is still used in two places on the OL:

  • this class, for markAsUsedNonOvershadowedSegmentsInInterval, markAsUsedNonOvershadowedSegments, and markSegmentAsUsed;
  • OverlordCompactionScheduler, for getSnapshotOfDataSourcesWithAllUsedSegments. To me it seems like this could already be replaced by IndexerSQLMetadataStorageCoordinator#retrieveAllUsedSegments (although the SegmentsMetadataManager version would perform better, since it caches the snapshot while the IndexerSQLMetadataStorageCoordinator version needs to build a timeline; I am not sure how much this matters).

@kfaraz (Contributor, Author) replied:

Yes, @gianm, I plan to get rid of SqlSegmentsMetadataManager from the Overlord completely.
The Coordinator will continue to use it for the time being, but eventually we might get rid of it altogether.

The usage of SqlSegmentsMetadataManager by OverlordCompactionScheduler does matter for performance, for the timeline and caching reasons you guessed, but I should be able to resolve that soon.

@@ -167,4 +169,24 @@ public static SegmentId getValidSegmentId(String dataSource, String serializedSe
return parsedSegmentId;
}
}

/**
* Tries to parse the given serialized ID as {@link SegmentId}s of the given
@gianm (Contributor) commented:

This method seems suspicious to me, since it encourages ignoring seemingly invalid segment IDs. Where are the places where we encounter segment IDs that may not be valid? Is it OK to ignore them?

@kfaraz (Contributor, Author) replied:

Yeah, I guess it would be better to error out on invalid segment IDs.
The only place these can come from is the REST APIs.

@kfaraz (Contributor, Author):

Updated.

}
}
catch (Exception e) {
log.noStackTrace().error(
@gianm (Contributor) commented:

When might this happen? Does it mean we will ignore segments that we can't read? That would be worrying, because Exception is very broad. Is there some kind of scenario you have in mind where a failure is possible?

Also, .noStackTrace().error is not a combination I expect to see. Typically error is reserved for serious problems, and for serious problems we should always have a stack trace.

@kfaraz (Contributor, Author) commented Mar 25, 2025:

> When might this happen? Does it mean we will ignore segments that we can't read? That would be worrying, because Exception is very broad. Is there some kind of scenario you have in mind where a failure is possible?

This can happen in the rare case when the segment payload has been tampered with or some other column is not parseable. It is not frequent, but it can happen; I only recently encountered this in a prod DB.

We do not throw an error here so that processing can continue with the rest of the segments.
Even though the segment is ignored here, the calling code increments a skippedCount, and we raise an alert if skippedCount > 0. The log here just lets an operator go through the Overlord logs and see which segment IDs failed and why.

Do you think it would be better to create an alert for each failing segment ID separately (similar to SqlSegmentsMetadataManager)? I felt it would be too noisy, so I alerted on the total count instead.
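For illustration only, the skip-and-count pattern described above could look like the following sketch; none of these names are from the actual Druid code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the skip-and-count pattern; all names are stand-ins.
class SkipAndCountSketch
{
  static List<String> parseAll(List<byte[]> payloads)
  {
    List<String> parsed = new ArrayList<>();
    int skippedCount = 0;

    for (byte[] payload : payloads) {
      try {
        parsed.add(parsePayload(payload));
      }
      catch (Exception e) {
        // Log (with stack trace) so an operator can identify the bad row, but keep going
        e.printStackTrace();
        skippedCount++;
      }
    }

    if (skippedCount > 0) {
      // A single alert for the whole batch instead of one alert per bad segment
      raiseAlert("Skipped [" + skippedCount + "] unparseable segments during sync");
    }
    return parsed;
  }

  static String parsePayload(byte[] payload)
  {
    return new String(payload);
  }

  static void raiseAlert(String message)
  {
    System.err.println("ALERT: " + message);
  }
}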

@kfaraz (Contributor, Author):

Also, .noStackTrace().error is not a combination I expect to see. Typically error is reserved for serious problems, and for serious problems we should always have a stack trace.

Yeah, I see your point. I will include the stack trace here.

@kfaraz (Contributor, Author):

Updated.

@gianm (Contributor) replied:

> Do you think it would be better to create an alert for each failing segment ID separately (similar to SqlSegmentsMetadataManager)? I felt it would be too noisy, so I alerted on the total count instead.

I think the way you did it here is OK. It's good that there is an alert, and IMO an alert with the number of bad segments is fine.

@kgyrtkirk kgyrtkirk added this to the 33.0.0 milestone Mar 26, 2025
kfaraz added 2 commits March 26, 2025 14:21
@kfaraz (Contributor, Author) commented Mar 26, 2025

Verified the changes in this patch on the following clusters:

                     Test cluster               Staging cluster
Metadata store       MySQL                      MySQL
Master node          m5.large (2 vCPUs, 8 GB)   m5.large (2 vCPUs, 8 GB)
Used segments        600k                       120k
Unused segments      180k                       1.5M
Pending segments     15k                        1k
Old sync time        2500s                      50-150s
New full sync time   850s                       25s
New delta sync time  150s                       5s

@cryptoe (Contributor) commented Mar 26, 2025

@kfaraz
It looks like the new full sync time goes from 25 seconds for 120k used segments to 850 seconds for 600k used segments. Could you provide some clarification around the non-linear time increase?

Also, could you provide some description of what the delta sync is?

@kfaraz (Contributor, Author) commented Mar 26, 2025

@cryptoe, those are two separate clusters, and the results are not really comparable between the two since the underlying metadata store is different.

The comparison should be made between the old sync time and the new sync time for any given cluster.

Also, the 850s for 600k segments is a little pessimistic, since a real production cluster with, say, 1M segments would have a much beefier metadata store and master node. Based on what I have seen with the metric segment/poll/time on production clusters with 2M segments or more, a full sync would take around 3 minutes and a delta sync would take much less than that (exactly how much less could vary).

Full sync: fetch the payloads of ALL used segments.
Delta sync: fetch all used segment IDs, then fetch payloads only for new or updated segments (see the sketch below).
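A minimal sketch of the difference, assuming the cache tracks a last-known update time per segment ID; all names here are hypothetical.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch contrasting the two sync modes; all names are illustrative.
class SyncModeSketch
{
  // Last-known update time (epoch millis) per segment ID, maintained by the cache
  private final Map<String, Long> cachedUpdateTimes = new HashMap<>();

  // Full sync: fetch the payload of every used segment, regardless of cache state
  Set<String> idsToFetchOnFullSync(Map<String, Long> dbUpdateTimes)
  {
    return new HashSet<>(dbUpdateTimes.keySet());
  }

  // Delta sync: fetch payloads only for IDs that are new or updated since the last sync
  Set<String> idsToFetchOnDeltaSync(Map<String, Long> dbUpdateTimes)
  {
    Set<String> idsToFetch = new HashSet<>();
    dbUpdateTimes.forEach((id, updatedAt) -> {
      Long cachedAt = cachedUpdateTimes.get(id);
      if (cachedAt == null || updatedAt > cachedAt) {
        idsToFetch.add(id);
      }
    });
    return idsToFetch;
  }
}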

@gianm (Contributor) left a comment:

LGTM

@kfaraz (Contributor, Author) commented Mar 26, 2025

Thanks for the review, @gianm!

@kfaraz kfaraz merged commit c0cc27c into apache:master Mar 26, 2025
75 checks passed
@kfaraz kfaraz deleted the mark_unused_via_cache branch March 26, 2025 23:38