
Improve performance of segment metadata cache on Overlord #17785

Merged

merged 16 commits into apache:master from mark_unused_via_cache on Mar 26, 2025

Conversation

@kfaraz (Contributor) commented Mar 8, 2025

Description

#17653 introduced a cache for segment metadata on the Overlord.
This patch is a follow-up that makes the cache more robust, performant, and debug-friendly.

Changes

  • Do not cache unused segments.
    This significantly reduces sync time in clusters that have a lot of unused segments.
    Unused segments are needed only during segment allocation, to ensure that a duplicate ID is not allocated.
    That is a rare DB query, backed by adequate indexes, and thus need not be served from the cache at the moment.
  • Update the cache directly when segments are marked as unused, to avoid race conditions with the DB sync.
  • Fix an NPE when using the segment metadata cache with concurrent locks.
  • Atomically update segment IDs and pending segments in a HeapMemoryDatasourceSegmentCache
    using the methods syncSegmentIds() and syncPendingSegments(), rather than updating entries one by one.
    This ensures that locks are held for a shorter period and that each cache update is atomic (see the sketch after this list).
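To make the last point concrete, the batch-update pattern might look roughly like the sketch below. Only syncSegmentIds() is a method name taken from this PR; the record type, the fields, and the staleness check are illustrative assumptions rather than the actual Druid implementation.

import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the batch-update idea; only syncSegmentIds() is a real method
// name from this PR, everything else is a hypothetical stand-in.
class DatasourceSegmentCacheSketch
{
  static class SegmentRecord
  {
    final Instant updatedTime;

    SegmentRecord(Instant updatedTime)
    {
      this.updatedTime = updatedTime;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, SegmentRecord> idToRecord = new HashMap<>();

  /**
   * Reconciles the cached segment IDs with the persisted ones in a single
   * critical section and returns the IDs whose payloads must be refreshed.
   */
  Set<String> syncSegmentIds(Map<String, SegmentRecord> persistedRecords)
  {
    lock.writeLock().lock();
    try {
      // Evict entries that are no longer present in the metadata store
      idToRecord.keySet().retainAll(persistedRecords.keySet());

      // Collect IDs that are new or have been updated since they were cached
      Set<String> idsToRefresh = new HashSet<>();
      persistedRecords.forEach((id, record) -> {
        SegmentRecord cached = idToRecord.get(id);
        if (cached == null || record.updatedTime.isAfter(cached.updatedTime)) {
          idsToRefresh.add(id);
        }
      });
      return idsToRefresh;
    }
    finally {
      lock.writeLock().unlock();
    }
  }
}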

Classes to review

  • IndexerMetadataStorageCoordinator
  • OverlordDataSourcesResource
  • HeapMemorySegmentMetadataCache
  • HeapMemoryDatasourceSegmentCache

Cleaner cache sync

In every sync, the following steps are performed for each datasource (sketched in code after this list):

  • Retrieve ALL used segment IDs from the metadata store.
  • Atomically update the segment IDs in the cache and determine the list of segment IDs that need to be refreshed.
  • Fetch the payloads of the segments that need to be refreshed.
  • Atomically update the fetched payloads in the cache.
  • Fetch ALL pending segments.
  • Atomically update the pending segments in the cache.
  • Clean up empty intervals from the datasource caches.
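A minimal sketch of one such sync pass follows, assuming hypothetical store and cache interfaces. Only syncSegmentIds() and syncPendingSegments() are method names from this PR; everything else is illustrative.

import java.util.List;
import java.util.Map;
import java.util.Set;

// All names below are hypothetical stand-ins for the Overlord internals,
// except syncSegmentIds() and syncPendingSegments(), which this PR adds.
interface MetadataStoreSketch
{
  Map<String, String> retrieveUsedSegmentIds(String dataSource);   // ID -> last updated time
  List<String> retrieveSegmentPayloads(Set<String> segmentIds);
  List<String> retrieveAllPendingSegments(String dataSource);
}

interface DatasourceCacheSketch
{
  Set<String> syncSegmentIds(Map<String, String> persistedIds);    // returns IDs to refresh
  void updateSegmentPayloads(List<String> payloads);
  void syncPendingSegments(List<String> pendingSegments);
  void removeEmptyIntervals();
}

class SyncPassSketch
{
  void syncDatasource(MetadataStoreSketch store, DatasourceCacheSketch cache, String dataSource)
  {
    // Steps 1-2: reconcile the full set of used segment IDs in one critical section
    Set<String> idsToRefresh = cache.syncSegmentIds(store.retrieveUsedSegmentIds(dataSource));

    // Steps 3-4: fetch and install payloads only for new or updated segments
    cache.updateSegmentPayloads(store.retrieveSegmentPayloads(idsToRefresh));

    // Steps 5-6: replace the pending segments atomically
    cache.syncPendingSegments(store.retrieveAllPendingSegments(dataSource));

    // Step 7: drop intervals that no longer contain any segment
    cache.removeEmptyIntervals();
  }
}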

Testing

Verified the changes in this patch on the following clusters:

                     Test cluster               Staging cluster
Metadata store       MySQL                      MySQL
Master node          m5.large (2 vCPUs, 8 GB)   m5.large (2 vCPUs, 8 GB)
Used segments        600k                       120k
Unused segments      180k                       1.5M
Pending segments     15k                        1k
Old sync time        2500s                      50-150s
New full sync time   850s                       25s
New delta sync time  150s                       5s

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


@kfaraz kfaraz changed the title Use IndexerMetadataStorageCoordinator on Overlord to mark segments as unused Update cache on Overlord directly when marking segments as unused Mar 9, 2025
@kgyrtkirk kgyrtkirk closed this Mar 13, 2025
@kgyrtkirk kgyrtkirk reopened this Mar 13, 2025
kfaraz added 5 commits March 14, 2025 17:56
@kfaraz kfaraz changed the title Update cache on Overlord directly when marking segments as unused Improve performance of segment metadata cache on Overlord Mar 17, 2025
kfaraz added 4 commits March 24, 2025 15:31
@@ -66,19 +67,22 @@ public class OverlordDataSourcesResource
private static final Logger log = new Logger(OverlordDataSourcesResource.class);

private final SegmentsMetadataManager segmentsMetadataManager;
private final IndexerMetadataStorageCoordinator metadataStorageCoordinator;
@gianm (Contributor) commented:

I just realized that SegmentsMetadataManager is the Coordinator version of a metadata cache. Do you think we're moving towards getting rid of it completely on the Overlord side of things, in favor of using only IndexerMetadataStorageCoordinator?

It looks like after this patch, SegmentsMetadataManager is still used in two places on the OL:

  • this class, for markAsUsedNonOvershadowedSegmentsInInterval, markAsUsedNonOvershadowedSegments, and markSegmentAsUsed;
  • OverlordCompactionScheduler, for getSnapshotOfDataSourcesWithAllUsedSegments. To me it seems like this could already be replaced by IndexerSQLMetadataStorageCoordinator#retrieveAllUsedSegments (although the SegmentsMetadataManager version would perform better, since it caches the snapshot while the IndexerSQLMetadataStorageCoordinator version needs to build a timeline; I am not sure how much this matters).

@kfaraz (Contributor, Author) replied:

Yes, @gianm, I plan to get rid of SqlSegmentsMetadataManager from the Overlord completely.
The Coordinator will continue to use it for the time being, but eventually we might get rid of it altogether.

The usage of SqlSegmentsMetadataManager by OverlordCompactionScheduler does matter for performance, for the timeline and caching reasons you guessed, but I should be able to resolve that soon.

@@ -167,4 +169,24 @@ public static SegmentId getValidSegmentId(String dataSource, String serializedSe
return parsedSegmentId;
}
}

/**
* Tries to parse the given serialized ID as {@link SegmentId}s of the given
@gianm (Contributor) commented:

This method seems suspicious to me, since it encourages ignoring seemingly invalid segment IDs. Where are the places where we encounter segment IDs that may not be valid? Is it OK to ignore them?

@kfaraz (Contributor, Author) replied:

Yeah, I guess it would be better to error out on invalid segment IDs.
The only place these can come from is the REST APIs.

@kfaraz (Contributor, Author):

Updated.

}
}
catch (Exception e) {
log.noStackTrace().error(
@gianm (Contributor) commented:

When might this happen? Does it mean we will ignore segments that we can't read? That would be worrying, because Exception is very broad. Is there some kind of scenario you have in mind where a failure is possible?

Also, .noStackTrace().error is not a combination I expect to see. Typically error is reserved for serious problems, and for serious problems we should always have a stack trace.

@kfaraz (Contributor, Author) commented Mar 25, 2025:

> When might this happen? Does it mean we will ignore segments that we can't read? That would be worrying, because Exception is very broad. Is there some kind of scenario you have in mind where a failure is possible?

This can happen in the rare case when the segment payload has been tampered with or some other column is not parseable. It is not frequent, but it can happen; I only recently encountered this in a prod DB.

We do not throw an error here so that processing can continue with the rest of the segments.
Even though the segment is ignored here, the calling code increments a skippedCount, and we raise an alert if skippedCount > 0. The log here just lets an operator go through the Overlord logs and see which segment IDs failed and why.

Do you think it would be better to create an alert for each failing segment ID separately (similar to SqlSegmentsMetadataManager)? I felt it would be too noisy, so I alerted on the total count instead.
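For illustration only, the skip-and-count pattern described above could look like the following sketch; none of these names are from the actual Druid code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the skip-and-count pattern; all names are stand-ins.
class SkipAndCountSketch
{
  static List<String> parseAll(List<byte[]> payloads)
  {
    List<String> parsed = new ArrayList<>();
    int skippedCount = 0;

    for (byte[] payload : payloads) {
      try {
        parsed.add(parsePayload(payload));
      }
      catch (Exception e) {
        // Log (with stack trace) so an operator can identify the bad row, but keep going
        e.printStackTrace();
        skippedCount++;
      }
    }

    if (skippedCount > 0) {
      // A single alert for the whole batch instead of one alert per bad segment
      raiseAlert("Skipped [" + skippedCount + "] unparseable segments during sync");
    }
    return parsed;
  }

  static String parsePayload(byte[] payload)
  {
    return new String(payload);
  }

  static void raiseAlert(String message)
  {
    System.err.println("ALERT: " + message);
  }
}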

@kfaraz (Contributor, Author):

Also, .noStackTrace().error is not a combination I expect to see. Typically error is reserved for serious problems, and for serious problems we should always have a stack trace.

Yeah, I see your point. I will include the stack trace here.

@kfaraz (Contributor, Author):

Updated.

@gianm (Contributor) replied:

> Do you think it would be better to create an alert for each failing segment ID separately (similar to SqlSegmentsMetadataManager)? I felt it would be too noisy, so I alerted on the total count instead.

I think the way you did it here is OK. It's good that there is an alert, and IMO an alert with the number of bad segments is fine.

@kgyrtkirk kgyrtkirk added this to the 33.0.0 milestone Mar 26, 2025
kfaraz added 2 commits March 26, 2025 14:21
@kfaraz (Contributor, Author) commented Mar 26, 2025

Verified the changes in this patch on the following clusters:

                     Test cluster               Staging cluster
Metadata store       MySQL                      MySQL
Master node          m5.large (2 vCPUs, 8 GB)   m5.large (2 vCPUs, 8 GB)
Used segments        600k                       120k
Unused segments      180k                       1.5M
Pending segments     15k                        1k
Old sync time        2500s                      50-150s
New full sync time   850s                       25s
New delta sync time  150s                       5s

@cryptoe (Contributor) commented Mar 26, 2025

@kfaraz
It looks like the new full sync time goes from 25 seconds for 120k used segments to 850 seconds for 600k used segments. Could you provide some clarification around the non-linear time increase?

Also, could you provide some description of what the delta sync is?

@kfaraz (Contributor, Author) commented Mar 26, 2025

@cryptoe, those are two separate clusters, and the results are not really comparable between the two since the underlying metadata store is different.

The comparison should be made between the old sync time and the new sync time for any given cluster.

Also, the 850s for 600k segments is a little pessimistic, since a real production cluster with, say, 1M segments would have a much beefier metadata store and master node. Based on what I have seen with the metric segment/poll/time on production clusters with 2M segments or more, a full sync would take around 3 minutes and a delta sync would take much less than that (exactly how much less could vary).

Full sync: fetch the payloads of ALL used segments.
Delta sync: fetch all used segment IDs, then fetch payloads only for new or updated segments (see the sketch below).
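A minimal sketch of the difference, assuming the cache tracks a last-known update time per segment ID; all names here are hypothetical.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch contrasting the two sync modes; all names are illustrative.
class SyncModeSketch
{
  // Last-known update time (epoch millis) per segment ID, maintained by the cache
  private final Map<String, Long> cachedUpdateTimes = new HashMap<>();

  // Full sync: fetch the payload of every used segment, regardless of cache state
  Set<String> idsToFetchOnFullSync(Map<String, Long> dbUpdateTimes)
  {
    return new HashSet<>(dbUpdateTimes.keySet());
  }

  // Delta sync: fetch payloads only for IDs that are new or updated since the last sync
  Set<String> idsToFetchOnDeltaSync(Map<String, Long> dbUpdateTimes)
  {
    Set<String> idsToFetch = new HashSet<>();
    dbUpdateTimes.forEach((id, updatedAt) -> {
      Long cachedAt = cachedUpdateTimes.get(id);
      if (cachedAt == null || updatedAt > cachedAt) {
        idsToFetch.add(id);
      }
    });
    return idsToFetch;
  }
}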

@gianm (Contributor) left a comment:

LGTM

@kfaraz (Contributor, Author) commented Mar 26, 2025

Thanks for the review, @gianm!

@kfaraz kfaraz merged commit c0cc27c into apache:master Mar 26, 2025
75 checks passed
@kfaraz kfaraz deleted the mark_unused_via_cache branch March 26, 2025 23:38