Locality-aware chunk distribution #3102

Open · SirYwell wants to merge 7 commits into main from perf/work-distribution
Conversation

@SirYwell (Member) commented Feb 7, 2025

Overview

Description

ParallelQueueExtent deals with distributing work onto multiple threads to speed up edits. In its current implementation, this is achieved by using Region#getChunks().iterator() as a queue and letting all threads pull the next chunk to work on from there (a simplified sketch of this pattern follows the list below). This typically works well, but it has one downside: different threads work on neighboring chunks. There are multiple reasons this isn't optimal:

  • Filters accessing data of surrounding chunks cause more cache misses in their STQE. Since this currently also means that one STQE will trim a chunk cached/processed by another STQE, we do a lot of extra work in such cases.
  • Due to how Minecraft generates chunks, processing neighboring chunks on different threads can be slower if both threads wait for the same resources.
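
For reference, the current pull-from-a-shared-iterator pattern looks roughly like this (a simplified sketch only; the worker loop and processChunk are placeholders, not the actual ParallelQueueExtent code):

```java
import com.sk89q.worldedit.math.BlockVector2;
import com.sk89q.worldedit.regions.Region;

import java.util.Iterator;

// Simplified sketch of the status quo: every worker thread pulls the next chunk
// position from one shared iterator, so neighboring chunks routinely end up on
// different threads.
final class SharedIteratorSketch {

    static void distribute(Region region, int threads) {
        Iterator<BlockVector2> chunks = region.getChunks().iterator();
        Runnable worker = () -> {
            while (true) {
                BlockVector2 chunkPos;
                synchronized (chunks) { // the iterator acts as the single work queue
                    if (!chunks.hasNext()) {
                        return;
                    }
                    chunkPos = chunks.next();
                }
                processChunk(chunkPos); // placeholder for the per-chunk filter work
            }
        };
        for (int i = 0; i < threads; i++) {
            new Thread(worker).start();
        }
    }

    private static void processChunk(BlockVector2 chunkPos) {
        // placeholder
    }
}
```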

By distributing work in a more location-aware way, these downsides can be avoided, at the cost of somewhat more complicated work-splitting logic. This means threads will process more tasks, but those tasks are smaller and faster to process. As an additional benefit, this allows better work distribution when multiple edits run concurrently, as threads don't have to finish one edit before they are free for the next one. Consequently, thread-local data needs to be cleared and re-set more often.
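
A rough sketch of the splitting idea (a simplified RecursiveAction with placeholder names such as ChunkRegionTask and processChunk, not the actual ApplyTask implementation):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;

// Simplified sketch: split a square of chunks into quadrants so each worker ends up
// with a spatially contiguous block of neighboring chunks instead of arbitrary ones.
final class ChunkRegionTask extends RecursiveAction {

    private static final int PARALLELISM = ForkJoinPool.getCommonPoolParallelism();

    private final int minChunkX;
    private final int minChunkZ;
    private final int size; // side length in chunks, assumed to be a power of two

    ChunkRegionTask(int minChunkX, int minChunkZ, int size) {
        this.minChunkX = minChunkX;
        this.minChunkZ = minChunkZ;
        this.size = size;
    }

    @Override
    protected void compute() {
        // Only keep splitting while other workers could plausibly steal the halves.
        if (size > 1 && ForkJoinTask.getSurplusQueuedTaskCount() <= PARALLELISM) {
            int half = size / 2;
            invokeAll(
                    new ChunkRegionTask(minChunkX, minChunkZ, half),
                    new ChunkRegionTask(minChunkX + half, minChunkZ, half),
                    new ChunkRegionTask(minChunkX, minChunkZ + half, half),
                    new ChunkRegionTask(minChunkX + half, minChunkZ + half, half)
            );
            return;
        }
        // Process the remaining (spatially contiguous) batch of chunks on this thread.
        for (int x = minChunkX; x < minChunkX + size; x++) {
            for (int z = minChunkZ; z < minChunkZ + size; z++) {
                processChunk(x, z); // placeholder for applying the filter to one chunk
            }
        }
    }

    private static void processChunk(int chunkX, int chunkZ) {
        // placeholder
    }
}
```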

Exception handling is a bit of a question mark with this approach. I tried to keep it close to the original handling, but I'm not sure it makes sense the way it is right now.


github-actions bot commented Feb 9, 2025

Please take a moment and address the merge conflicts of your pull request. Thanks!

@SirYwell force-pushed the perf/work-distribution branch from 130ef17 to 0f64189 on February 12, 2025 at 16:04
@SirYwell requested a review from a team on February 12, 2025 at 17:43
@SirYwell marked this pull request as ready for review on February 12, 2025 at 17:43
@dordsor21 (Member) left a comment


I can see a scenario where we end up with one ApplyTask instance effectively only utilising one thread for a 512x512 region of the edit, causing the whole edit to wait for it when everything else is finished. I don't know if there's a particular way around that, but in a complex operation this could be quite a perceived performance hit

}
}

private ForkJoinTask<?>[] postProcess() {
Member (review comment):

Is postProcess necessarily the best name here, given that postprocess is already used to define per-chunk actions? Perhaps just flushQueues is fine, as we're not doing any processing (postly or otherwise) here.

for (int regionX = minRegionX; regionX <= maxRegionX; regionX++) {
    for (int regionZ = minRegionZ; regionZ <= maxRegionZ; regionZ++) {
        if (ForkJoinTask.getSurplusQueuedTaskCount() > Settings.settings().QUEUE.PARALLEL_THREADS) {
            // assume we can do a bigger batch of work here - the other threads are busy for a while
Member (review comment):

I would replace "can" with "should". Whilst it's a trade-off with cross-thread mask access, I would still call higher parallelism the "wanted" outcome.

@SirYwell (Member, Author) commented:
> I can see a scenario where we end up with one ApplyTask instance effectively only utilising one thread for a 512x512 region of the edit, causing the whole edit to wait for it when everything else is finished. I don't know if there's a particular way around that, but in a complex operation this could be quite a perceived performance hit

Yes, that's possible in theory, and the heuristic for when to split is rather simple currently. In practice, it would mean the thread pool is already busy and there are even more tasks waiting. We could e.g. force splitting on 512x512 regions, or check ForkJoinTask.getSurplusQueuedTaskCount() > Settings.settings().QUEUE.PARALLEL_THREADS * f(this.shift) (where f is some "damping" function) to make that case even less likely.

I also thought about re-checking the heuristics during processing, but I think that complicates things too much.
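
For illustration, one possible shape of such a damping check (the method name doBiggerBatch and the choice of f are made up here, not part of the PR):

```java
// Hypothetical damping: larger (less deeply split) regions need a proportionally larger
// surplus of queued tasks before we process them as one big batch rather than splitting.
private boolean doBiggerBatch(int shift) {
    int f = 1 << shift; // f(shift) = 2^shift, purely illustrative
    return ForkJoinTask.getSurplusQueuedTaskCount() > Settings.settings().QUEUE.PARALLEL_THREADS * f;
}
```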

@dordsor21 (Member) commented:
> > I can see a scenario where we end up with one ApplyTask instance effectively only utilising one thread for a 512x512 region of the edit, causing the whole edit to wait for it when everything else is finished. I don't know if there's a particular way around that, but in a complex operation this could be quite a perceived performance hit
>
> Yes, that's possible in theory, and the heuristic for when to split is rather simple currently. In practice, it would mean the thread pool is already busy and there are even more tasks waiting. We could e.g. force splitting on 512x512 regions, or check ForkJoinTask.getSurplusQueuedTaskCount() > Settings.settings().QUEUE.PARALLEL_THREADS * f(this.shift) (where f is some "damping" function) to make that case even less likely.
>
> I also thought about re-checking the heuristics during processing, but I think that complicates things too much.

If we are processing a larger region one smaller subregion at a time in the first place, rather than with the for x; for z loops, I think it would be much more reasonable to check at the completion of each subregion whether we want to parallelise further, and then submit the rest. E.g. create a queue at the start of a region's processing and just submit the remaining subregions as/when needed.

@SirYwell (Member, Author) commented:
> If we are processing a larger region one smaller subregion at a time in the first place, rather than with the for x; for z loops, I think it would be much more reasonable to check at the completion of each subregion whether we want to parallelise further, and then submit the rest. E.g. create a queue at the start of a region's processing and just submit the remaining subregions as/when needed.

Hm, I guess processing chunks along a Hilbert curve might be useful there. Otherwise, always splitting and then unforking is basically that, and the overhead of creating that many tasks eagerly is most likely not a problem.
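
A sketch of that "always split, then unfork" pattern (simplified; it reuses the hypothetical ChunkRegionTask from the sketch above, and subregionTasks is a placeholder list of per-subregion tasks, not the PR's code):

```java
// Eagerly fork one task per subregion, then drain them in LIFO order: tasks no other
// worker has stolen yet can be pulled back with tryUnfork() and run inline, while
// stolen tasks are simply joined. Assumes java.util.ArrayDeque/Deque are imported.
Deque<ChunkRegionTask> forked = new ArrayDeque<>();
for (ChunkRegionTask task : subregionTasks) {
    task.fork();
    forked.push(task);
}
while (!forked.isEmpty()) {
    ChunkRegionTask task = forked.pop();
    if (task.tryUnfork()) {
        task.invoke(); // still in our local queue: run it on this thread
    } else {
        task.join();   // another worker took it: wait for completion
    }
}
```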
