Accelerate HashMap and HashSet lookup by using if based modulo in loops #106569


Merged: 1 commit merged into godotengine:master from hashmap-if-mod on May 27, 2025

Conversation

@Ivorforce (Member) commented May 18, 2025

HashMap uses modulo to map hashes and probe positions into the [0, capacity) range.
Modulo is slow, so it was exchanged for fastmod in #62327.
However, fastmod is still a bottleneck for HashMap lookup. I was able to accelerate these calls by using an if based pseudo modulo instead, which is faster than fastmod.

This change can accelerate HashMap lookup by up to 40%.

Explanation

On my machine, fastmod uses two multiplications (3-10 clock cycles [1] each) and a bit shift (1 clock cycle [2]).

An if is an integer comparison (1 clock cycle [2]) plus a mostly predictable branch (0-2 clock cycles [3] normally, 12-25 clock cycles on a misprediction, i.e. in the comparatively rare case where the subtraction actually has to run).

It stands to reason that in most cases, an if based modulo will be faster than fastmod. It can be used whenever we know the value already lies within [0, capacity * 2 - 1], so that a single conditional subtraction is enough.
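
For illustration, here is a minimal sketch of both approaches. The function names are mine, not Godot's actual internals; the fastmod variant follows Lemire's published construction and assumes a GCC/Clang 128-bit integer type.

	#include <cstdint>

	// Lemire-style fastmod, roughly what #62327 introduced: two
	// multiplications and a shift, using a magic constant M that is
	// precomputed once per capacity d as: M = UINT64_MAX / d + 1.
	static inline uint32_t fastmod_u32(uint32_t a, uint64_t M, uint32_t d) {
		uint64_t lowbits = M * a;
		return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
	}

	// The "if based modulo" from this PR, valid whenever pos is already
	// known to be in [0, capacity * 2 - 1] (e.g. an index < capacity plus
	// a single probe step): one compare and a well-predicted branch.
	static inline uint32_t if_mod(uint32_t pos, uint32_t capacity) {
		if (pos >= capacity) {
			pos -= capacity;
		}
		return pos;
	}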

Benchmarks

Theory is well and good, but benchmarks are better.
I measured up to a 40% improvement for HashMap get, 11% for insert, and 2.5% for erase.
Note that this mostly applies when the memory is already in cache; when it is not, RAM access is likely to be the bottleneck instead.

Test Code
#include <chrono>
#include <cstdint>
#include <iostream>

#include "core/templates/hash_map.h" // Godot's internal HashMap.

int main() {
	for (int size : { 1, 2, 8, 64, 1024, 4096 }) {
		{
			size_t time_ns = 0;
			for (int run = 0; run < 20000000 / size; run++) {
				auto t0 = std::chrono::high_resolution_clock::now();
				HashMap<int64_t, int64_t> dictionary;
				for (int idx = 0; idx < size; idx++) {
					// Test
					dictionary.insert(idx, idx);
				}
				auto t1 = std::chrono::high_resolution_clock::now();
				time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
			}

			std::cout << "insert:" << size << std::endl;
			std::cout << time_ns / 1000 / 1000 << "ms\n";
		}
		{
			size_t time_ns = 0;
			for (int run = 0; run < 20000000 / size; run++) {
				// Setup (untimed).
				HashMap<int64_t, int64_t> dictionary;
				for (int idx = 0; idx < size; idx++) {
					dictionary.insert(idx, idx);
				}
				auto t0 = std::chrono::high_resolution_clock::now();
				for (int idx = 0; idx < size; idx++) {
					// Test
					dictionary.erase(idx);
				}
				auto t1 = std::chrono::high_resolution_clock::now();
				time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
			}

			std::cout << "erase:" << size << std::endl;
			std::cout << time_ns / 1000 / 1000 << "ms\n";
		}
		{
			size_t time_ns = 0;
			for (int run = 0; run < 20000000 / size; run++) {
				// Setup (untimed).
				HashMap<int64_t, int64_t> dictionary;
				for (int idx = 0; idx < size; idx++) {
					dictionary.insert(idx, idx);
				}
				size_t total = 0;
				auto t0 = std::chrono::high_resolution_clock::now();
				for (int idx = 0; idx < size; idx++) {
					// Test
					total += dictionary.get(idx);
				}
				auto t1 = std::chrono::high_resolution_clock::now();
				time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();

				// Write to a volatile to prevent the compiler from
				// optimizing the lookups out.
				volatile size_t sink = total;
				(void)sink;
			}

			std::cout << "get:" << size << std::endl;
			std::cout << time_ns / 1000 / 1000 << "ms\n";
		}
	}
	return 0;
}

On master:

insert:1
912ms
erase:1
522ms
get:1
330ms
insert:2
552ms
erase:2
383ms
get:2
174ms
insert:8
310ms
erase:8
376ms
get:8
86ms
insert:64
406ms
erase:64
462ms
get:64
98ms
insert:1024
511ms
erase:1024
587ms
get:1024
105ms
insert:4096
861ms
erase:4096
643ms
get:4096
272ms

On this PR:

insert:1
913ms
erase:1
521ms
get:1
321ms
insert:2
544ms
erase:2
376ms
get:2
164ms
insert:8
292ms
erase:8
371ms
get:8
61ms
insert:64
376ms
erase:64
459ms
get:64
65ms
insert:1024
475ms
erase:1024
565ms
get:1024
64ms
insert:4096
769ms
erase:4096
627ms
get:4096
190ms

Alternatives

@Nazarwadim proposed in #90082 to switch to & (bit masking, with power-of-two capacities) for the modulus. This would make the operation near instantaneous.

However, we currently use prime-based capacities, because those reduce collision counts when a mediocre hash function is used. There is definitely a trade-off to be measured at some point, but it should be measured with both strategies at their ideal implementations.

Since we know our hash functions are high quality, bit masking may be the better solution in the end. Power-of-two growth seems to be the preferred trade-off in most modern HashMap implementations anyway; I will likely test this soon.
Java apparently applies a secondary hash function to the supplied hash to improve its quality (https://stackoverflow.com/a/15437377/730797). We could apply that only to unknown hash functions (and skip it for known, high-quality ones). A sketch of both pieces follows.
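
For concreteness, a minimal sketch of the power-of-two alternative. The names are illustrative, not anything in Godot; the mixing step is the JDK 8 java.util.HashMap spreader, shown only as one example of a cheap secondary hash.

	#include <cstdint>

	// With a power-of-two capacity, the modulo collapses to a single AND.
	static inline uint32_t mask_mod(uint32_t hash, uint32_t capacity_pow2) {
		return hash & (capacity_pow2 - 1);
	}

	// Cheap secondary hash (JDK 8 style): fold the high bits into the low
	// bits so that masking doesn't discard their entropy. Only worthwhile
	// for hash functions of unknown quality.
	static inline uint32_t spread(uint32_t h) {
		return h ^ (h >> 16);
	}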

Footnotes

  [1] https://agner.org/optimize/optimizing_cpp.pdf p. 149

  [2] https://agner.org/optimize/optimizing_cpp.pdf p. 30

  [3] https://agner.org/optimize/optimizing_cpp.pdf p. 43

@DeeJayLSP (Contributor) commented May 20, 2025

GDScript makes heavy use of HashMap lookups, so I tested with GaijinEntertainment/godot-das@master/examples/bunnymark.

At 1000 bunnies (10000 is very unlikely in practice, and mostly misleading, since it would show a result that only appears at that scale), run side by side, the maximum average FPS over one minute was:

Branch        FPS
Reverted PR   1307
This PR       1320

@Ivorforce requested a review from lawnjelly on May 26, 2025 at 23:57
@lawnjelly (Member) commented May 27, 2025

if branches can be dodgy: sometimes going the other way, to branchless code, is faster. But it depends on the hardware and the access patterns. In an ideal world we would check on non-desktop hardware, e.g. low-end mobile (possibly even web, just in case?).

In this case I asked grok 😁 and it reckons it should be faster on desktop and mobile.

BTW I do wonder whether there will be branch prediction if this is not called in a tight loop (I don't know the access patterns, and I'm no expert at branch prediction).

@Ivorforce (Member, Author) commented May 27, 2025

> if branches can be dodgy: sometimes going the other way, to branchless code, is faster. But it depends on the hardware and the access patterns. In an ideal world we would check on non-desktop hardware, e.g. low-end mobile (possibly even web, just in case?).

Yes, this definitely feels like a micro-optimization; it's hard to intuit that the new code is faster than the old. I would not have proposed this change if it didn't lead to such major performance differences for lookup.
On the other hand, branch prediction has been ubiquitously important to CPUs for decades, so we can probably expect if to stay extremely fast on consumer hardware.

> BTW I do wonder whether there will be branch prediction if this is not called in a tight loop (I don't know the access patterns, and I'm no expert at branch prediction).

Yes, branch prediction happens on every branch (even including the implicit branches of short-circuiting && and ||). As far as I've seen, you can normally assume that the branch taken most often is the one that will be predicted.
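
For reference, a truly branchless formulation of the same wrap-around (the kind of alternative alluded to above; a sketch, not what this PR does):

	#include <cstdint>

	// Mask out the subtraction instead of branching on it:
	// -(uint32_t)(pos >= capacity) is 0xFFFFFFFF when a wrap is needed,
	// 0 otherwise, so the subtraction becomes a no-op in the common case.
	static inline uint32_t branchless_mod(uint32_t pos, uint32_t capacity) {
		uint32_t wrap = -(uint32_t)(pos >= capacity);
		return pos - (capacity & wrap);
	}

This trades the branch for a data dependency; whether it beats a well-predicted branch depends on how often that branch would mispredict.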

@Repiteo merged commit 0c12e75 into godotengine:master on May 27, 2025
20 checks passed
@Repiteo (Contributor) commented May 27, 2025

Thanks!

@Ivorforce deleted the hashmap-if-mod branch on May 27, 2025 at 14:53