Accelerate HashMap and HashSet lookup by using if based modulo in loops #106569


Merged: 1 commit merged into godotengine:master from hashmap-if-mod on May 27, 2025

Conversation

@Ivorforce (Member) commented May 18, 2025

HashMap uses modulo to map hashes and probe positions into the [0, capacity) range.
Modulo is slow, so it was exchanged for fastmod in #62327.
However, fastmod is still a bottleneck for HashMap lookup. I was able to accelerate these calls by using an if based pseudo modulo instead, which is faster than fastmod.

This change can accelerate HashMap lookup by up to 40%.

Explanation

On my machine, fastmod uses two multiplications (3-10 clock cycles [1] each) and a bit shift (1 clock cycle [2]).

An if is an integer comparison (1 clock cycle [2]) plus a mostly predictable branch (0-2 clock cycles [3] normally, 12-25 clock cycles on a misprediction, i.e. in the comparatively rare case where the subtraction actually has to run).

It stands to reason that in most cases, an if based modulo will be faster than fastmod. It can be used whenever we know the value already lies within [0, capacity * 2 - 1], so that a single conditional subtraction is enough.
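
For illustration, here is a minimal sketch of both approaches. The function names are mine, not Godot's actual internals; the fastmod variant follows Lemire's published construction and assumes a GCC/Clang 128-bit integer type.

	#include <cstdint>

	// Lemire-style fastmod, roughly what #62327 introduced: two
	// multiplications and a shift, using a magic constant M that is
	// precomputed once per capacity d as: M = UINT64_MAX / d + 1.
	static inline uint32_t fastmod_u32(uint32_t a, uint64_t M, uint32_t d) {
		uint64_t lowbits = M * a;
		return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
	}

	// The "if based modulo" from this PR, valid whenever pos is already
	// known to be in [0, capacity * 2 - 1] (e.g. an index < capacity plus
	// a single probe step): one compare and a well-predicted branch.
	static inline uint32_t if_mod(uint32_t pos, uint32_t capacity) {
		if (pos >= capacity) {
			pos -= capacity;
		}
		return pos;
	}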

Benchmarks

Theory is well and good, but benchmarks are better.
I measured up to a 40% improvement for HashMap get, 11% for insert, and 2.5% for erase.
Note that this mostly applies when the memory is already in cache; when it is not, RAM access is likely to be the bottleneck instead.

Test Code
#include <chrono>
#include <cstdint>
#include <iostream>

#include "core/templates/hash_map.h" // Godot's internal HashMap.

int main() {
	for (int size : { 1, 2, 8, 64, 1024, 4096 }) {
		{
			size_t time_ns = 0;
			for (int run = 0; run < 20000000 / size; run++) {
				auto t0 = std::chrono::high_resolution_clock::now();
				HashMap<int64_t, int64_t> dictionary;
				for (int idx = 0; idx < size; idx++) {
					// Test
					dictionary.insert(idx, idx);
				}
				auto t1 = std::chrono::high_resolution_clock::now();
				time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
			}

			std::cout << "insert:" << size << std::endl;
			std::cout << time_ns / 1000 / 1000 << "ms\n";
		}
		{
			size_t time_ns = 0;
			for (int run = 0; run < 20000000 / size; run++) {
				// Setup (untimed).
				HashMap<int64_t, int64_t> dictionary;
				for (int idx = 0; idx < size; idx++) {
					dictionary.insert(idx, idx);
				}
				auto t0 = std::chrono::high_resolution_clock::now();
				for (int idx = 0; idx < size; idx++) {
					// Test
					dictionary.erase(idx);
				}
				auto t1 = std::chrono::high_resolution_clock::now();
				time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
			}

			std::cout << "erase:" << size << std::endl;
			std::cout << time_ns / 1000 / 1000 << "ms\n";
		}
		{
			size_t time_ns = 0;
			for (int run = 0; run < 20000000 / size; run++) {
				// Setup (untimed).
				HashMap<int64_t, int64_t> dictionary;
				for (int idx = 0; idx < size; idx++) {
					dictionary.insert(idx, idx);
				}
				size_t total = 0;
				auto t0 = std::chrono::high_resolution_clock::now();
				for (int idx = 0; idx < size; idx++) {
					// Test
					total += dictionary.get(idx);
				}
				auto t1 = std::chrono::high_resolution_clock::now();
				time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();

				// Write to a volatile to prevent the compiler from
				// optimizing the lookups out.
				volatile size_t sink = total;
				(void)sink;
			}

			std::cout << "get:" << size << std::endl;
			std::cout << time_ns / 1000 / 1000 << "ms\n";
		}
	}
	return 0;
}

On master:

insert:1
912ms
erase:1
522ms
get:1
330ms
insert:2
552ms
erase:2
383ms
get:2
174ms
insert:8
310ms
erase:8
376ms
get:8
86ms
insert:64
406ms
erase:64
462ms
get:64
98ms
insert:1024
511ms
erase:1024
587ms
get:1024
105ms
insert:4096
861ms
erase:4096
643ms
get:4096
272ms

On this PR:

insert:1
913ms
erase:1
521ms
get:1
321ms
insert:2
544ms
erase:2
376ms
get:2
164ms
insert:8
292ms
erase:8
371ms
get:8
61ms
insert:64
376ms
erase:64
459ms
get:64
65ms
insert:1024
475ms
erase:1024
565ms
get:1024
64ms
insert:4096
769ms
erase:4096
627ms
get:4096
190ms

Alternatives

@Nazarwadim proposed in #90082 to switch to & (bit masking, with power-of-two capacities) for the modulus. This would make the operation near instantaneous.

However, we currently use prime-based capacities, because those reduce collision counts when a mediocre hash function is used. There is definitely a trade-off to be measured at some point, but it should be measured with both strategies at their ideal implementations.

Since we know our hash functions are high quality, bit masking may be the better solution in the end. Power-of-two growth seems to be the preferred trade-off in most modern HashMap implementations anyway; I will likely test this soon.
Java apparently applies a secondary hash function to the supplied hash to improve its quality (https://stackoverflow.com/a/15437377/730797). We could apply that only to unknown hash functions (and skip it for known, high-quality ones). A sketch of both pieces follows.
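
For concreteness, a minimal sketch of the power-of-two alternative. The names are illustrative, not anything in Godot; the mixing step is the JDK 8 java.util.HashMap spreader, shown only as one example of a cheap secondary hash.

	#include <cstdint>

	// With a power-of-two capacity, the modulo collapses to a single AND.
	static inline uint32_t mask_mod(uint32_t hash, uint32_t capacity_pow2) {
		return hash & (capacity_pow2 - 1);
	}

	// Cheap secondary hash (JDK 8 style): fold the high bits into the low
	// bits so that masking doesn't discard their entropy. Only worthwhile
	// for hash functions of unknown quality.
	static inline uint32_t spread(uint32_t h) {
		return h ^ (h >> 16);
	}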

Footnotes

  [1] https://agner.org/optimize/optimizing_cpp.pdf p. 149

  [2] https://agner.org/optimize/optimizing_cpp.pdf p. 30

  [3] https://agner.org/optimize/optimizing_cpp.pdf p. 43

@DeeJayLSP (Contributor) commented May 20, 2025

GDScript makes heavy use of HashMap lookups, so I tested with GaijinEntertainment/godot-das@master/examples/bunnymark.

At 1000 bunnies (10000 is very unlikely in practice, and mostly misleading, since it would show a result that only appears at that scale), run side by side, the maximum average FPS over one minute was:

Branch        FPS
Reverted PR   1307
This PR       1320

@Ivorforce requested a review from lawnjelly on May 26, 2025 at 23:57
@lawnjelly (Member) commented May 27, 2025

if branches can be dodgy: sometimes going the other way, to branchless code, is faster. But it depends on the hardware and the access patterns. In an ideal world we would check on non-desktop hardware, e.g. low-end mobile (possibly even web, just in case?).

In this case I asked grok 😁 and it reckons it should be faster on desktop and mobile.

BTW I do wonder whether there will be branch prediction if this is not called in a tight loop (I don't know the access patterns, and I'm no expert at branch prediction).

@Ivorforce (Member, Author) commented May 27, 2025

> if branches can be dodgy: sometimes going the other way, to branchless code, is faster. But it depends on the hardware and the access patterns. In an ideal world we would check on non-desktop hardware, e.g. low-end mobile (possibly even web, just in case?).

Yes, this definitely feels like a micro-optimization; it's hard to intuit that the new code is faster than the old. I would not have proposed this change if it didn't lead to such major performance differences for lookup.
On the other hand, branch prediction has been ubiquitously important to CPUs for decades, so we can probably expect if to stay extremely fast on consumer hardware.

> BTW I do wonder whether there will be branch prediction if this is not called in a tight loop (I don't know the access patterns, and I'm no expert at branch prediction).

Yes, branch prediction happens on every branch (even including the implicit branches of short-circuiting && and ||). As far as I've seen, you can normally assume that the branch taken most often is the one that will be predicted.
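
For reference, a truly branchless formulation of the same wrap-around (the kind of alternative alluded to above; a sketch, not what this PR does):

	#include <cstdint>

	// Mask out the subtraction instead of branching on it:
	// -(uint32_t)(pos >= capacity) is 0xFFFFFFFF when a wrap is needed,
	// 0 otherwise, so the subtraction becomes a no-op in the common case.
	static inline uint32_t branchless_mod(uint32_t pos, uint32_t capacity) {
		uint32_t wrap = -(uint32_t)(pos >= capacity);
		return pos - (capacity & wrap);
	}

This trades the branch for a data dependency; whether it beats a well-predicted branch depends on how often that branch would mispredict.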

@Repiteo merged commit 0c12e75 into godotengine:master on May 27, 2025
20 checks passed
@Repiteo (Contributor) commented May 27, 2025

Thanks!

@Ivorforce deleted the hashmap-if-mod branch on May 27, 2025 at 14:53