Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize BSON decoding #1667

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

Conversation

vbabanin
Copy link
Member

@vbabanin vbabanin commented Mar 31, 2025

Summary

This PR improves BSON decoding performance by addressing common sources of inefficiency in string deserialization and buffer handling:

Key changes

  • Avoiding redundant byte buffer copies: We relax the requirement for read-only buffers and instead pass a backing array view into the String constructor for UTF-8 decoding. While this makes the buffer mutable, it avoids costly intermediate array allocations and memory copies.

  • Scratch buffer reuse: For direct buffers (which don't expose a backing array), we now reuse a preallocated scratch array rather than allocating a fresh byte[] on every decode. This reduces allocation time and GC pressure.

  • Bulk scanning for null terminators: When searching for C-style null terminators in UTF-8 decoding, we now scan in 8-byte chunks on platforms that support unaligned memory access.

perf task:

Test Base Mainline Mean (MB/s) Patched Mean (MB/s) Diff (%) Z-Score
Deep BSON Decoding 94.1 111.3 +18.2% 3.71
Find & Empty Cursor 138.6 164.0 +18.4% 6.50
Flat BSON Decoding 313.6 394.6 +25.8% 6.59
Full BSON Decoding 244.2 292.4 +19.8% 4.91

Perf analyzer results: Link

perf-netty task:

Test Base Mainline Mean (MB/s) Patched Mean (MB/s) Diff (%) Z-Score
Find many & Empty the cursor 62.9 89.1 +41.6% 15.3

Perf analyzer results: Link

Primary review: @NathanQingyangXu

JAVA-5842

- Removed redundant String copying when possible, replacing with view-based access
- Added faster null-termination lookups to improve decoding performance
@vbabanin vbabanin self-assigned this Mar 31, 2025
@vbabanin vbabanin marked this pull request as ready for review April 3, 2025 07:59
@jyemin jyemin requested review from jyemin and removed request for rozza April 3, 2025 17:38
int pos = buffer.position();
int limit = buffer.limit();

if (UNALIGNED_ACCESS_SUPPORTED) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we did this for all platforms, it would be slower on the ones that don't allow unaligned access? Slower than just going byte by byte? Just wondering if it's worth it to have two code paths to maintain.

I also don't see a test for when this value is false, since we don't run on any platforms that would make it so. It's a bit concerning that we don't, even though by inspection it seems obvious, at least with the code as it is, that it's correct. If we did want to add a test, we would have to add a testing backdoor to PlatformUtil to override the default behavior of examining "os.arch"

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I am ideal reviewer for this PR, so I could only focus on minor issues.
Feel free to treat Jeff as primary reviewer and me as secondary one.

// Found the null at pos + offset; reset buffer's position.
return (pos - prevPos) + offset + 1;
}
pos += 8;
Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the above bitwise magic is cool; but it would be also great to leave some reference to SWAR (like https://en.wikipedia.org/wiki/SWAR) so future coder could know how to understand and maintain it (I assumed that if he had not known of SWAR, the above code comments won't be super helpful as well.

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether it is an idea to present some proof that the above bitwise magic does help perf significantly so the tradeoff between perf and readability is an easy decision (Or you have provided it in the description of the PR?).
For instance, two following metrics could be juxtaposed:

  1. baseline or current main branch metrics (without any optimization)
  2. metrics when only the above SWAR trick applied

Then it would be more convincing that the bitwise magic is really justified.

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished my complete review. Looks good to me. I am especially impressed by the testing completeness in ByteBufferBsonInputTest.

Left some minor comments, but none of which is block of LGTM.

As long as the decision on unaligned memory access is finalized, I can approve this PR.

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side, this PR LGTM unless we want to switch to unconditional SWAR usage (dropping PlatformUtil, maybe after some perf testing on some rare CPU not supporting unaligned memory access).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants