
feat: use reader.copy_into_slice when deserializing char #187

Open

kskalski wants to merge 1 commit into anza-xyz:master from kskalski:ks/char_fill_exact

Conversation

@kskalski
Contributor

Avoid calling fill_buf_exact during char deserialization, so that a Reader implementation isn't required to always provide a borrowed slice of the requested (2-4 byte) size.

Use a stack-allocated 4-byte array instead and fill it with the required (up to 4) bytes using copy_into_slice.

This removes the only call to fill_buf_exact in production code outside of the default / forwarding implementations of Reader.
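
For illustration, a minimal sketch of the described approach; the Reader trait, the read_u8 helper, and the error handling below are assumptions made for the sketch, not the crate's actual API:

use std::io::{self, Error, ErrorKind};

// Hypothetical stand-in for the crate's Reader trait, reduced to the two
// methods this sketch needs.
trait Reader {
    fn read_u8(&mut self) -> io::Result<u8>;
    // Copy exactly `dst.len()` bytes into `dst`, wherever the data currently lives.
    fn copy_into_slice(&mut self, dst: &mut [u8]) -> io::Result<()>;
}

fn invalid_char_lead(b0: u8) -> Error {
    Error::new(ErrorKind::InvalidData, format!("invalid char lead byte {b0:#x}"))
}

fn deserialize_char<R: Reader>(reader: &mut R) -> io::Result<char> {
    let b0 = reader.read_u8()?;
    let len = match b0 {
        0x00..=0x7F => return Ok(b0 as char), // ASCII fast path
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => return Err(invalid_char_lead(b0)),
    };
    // Stack-allocated scratch buffer; only the first `len` bytes are used, and the
    // reader never has to hand out a contiguous borrowed slice of size `len`.
    let mut buf = [0u8; 4];
    buf[0] = b0;
    reader.copy_into_slice(&mut buf[1..len])?;
    let s = core::str::from_utf8(&buf[..len])
        .map_err(|e| Error::new(ErrorKind::InvalidData, e))?;
    Ok(s.chars().next().unwrap())
}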

0xF0..=0xF4 => 4,
_ => return Err(invalid_char_lead(b0)),
};
let mut buf = [0u8; 4];
Contributor Author


I think we could use [MaybeUninit<u8>; 4] and assume_init_ref with rust >= 1.93 (https://doc.rust-lang.org/stable/core/primitive.slice.html#method.assume_init_ref), but I doubt it would make any difference for a 4-byte array.

@kskalski kskalski marked this pull request as ready for review February 13, 2026 07:56
@kskalski kskalski requested a review from cpubot February 13, 2026 07:58
@cpubot
Contributor

cpubot commented Feb 13, 2026

For small reads like this it's totally fine to use fill_* methods rather than copying directly.

if buffered_len >= n_bytes {
// SAFETY: `filled` always points to an initialized portion of the buffer.
return Ok(unsafe { from_raw_parts(buf.as_ptr().cast::<u8>().add(filled.start), n_bytes) });
}

The proposed implementation will short-circuit if the buffer already contains enough bytes to fulfill the request.

fill_buf or fill_exact doesn't always entail IO -- since we accept an n_bytes argument, we can satisfy the request without necessarily filling.

Our Reader API necessarily entails some kind of underlying buffer (given definitions of fill_buf, fill_exact, peek). In general I think the optimal strategy should be "fill wherever possible unless reads may be large" -- this amortizes IO over fewer total syscalls. This strategy is reflected in current implementations. For small reads (integers, char, IpAddr, etc), we potentially trigger buffering such that subsequent small reads don't actually make IO calls, while larger sequences like Vec<T> where T: ZeroCopy, String take the direct copy path.
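
As a rough sketch of that split (the Reader trait and the method signatures here are assumptions for illustration, not the actual trait definition):

use std::io;

trait Reader {
    // Return at least `n_bytes` buffered bytes, filling from the source only if needed.
    fn fill_exact(&mut self, n_bytes: usize) -> io::Result<&[u8]>;
    // Mark `n` already-buffered bytes as consumed.
    unsafe fn consume_unchecked(&mut self, n: usize);
    // Copy `dst.len()` bytes directly into `dst`, bypassing the internal buffer.
    fn copy_into_slice(&mut self, dst: &mut [u8]) -> io::Result<()>;
}

// Small read: go through the buffer, so a run of small reads (integers, chars,
// IpAddr fields, ...) is amortized over a single underlying IO call.
fn read_u32<R: Reader>(reader: &mut R) -> io::Result<u32> {
    let bytes = reader.fill_exact(4)?;
    let value = u32::from_le_bytes(bytes[..4].try_into().unwrap());
    unsafe { reader.consume_unchecked(4) };
    Ok(value)
}

// Large read (e.g. the byte payload of Vec<T: ZeroCopy> or String): copy straight
// into the destination, skipping intermediate buffering.
fn read_payload<R: Reader>(reader: &mut R, dst: &mut [u8]) -> io::Result<()> {
    reader.copy_into_slice(dst)
}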

Consider a case where one is deserializing a struct containing all integers and chars. If those were to all use copy_into_slice, they could totally circumvent buffering depending on the underlying IO Reader implementation, in which case an IO call would be made for each individual deserialization.

The proposed no-grow implementation will still actually call fill_buf when reads are less than buffer capacity

if needed >= capacity {

and this was done specifically to avoid the case mentioned above. But this is an implementation detail of the proposal, not something necessarily guaranteed by the API. By using the fill_* methods, we guarantee by definition that this case can't happen.

@kskalski
Contributor Author

kskalski commented Feb 14, 2026

The practical problem is that even when operating on buffers, satisfying fill_exact and fill_array may be difficult and force the implementation to do a memmove or, worse, allocate. This is the case for the agave reader, which operates on chunks of a larger buffer and reads them in parallel. Generally, when reading a series of values you eventually reach the end of the current chunk: say you are at position pos = chunk.len() - 3 and are asked for a contiguous slice of memory of size 4. You can't just point to chunk[pos..pos + 4], since that goes past the current chunk. Even if the rest of the data is already in the next chunk, you need to do extra defragmentation:

  • move chunk[pos..] to chunk[0..chunk.len() - pos]
  • move data from the next chunk into the current one, after the chunk.len() - pos bytes moved in the previous step
  • the management of positions / consume after such an operation becomes more complicated (the data now lives in a different part of the buffer than usual, etc.)

I would argue that copy_into_slice is an "always cheaper" version of that, especially when the user is going to copy out of the returned slice anyway; they might as well copy from wherever the data already sits straight into the destination memory. copy_into_slice is still efficient for a buffered reader, since it can easily avoid unnecessary or short IOs. The crucial part is that it can populate the target slice in two steps: first from the data already available, then, possibly after doing IO or switching to the next chunk, the rest (repeating the process several times if needed).
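
A minimal sketch of such a two-step copy for a chunked reader (the struct layout and names below are illustrative, not the agave reader's actual code):

struct ChunkedReader {
    chunks: Vec<Vec<u8>>, // illustrative layout; the real reader works on chunks of a larger buffer
    chunk_idx: usize,
    pos: usize, // read position within the current chunk
}

impl ChunkedReader {
    fn copy_into_slice(&mut self, dst: &mut [u8]) -> Result<(), ()> {
        let mut written = 0;
        while written < dst.len() {
            let chunk = self.chunks.get(self.chunk_idx).ok_or(())?; // out of data
            let avail = chunk.len() - self.pos;
            if avail == 0 {
                // Step 2: switch to the next chunk (in a real reader this is where
                // IO or chunk management would happen) and keep copying.
                self.chunk_idx += 1;
                self.pos = 0;
                continue;
            }
            // Step 1: copy whatever is contiguous right now, straight into the
            // destination -- no defragmentation of the source buffer needed.
            let n = avail.min(dst.len() - written);
            dst[written..written + n].copy_from_slice(&chunk[self.pos..self.pos + n]);
            self.pos += n;
            written += n;
        }
        Ok(())
    }
}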

It's fine to have a fill_buf(n) API where the returned slice may be shorter than n. However, I think we should avoid APIs (or at least uses of them) where the expectation is a slice of at least n bytes.

Arguably the defragmentation cost mentioned above is not that bad if the size of the returned slice can be bounded to n <= predictable_max, so fortunately getting rid of fill_exact is not a blocker. In the case of this PR, I think populating a small piece of uninitialized stack is better than forcing the reader to give access to contiguous memory (even with your reference impl we can avoid an unnecessary memmove this way). I will do some comparisons of the assembly involved in a real case, so we also know the perf implications for the current out-of-mem readers.

@cpubot
Contributor

cpubot commented Feb 14, 2026

I would argue that copy_into_slice is an "always cheaper" version of that, especially when the user is going to copy out of the returned slice anyway.

The issue with a memcpy for small values is that it can prohibit scalarization, because it is an opaque "copy some bytes" intrinsic. This is similar in motivation to #64. Avoiding the memcpy gives the compiler more opportunity and more visibility to load rather than copy. That being said, since we inline so heavily, the compiler is likely to be able to produce equivalent assembly, but we have seen explicit cases where it does not do so reliably (and this can be especially problematic on bpf targets).

I do agree with one of your points in #188, that because we employ this pattern so often:

let bytes: [u8; N] = *reader.fill_array();
unsafe { reader.consume_unchecked(N) };

There is likely room for an additional helper like your proposed read_to_array (or my suggested name, get_array).

I would certainly prefer this over using copy_into_slice everywhere on small loads for the reason mentioned above. Slice impls can implement get_array() with a dereference of fill_array() and avoid the memcpy (which I propose should be the default impl in the Reader trait), while still giving flexibility to other Reader implementations that want to do something different.
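
A sketch of what that could look like, following the pattern quoted above (the signatures here are assumptions, not the crate's real trait):

trait Reader {
    // Borrow the next N bytes from the reader's buffer without consuming them.
    fn fill_array<const N: usize>(&mut self) -> &[u8; N];
    // Mark `n` already-buffered bytes as consumed.
    unsafe fn consume_unchecked(&mut self, n: usize);

    // Default impl: a plain dereference of `fill_array`, so slice-backed readers
    // get a simple load with no memcpy; other Reader implementations can override.
    fn get_array<const N: usize>(&mut self) -> [u8; N] {
        let bytes = *self.fill_array::<N>();
        unsafe { self.consume_unchecked(N) };
        bytes
    }
}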
