Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CanonicalCode::read_symbol()'s peeking doesn't work with UnsatisfiableReadBehavior::Reject #25005

Open
nico opened this issue Sep 9, 2024 · 6 comments

Comments

@nico
Copy link
Contributor

nico commented Sep 9, 2024

Consider the test case added in https://github.com/SerenityOS/serenity/compare/master...nico:serenity:deflate-peek?expand=1

It adds a canonical code that uses 1 bit for symbol 0, 2 bits for symbol 1, 3 bits for symbol 2, and so on, and then 15 bits each for symbols 14 and 15.

(15 is the longest allowed code length in deflate.)

It then uses a bit stream that contains one 0, one 10, one 110, and so on, to encode each symbol from 0 to 15 once. That's 15 * 16 / 2 + 15 == 135 bits in total. It's one bit short of a multiple of 8, so the bitstream also contains a single trailing 0 bit.

The bitstream decodes to the symbols 0, 1, 2, ..., 14, 15 – and then the final trailing 0 bit should decode to 0 again.

But instead the final read results in:

 (cd Tests/LibCompress; ../../Build/lagom/bin/TestDeflate canonical_code_lopsided)
Running 1 cases out of 14.
Running test 'canonical_code_lopsided'.
FAIL: /Users/thakis/src/serenity/Tests/LibCompress/TestDeflate.cpp:104: Reached end-of-stream without collecting the required number of bits

That's because CanonicalCode::read_symbol does a TRY(stream.peek_bits<size_t>(m_max_prefixed_code_length)); to look up 8 bits into the future, for the fast lookup added in #18075. However, #21890 made it a mistake to look up at more bits than remain in the stream – even when just peeking.

If I change LittleEndianInputBitStream's ctor to use UnsatisfiableReadBehavior::FillWithZero as default argument, the test starts passing.

I'm not sure what the nicest fix is:

  • Allow peek()ing more bits than available, and only error out when then calling discard_previously_peeked_bits with more bits than available?
  • Somehow expose how many bits are left, and only peek at most that many bits?

The first seems a bit nicer, probably.

@timschumi @trflynn89

nico added a commit to nico/serenity that referenced this issue Sep 9, 2024
...and then fail only when discard_previously_peeked_bits() more
bits than we have.

`unzip ~/Downloads/enwik8.zip`: 1.005s -> 1.050s
4.9% slowdown :/

Fixes SerenityOS#25005
@nico
Copy link
Contributor Author

nico commented Sep 9, 2024

The 2nd commit in https://github.com/SerenityOS/serenity/compare/master...nico:serenity:deflate-peek?expand=1 does the first idea now. It's a 5% slow-down, but in return it's correct :/

@nico
Copy link
Contributor Author

nico commented Sep 9, 2024

With that change, UnsatisfiableReadBehavior can be removed – turns out webp doesn't need any special handling here. Regular deflate just happened to work before since it the superfluous bits came for peeking came from the checksum that always follows the bitstream.

@timschumi
Copy link
Member

#22130 provides additional information about the same issue or a related/similar problem.

I'm not sure what the nicest fix is:

* Allow peek()ing more bits than available, and only error out when then calling `discard_previously_peeked_bits` with more bits than available?

* Somehow expose how many bits are left, and only peek at most that many bits?

The main problem I see with the first one is that something might start to act on the peeked bits already, which is probably to be avoided (even though I most likely wouldn't even catch that during a code review). Zero isn't always the do nothing/end case, as the fuzzer has successfully found.

Therefore, I'm currently more inclined towards the second option, but I'll give it some thought.

nico added a commit to nico/serenity that referenced this issue Sep 9, 2024
Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075,
but the slow path has done a full binary search over all symbols
ever since this code was added in XXX.

Instead of doing a binary search over all codes for every single
bit read, this implements the Moffat-Turpin approach described at
https://www.hanshq.net/zip.html#huffdec, which only requires a
table read per bit.

    hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
    1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but
have to read them one-by-one. This makes the code look a bit
different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much.
It does 16 passes over all symbols. I think it could do it in
a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).
nico added a commit to nico/serenity that referenced this issue Sep 9, 2024
Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but
the slow path has done a full binary search over all symbols
ever since this code was added in SerenityOS#2963. (SerenityOS#3405 even added a FIXME
for doing this, but SerenityOS#18075 removed it.)

Instead of doing a binary search over all codes for every single
bit read, this implements the Moffat-Turpin approach described at
https://www.hanshq.net/zip.html#huffdec, which only requires a
table read per bit.

    hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
    1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but
have to read them one-by-one. This makes the code look a bit
different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much.
It does 16 passes over all symbols. I think it could do it in
a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).
@nico
Copy link
Contributor Author

nico commented Sep 10, 2024

For fast deflate implementations, you normally do something like:

  • Peek 15 bits
  • Look into a table to see if the first ~8 bits are enough for at least one symbol, if so ask that table how many bits were actually read
  • Else, look in a different table and figure out how many bits make up the next symbol
  • Discard that many bits

The first, fast lookup table stores each entry many times: For example, if one symbol is six bits long, it it stores 4 entries, each consisting of the 6 bits the symbol takes up, combined with all 4 possible two-bit patterns to make up 8 bits (assuming the fast table has 8-bit entries. Ours does, cf max_allowed_prefixed_code_length.) So you do read more bits than are present, but the value of the extra bits doesn't matter – they don't have to be 0. But you only learn the actual symbol bit length after looking up the symbol in the table.

If your bit stream keeps 64 bits in a word, then the peek is just a shift and mask, and that and the table lookup are your hot loop. The second option means adding conditionals and a bunch of other work in the hot loop, so I'm a bit worried about perf.

(My local patch for the first option however also adds a check when accepting a symbol and slows things down. Maybe we should just always pass a slice that has some trailing padding data as deflate input, so that we never run into this. Often, that's easy to do, but e.g. webp's VP8L basic lossless format has no naturally occurring trailing data.)

nico added a commit to nico/serenity that referenced this issue Sep 10, 2024
Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but
the slow path has done a full binary search over all symbols
ever since this code was added in SerenityOS#2963. (SerenityOS#3405 even added a FIXME
for doing this, but SerenityOS#18075 removed it.)

Instead of doing a binary search over all codes for every single
bit read, this implements the Moffat-Turpin approach described at
https://www.hanshq.net/zip.html#huffdec, which only requires a
table read per bit.

    hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
    1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but
have to read them one-by-one. This makes the code look a bit
different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much.
It does 15 passes over all symbols. I think it could do it in
a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).
@nico
Copy link
Contributor Author

nico commented Sep 10, 2024

#24567 uses the name peek_bits_possibly_past_end(), which seems possibly a good way forward.

timschumi pushed a commit that referenced this issue Sep 14, 2024
Symbols that need <= 8 bits hit a fast path as of #18075, but
the slow path has done a full binary search over all symbols
ever since this code was added in #2963. (#3405 even added a FIXME
for doing this, but #18075 removed it.)

Instead of doing a binary search over all codes for every single
bit read, this implements the Moffat-Turpin approach described at
https://www.hanshq.net/zip.html#huffdec, which only requires a
table read per bit.

    hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
    1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue #25005, we can't peek the full 15 bits at once but
have to read them one-by-one. This makes the code look a bit
different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much.
It does 15 passes over all symbols. I think it could do it in
a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).
@timschumi
Copy link
Member

#24567 uses the name peek_bits_possibly_past_end(), which seems possibly a good way forward.

Sounds like a good idea, let's do that. That's also less prone to error than filling in bits silently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants