CanonicalCode::read_symbol()'s peeking doesn't work with UnsatisfiableReadBehavior::Reject #25005

nico · 2024-09-09T13:57:23Z

Consider the test case added in https://github.com/SerenityOS/serenity/compare/master...nico:serenity:deflate-peek?expand=1

It adds a canonical code that uses 1 bit for symbol 0, 2 bits for symbol 1, 3 bits for symbol 2, and so on, and then 15 bits each for symbols 14 and 15.

(15 is the longest allowed code length in deflate.)

It then uses a bit stream that contains one 0, one 10, one 110, and so on, to encode each symbol from 0 to 15 once. That's 15 * 16 / 2 + 15 == 135 bits in total. It's one bit short of a multiple of 8, so the bitstream also contains a single trailing 0 bit.

The bitstream decodes to the symbols 0, 1, 2, ..., 14, 15 – and then the final trailing 0 bit should decode to 0 again.

But instead the final read results in:

 (cd Tests/LibCompress; ../../Build/lagom/bin/TestDeflate canonical_code_lopsided)
Running 1 cases out of 14.
Running test 'canonical_code_lopsided'.
FAIL: /Users/thakis/src/serenity/Tests/LibCompress/TestDeflate.cpp:104: Reached end-of-stream without collecting the required number of bits

That's because CanonicalCode::read_symbol does a TRY(stream.peek_bits<size_t>(m_max_prefixed_code_length)); to look up 8 bits into the future, for the fast lookup added in #18075. However, #21890 made it a mistake to look up at more bits than remain in the stream – even when just peeking.

If I change LittleEndianInputBitStream's ctor to use UnsatisfiableReadBehavior::FillWithZero as default argument, the test starts passing.

I'm not sure what the nicest fix is:

Allow peek()ing more bits than available, and only error out when then calling discard_previously_peeked_bits with more bits than available?
Somehow expose how many bits are left, and only peek at most that many bits?

The first seems a bit nicer, probably.

@timschumi @trflynn89

The text was updated successfully, but these errors were encountered:

...and then fail only when discard_previously_peeked_bits() more bits than we have. `unzip ~/Downloads/enwik8.zip`: 1.005s -> 1.050s 4.9% slowdown :/ Fixes SerenityOS#25005

nico · 2024-09-09T14:13:51Z

The 2nd commit in https://github.com/SerenityOS/serenity/compare/master...nico:serenity:deflate-peek?expand=1 does the first idea now. It's a 5% slow-down, but in return it's correct :/

nico · 2024-09-09T14:18:09Z

With that change, UnsatisfiableReadBehavior can be removed – turns out webp doesn't need any special handling here. Regular deflate just happened to work before since it the superfluous bits came for peeking came from the checksum that always follows the bitstream.

timschumi · 2024-09-09T21:33:07Z

#22130 provides additional information about the same issue or a related/similar problem.

I'm not sure what the nicest fix is:

* Allow peek()ing more bits than available, and only error out when then calling `discard_previously_peeked_bits` with more bits than available?

* Somehow expose how many bits are left, and only peek at most that many bits?

The main problem I see with the first one is that something might start to act on the peeked bits already, which is probably to be avoided (even though I most likely wouldn't even catch that during a code review). Zero isn't always the do nothing/end case, as the fuzzer has successfully found.

Therefore, I'm currently more inclined towards the second option, but I'll give it some thought.

Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but the slow path has done a full binary search over all symbols ever since this code was added in XXX. Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit. hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip' 1.008 s ± 0.016 s => 957.7 ms ± 3.9 ms, 5% faster Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article. I also tried not changing CanonicalCode::from_bytes() too much. It does 16 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change. No behavior change (other than slightly faster perf).

Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but the slow path has done a full binary search over all symbols ever since this code was added in SerenityOS#2963. (SerenityOS#3405 even added a FIXME for doing this, but SerenityOS#18075 removed it.) Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit. hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip' 1.008 s ± 0.016 s => 957.7 ms ± 3.9 ms, 5% faster Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article. I also tried not changing CanonicalCode::from_bytes() too much. It does 16 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change. No behavior change (other than slightly faster perf).

nico · 2024-09-10T00:17:06Z

For fast deflate implementations, you normally do something like:

Peek 15 bits
Look into a table to see if the first ~8 bits are enough for at least one symbol, if so ask that table how many bits were actually read
Else, look in a different table and figure out how many bits make up the next symbol
Discard that many bits

The first, fast lookup table stores each entry many times: For example, if one symbol is six bits long, it it stores 4 entries, each consisting of the 6 bits the symbol takes up, combined with all 4 possible two-bit patterns to make up 8 bits (assuming the fast table has 8-bit entries. Ours does, cf max_allowed_prefixed_code_length.) So you do read more bits than are present, but the value of the extra bits doesn't matter – they don't have to be 0. But you only learn the actual symbol bit length after looking up the symbol in the table.

If your bit stream keeps 64 bits in a word, then the peek is just a shift and mask, and that and the table lookup are your hot loop. The second option means adding conditionals and a bunch of other work in the hot loop, so I'm a bit worried about perf.

(My local patch for the first option however also adds a check when accepting a symbol and slows things down. Maybe we should just always pass a slice that has some trailing padding data as deflate input, so that we never run into this. Often, that's easy to do, but e.g. webp's VP8L basic lossless format has no naturally occurring trailing data.)

Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but the slow path has done a full binary search over all symbols ever since this code was added in SerenityOS#2963. (SerenityOS#3405 even added a FIXME for doing this, but SerenityOS#18075 removed it.) Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit. hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip' 1.008 s ± 0.016 s => 957.7 ms ± 3.9 ms, 5% faster Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article. I also tried not changing CanonicalCode::from_bytes() too much. It does 15 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change. No behavior change (other than slightly faster perf).

nico · 2024-09-10T15:42:08Z

#24567 uses the name peek_bits_possibly_past_end(), which seems possibly a good way forward.

Symbols that need <= 8 bits hit a fast path as of #18075, but the slow path has done a full binary search over all symbols ever since this code was added in #2963. (#3405 even added a FIXME for doing this, but #18075 removed it.) Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit. hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip' 1.008 s ± 0.016 s => 957.7 ms ± 3.9 ms, 5% faster Due to issue #25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article. I also tried not changing CanonicalCode::from_bytes() too much. It does 15 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change. No behavior change (other than slightly faster perf).

timschumi · 2024-09-14T11:26:59Z

#24567 uses the name peek_bits_possibly_past_end(), which seems possibly a good way forward.

Sounds like a good idea, let's do that. That's also less prone to error than filling in bits silently.

nico mentioned this issue Sep 9, 2024

LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CanonicalCode::read_symbol()'s peeking doesn't work with UnsatisfiableReadBehavior::Reject #25005

CanonicalCode::read_symbol()'s peeking doesn't work with UnsatisfiableReadBehavior::Reject #25005

nico commented Sep 9, 2024

nico commented Sep 9, 2024

nico commented Sep 9, 2024

timschumi commented Sep 9, 2024

nico commented Sep 10, 2024

nico commented Sep 10, 2024

timschumi commented Sep 14, 2024

CanonicalCode::read_symbol()'s peeking doesn't work with UnsatisfiableReadBehavior::Reject #25005

CanonicalCode::read_symbol()'s peeking doesn't work with UnsatisfiableReadBehavior::Reject #25005

Comments

nico commented Sep 9, 2024

nico commented Sep 9, 2024

nico commented Sep 9, 2024

timschumi commented Sep 9, 2024

nico commented Sep 10, 2024

nico commented Sep 10, 2024

timschumi commented Sep 14, 2024