Clarify the name tokeniser uncomp_len calculation (PR #803)

Open
wants to merge 1 commit into
base: master
17 changes: 12 additions & 5 deletions CRAMcodecs.tex
@@ -2450,11 +2450,18 @@ \section{Name tokenisation codec}
a format within a format, as the multiple byte streams $B_{pos,type}$
are serialised into a single byte stream.

The serialised data stream starts with two unsigned little endiand 32-bit
integers holding the total size of uncompressed name buffer and the
number of read names. This is followed the array elements
themselves.

The serialised data stream starts with two unsigned little-endian
32-bit integers holding the total size of the uncompressed name buffer
and the number of read names, followed by a flag byte indicating
whether the data is compressed with arithmetic coding or rANS Nx16.
Note that the uncompressed size is calculated as the sum of
all name lengths including a termination byte per name (e.g. the nul
char). This holds irrespective of whether the implementation produces
data in this form or whether it returns separate name and name-length
arrays.
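
As an illustration of the size rule above, the following is a minimal Python sketch, not taken from the specification or from any implementation. The function name `name_buffer_uncomp_len` and the sample names are hypothetical. It only shows that each name contributes its length plus one termination byte, regardless of how the names are stored in memory.

```python
def name_buffer_uncomp_len(names):
    """Return the uncompressed name buffer size as described in the text:
    the sum of all name lengths plus one termination byte per name.

    `names` is an iterable of bytes objects WITHOUT their terminator.
    (Hypothetical helper for illustration only.)
    """
    return sum(len(name) + 1 for name in names)


# Hypothetical read names; the values are illustrative only.
example_names = [b"read1", b"read22", b""]

# Same total whether names are held separately or as one nul-terminated buffer.
total_from_lengths = name_buffer_uncomp_len(example_names)
total_from_buffer = len(b"".join(name + b"\0" for name in example_names))
```

Both totals agree because the rule counts one terminator per name even when the implementation never materialises the nul-terminated buffer.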

This is then followed by serialised data and meta-data for each token
stream.
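
The leading fields described above could be read and written as in the sketch below. This is an assumption-laden illustration: it assumes the two 32-bit integers and the flag byte are laid out contiguously in the order given in the text, the function names `pack_name_header` and `unpack_name_header` are hypothetical, and the flag byte is passed through as a raw value because the text here does not state which value selects which entropy coder.

```python
import struct

# "<IIB": little endian, two unsigned 32-bit integers, one unsigned byte.
# Contiguous layout in this order is an assumption based on the prose.
_HEADER = struct.Struct("<IIB")


def pack_name_header(uncomp_len, num_names, flag):
    """Serialise the assumed 9-byte header (hypothetical helper)."""
    return _HEADER.pack(uncomp_len, num_names, flag)


def unpack_name_header(buf):
    """Return (uncomp_len, num_names, flag) and the offset of the data
    that follows the header (hypothetical helper)."""
    uncomp_len, num_names, flag = _HEADER.unpack_from(buf, 0)
    return (uncomp_len, num_names, flag), _HEADER.size


# Illustrative values only: 3 names totalling 14 bytes with terminators.
header = pack_name_header(14, 3, 0)
fields, offset = unpack_name_header(header + b"...token streams...")
```

The returned offset marks where the per-token-stream data and meta-data would begin under these assumptions.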
The token type, $ttype$, holds one of the token ID values listed
above, plus special values to indicate certain additional
flags. Bit 6 (64) set indicates that this entire token data stream is a