From 7de0ae0d82fc9daa53d8258072aabe949cc419fa Mon Sep 17 00:00:00 2001
From: James Bonfield
Date: Tue, 7 Jan 2025 14:42:33 +0000
Subject: [PATCH] Clarify the name tokeniser uncomp_len calculation (PR #803)

This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes #802

Also clarify the name tokeniser serialisation description.
Acknowledge the 1-byte "use_arith" field and replace the nebulous
"array elements" with a more descriptive text about token streams.
---
 CRAMcodecs.tex | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/CRAMcodecs.tex b/CRAMcodecs.tex
index e5852a32f..c55494fff 100644
--- a/CRAMcodecs.tex
+++ b/CRAMcodecs.tex
@@ -2450,11 +2450,18 @@ \section{Name tokenisation codec}
 a format within a format, as the multiple byte streams $B_{pos,type}$
 are serialised into a single byte stream.
 
-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names. This is followed the array elements
-themselves.
-
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of uncompressed name buffer and
+the number of read names, and a flag byte indicating whether data is
+compressed with arithmetic coding or rANS Nx16.
+Note the uncompressed size is calculated as the sum of
+all name lengths including a termination byte per name (e.g. the nul
+char). This is irrespective of whether the implementation produces
+data in this form or whether it returns separate name and name-length
+arrays.
+
+This is then followed by serialised data and meta-data for each token
+stream.
 Token types, $ttype$ holds one of the token ID values listed above in
 the list above, plus special values to indicate certain additional
 flags. Bit 6 (64) set indicates that this entire token data stream is a