From 7de0ae0d82fc9daa53d8258072aabe949cc419fa Mon Sep 17 00:00:00 2001
From: James Bonfield <jkb@sanger.ac.uk>
Date: Tue, 7 Jan 2025 14:42:33 +0000
Subject: [PATCH] Clarify the name tokeniser uncomp_len calculation (PR #803)

This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes #802

Also clarify the name tokeniser serialisation description.
Acknowledge the 1-byte "use_arith" field and replace the nebulous
"array elements" with more descriptive text about token streams.
---
 CRAMcodecs.tex | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/CRAMcodecs.tex b/CRAMcodecs.tex
index e5852a32f..c55494fff 100644
--- a/CRAMcodecs.tex
+++ b/CRAMcodecs.tex
@@ -2450,11 +2450,18 @@ \section{Name tokenisation codec}
 a format within a format, as the multiple byte streams $B_{pos,type}$
 are serialised into a single byte stream.
 
-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names.  This is followed the array elements
-themselves.
-
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of the uncompressed name buffer
+and the number of read names, followed by a flag byte indicating
+whether data is compressed with arithmetic coding or rANS Nx16.
+Note the uncompressed size is calculated as the sum of
+all name lengths including a termination byte per name (e.g. the nul
+char).  This is irrespective of whether the implementation produces
+data in this form or whether it returns separate name and name-length
+arrays.
+
+This is then followed by serialised data and meta-data for each token
+stream.
 Token types, $ttype$ holds one of the token ID values listed above
 in the list above, plus special values to indicate certain additional
 flags.  Bit 6 (64) set indicates that this entire token data stream is a