Commit e2d8506

Bump io_lib revision to 1.14.12
Removing the itf8 and ltf8 public CRAM interfaces (likely never used externally, but I cannot be 100% certain) changes the library ABI, so I've bumped the ABI version too. Also included more benchmarks in the README file.
1 parent dccb9e5 commit e2d8506

3 files changed, +168 -34 lines changed


CHANGES

+73-6
@@ -1,5 +1,9 @@
1-
Version ??? (ongoing)
2-
---------------------
1+
Version 1.14.12 (30th January 2020)
2+
---------------
3+
4+
This primarily has updates for CRAM 3.1 / 4.0. Note these are
5+
*incompatible* with the files produced by 1.14.11. (That warning was
6+
for a reason, and there is still potential for more to change.)
37

48
Updates:
59

@@ -11,6 +15,15 @@ Updates:
1115
codecs and uses only 1000 sequences per slice. Here fast implies
1216
fast random access as well as fast(er) cpu time.
1317

18+
* NM and MD tags are now checked during encode to validate that they
19+
match the decode algorithm. If not they are automatically stored
20+
verbatim.
21+
22+
* CRAM can now auto-disable the multi-ref mode if it realises we're no
23+
longer flip-flopping between many small references. This can
24+
improve compression in some situations as it also reenables the
25+
AP_delta flag.
26+
1427
* INCOMPATIBILITY: The CRAM fqzcomp quality codec has been updated for
1528
the experimental cram versions (3.1 & 4.0). This cannot read older
1629
fqzcomp files (and vice versa).
@@ -22,6 +35,9 @@ Updates:
2235
CRAMv4 at compression level -7 and above has the maximal form
2336
fqzcomp encoding.
2437

38+
* INCOMPATIBILITY: Renumbered the CRAM 3.1/4.0 codecs to sequentially
39+
follow the 3.0 ones.
40+
2541
* EXPERIMENTAL: Scramble -E can embed a consensus instead of reference
2642
and delta against that. It is not recommended that you use this yet
2743
though until the implications are sorted out. (Likely this will need
@@ -30,11 +46,43 @@ Updates:
3046
does not match the md5sum for the reference listed in the @SQ
3147
headers.
3248

33-
* Lots more minor updates to CRAM 3.1 compression codecs.
49+
* Lots more minor updates to CRAM 3.1 / 4.0 compression codecs.
50+
These have now also been moved to the new htscodecs submodule.
51+
See the logs in that git repository for full details of codec
52+
changes.
3453

35-
* NM and MD tags are now checked during encode to validate that they
36-
match the decode algorithm. If not they are automatically stored
37-
verbatim.
54+
* CRAM 4.0 format improvements:
55+
- New variable sized integer encoding.
56+
57+
- New "QO" quality orientation header field to optionally permit
58+
compression of quality strings in their as-sequenced orientation
59+
instead of as-aligned.
60+
61+
- Read names can now be deduped for read-pairs, just as we do for
62+
RNEXT, PNEXT and TLEN.
63+
64+
- CF has a new flag EXPLICIT_TLEN which permits encoding of TLEN
65+
only, but not RNEXT/PNEXT. Useful for preserving off-by-one TLEN
66+
sizes. (Usually insignificant, but on some "wrong" data sets it's
67+
up to 5% space saving.)
68+
69+
- MD, NM and RG can be stored in the TD map as placeholders.
70+
They're auto-computed still, but we now know if they existed and
71+
if so where in the tag list.
72+
73+
- Improved 64-bit position support.
74+
75+
- Added data transforms for RLE, bit-PACKing and mapping and DELTA.
76+
These are analogous to the rANS4x16 codec, but may be used in
77+
conjunction with other codecs. (Currently sparsely utilised by
78+
the encoder.)
79+
80+
- Native support for signed data types, instead of assuming
81+
0xffffffff is -1 (for example). Used for AP, TS and RG.
82+
83+
* Improved build instructions: fixes github #19
84+
85+
* Tidied up EOF writing code to be more CRAM version agnostic.
3886

3987
Bug fixes:
4088

@@ -58,6 +106,25 @@ Bug fixes:
58106

59107
* Fixed compilation error on x32 architecture.
60108

109+
* Fixed LDFLAGS typo causing --with-zlib to overrule the user's
110+
definition of LDFLAGS.
111+
112+
* Fixed memory leaks in the test harness.
113+
114+
* Fixed cram_filter when used in conjunction with "scramble -n" (no
115+
names).
116+
117+
* Fixed some rare thread race conditions in CRAM encoding.
118+
119+
* Fixed an optimisation buglet in gcc 5.0 to 5.4. Fixes github #17
120+
121+
* Various compiler warnings silenced (some of which were minor bug
122+
fixes too).
123+
124+
* Fixed program name in help message from scram_test and
125+
srf_extract_hash.
126+
127+
* Fixed type overflow problems with itf8 macros. Fixes github #22.
61128

62129
Version 1.14.11 (16th October 2018)
63130
---------------
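The CHANGES entries above mention a new variable sized integer encoding and native signed data types "instead of assuming 0xffffffff is -1". The following is a minimal Python sketch of why that matters, assuming a generic 7-bits-per-byte continuation-bit encoding and a zig-zag mapping for signed values. Both are illustrative stand-ins: the function names are hypothetical, and the exact byte layout used by CRAM 4.0 is not stated in this commit.

```python
def zigzag_encode(n):
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(u):
    """Inverse of zigzag_encode."""
    return (u >> 1) if (u & 1) == 0 else -((u + 1) >> 1)


def varint_encode(u):
    """Encode an unsigned integer, 7 bits per byte, most significant group
    first. The top bit is set on every byte except the last (assumed layout)."""
    if u < 0:
        raise ValueError("varint_encode expects an unsigned value")
    groups = [u & 0x7F]              # final byte: continuation bit clear
    u >>= 7
    while u:
        groups.append((u & 0x7F) | 0x80)
        u >>= 7
    return bytes(reversed(groups))


def varint_decode(buf, pos=0):
    """Decode one value starting at pos; return (value, next_position)."""
    u = 0
    while True:
        b = buf[pos]
        pos += 1
        u = (u << 7) | (b & 0x7F)
        if not (b & 0x80):
            return u, pos


# Storing -1 as the unsigned 32-bit pattern 0xffffffff costs the maximum
# number of bytes, whereas a signed-aware mapping keeps it short.
as_unsigned = varint_encode(0xFFFFFFFF)
as_signed = varint_encode(zigzag_encode(-1))
print(len(as_unsigned), len(as_signed))
```

Under these assumptions, small negative values (plausible for the AP, TS and RG fields named above) stay short once the encoder knows the field is signed.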

README.md

+92-25
@@ -1,4 +1,4 @@
1-
Io_lib: Version 1.14.11
1+
Io_lib: Version 1.14.12
22
========================
33

44
Io_lib is a library of file reading and writing code to provide a general
@@ -33,19 +33,35 @@ See the CHANGES for a summary of older updates or git logs for the
3333
full details.
3434

3535

36-
This branch (as of 13th May 2019)
37-
-----------
36+
Version 1.14.12 (30th January 2020)
37+
---------------
38+
39+
This is primarily a change to CRAM, focusing mainly on the unofficial
40+
CRAM 3.1 and 4.0 file formats. Note these newer experimental formats
41+
are INCOMPATIBLE with the 1.14.11 output!
42+
43+
Some changes also affect CRAM 3.0 (current) though. Main updates are:
44+
45+
* Added compression profiles to scramble: fast, normal (default),
46+
small and archive. Specify using scramble -X profile-name. These
47+
change compression codecs permitted as well as the granularity of
48+
random access ("fast" profile is 1/10th the size per block than
49+
normal).
50+
51+
* NM and MD tags are now checked during encode to validate
52+
auto-generation during decode. If they differ they are stored
53+
verbatim.
54+
55+
* CRAM behaves better when many small chromosomes occur in the middle
56+
of larger ones (as it can switch out of multi-ref mode again).
57+
58+
* Numerous improvements to CRAM 4.0 compression ratios.
3859

39-
* CRAM: Added compression profiles to scramble. Specify with -X
40-
profile where "profile" is one of fast, normal (default), small or
41-
archive.
60+
* Some speed improvements to CRAM 3.1 and 4.0 decoding.
4261

43-
* Improved CRAM v3.1/4.0 codec compression ratios and speed. See below
44-
for a small benchmark.
62+
* Fixes to github issues/bugs #12, #14-15, #17-22.
4563

46-
* CRAM (EXPERIMENTAL): scramble -E permits use of a consensus as the
47-
embedded reference instead of real reference. Note this breaks some
48-
CRAM decoders, so will probably be reserved for CRAM v4.0.
64+
See CHANGES for more details.
4965

5066

5167
Version 1.14.11 (16th October 2018)
@@ -76,7 +92,9 @@ The current official GA4GH CRAM version is 3.0.
7692
For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
7793
version 3.1, with new compression codecs (but otherwise an identical
7894
file layout to 3.0), and 4.0 with a few additional format
79-
modifications, such as 64-bit sizes.
95+
modifications, such as 64-bit sizes, deduplication of read names,
96+
orientation changes of quality strings and a revised variable sized
97+
integer encoding.
8098

8199
They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.
82100
It is likely CRAM v4.0 will be official significantly later, but we
@@ -98,22 +116,71 @@ on an Intel i5-4570 processor at 3.2GHz.
98116
|Scramble opts. |Size(MB) |Enc(s)|Dec(s)|Codecs used |
99117
|--------------------|--------:|-----:|-----:|---------------------------|
100118
|-O bam | 531.9| 92.3| 7.5|bgzf(zlib) |
101-
|-O bam | 539.5| 48.5| 3.7|bgzf(libdeflate) |
119+
|-O bam -1 | 611.4| 26.4| 5.4|bgzf(libdeflate) |
120+
|-O bam (default) | 539.5| 45.0| 4.9|bgzf(libdeflate) |
121+
|-O bam -9 | 499.5| 920.2| 4.9|bgzf(libdeflate) |
102122
||||||
103-
|-V2.0 | 257.0| 43.5| 10.9|(default) |
104-
|-V2.0 -X fast | 302.6| 37.0| 12.1|(default, level 1) |
105-
|-V2.0 -X small | 216.3| 126.9| 31.2|bzip2 |
123+
|-V2.0 -X fast | 302.6| 33.5| 12.7|(default, level 1) |
124+
|-V2.0 (default) | 257.0| 39.7| 11.5|(default) |
125+
|-V2.0 -X small | 216.3| 123.8| 32.0|bzip2 |
106126
||||||
107-
|-V3.0 | 223.7| 39.9| 9.8|(default) |
108-
|-V3.0 -X fast | 274.0| 35.6| 10.6|(default, level 1) |
109-
|-V3.0 -X small | 212.2| 94.3| 18.0|bzip2 |
110-
|-V3.0 -X archive | 209.3| 106.6| 17.6|bzip2, lzma |
127+
|-V3.0 -X fast | 274.0| 30.8| 11.0|(default, level 1) |
128+
|-V3.0 (default) | 223.7| 36.7| 10.4|(default) |
129+
|-V3.0 -X small | 212.2| 90.3| 18.2|bzip2 |
130+
|-V3.0 -X archive | 209.3| 103.5| 18.2|bzip2, lzma |
111131
||||||
112-
|-V3.1 | 186.5| 38.3| 8.9|rANS++,tok3 |
113-
|-V3.1 -X fast | 282.7| 29.5| 9.2|rANS++ |
114-
|-V3.1 -X small | 177.0| 78.7| 33.3|rANS++,tok3,fqz |
115-
|-V3.1 -X archive | 172.1| 137.2| 34.9|rANS++,tok3,fqz,bzip2,arith|
116-
132+
|-V3.1 -X fast | 275.1| 28.6| 11.3|rANS++ |
133+
|-V3.1 (default) | 186.2| 36.4| 8.5|rANS++,tok3 |
134+
|-V3.1 -X small | 176.8| 77.9| 34.9|rANS++,tok3,fqz |
135+
|-V3.1 -X archive | 172.0| 134.7| 34.0|rANS++,tok3,fqz,bzip2,arith|
136+
||||||
137+
|-V4.0 -X fast | 258.4| 29.9| 11.2|rANS++ |
138+
|-V4.0 (default) | 181.9| 34.3| 8.3|rANS++,tok3 |
139+
|-V4.0 -X small | 170.8| 74.7| 34.4|rANS++,tok3,fqz |
140+
|-V4.0 -X archive | 166.8| 122.0| 33.7|rANS++,tok3,fqz,bzip2,arith|
141+
142+
We also tested on a small human aligned HiSeq run (ERR317482)
143+
representing older Illumina data with pre-binning era quality values.
144+
This dataset shows less impressive gains with 4.0 over 3.0 in the
145+
default profile, but major gains in small profile once fqzcomp quality
146+
encoding is enabled.
147+
148+
Note for this file, the file sizes are larger meaning less disk
149+
caching is possible (the test machine wasn't a memory stressed
150+
desktop). Threading was also enabled, albeit with just 4 threads,
151+
which further exacerbates I/O bottlenecks. The previous test
152+
demonstrated BAM being faster to read than CRAM, but with large files
153+
in a more I/O stressed situation this test demonstrates the default
154+
profile of CRAM is faster to read than BAM, due to the smaller I/O
155+
footprint.
156+
157+
|Scramble opts. |Size(MB) |Enc(s)|Dec(s)|Codecs used |
158+
|-------------------- |--------:|-----:|-----:|--------------------------------|
159+
|-t4 -O bam (default) | 6526 | 115.4| 44.7|bgzf(libdeflate) |
160+
||||||
161+
|-t4 -V2.0 -X fast | 3674 | 87.4| 31.4|(default, level 1) |
162+
|-t4 -V2.0 (default) | 3435 | 91.4| 30.7|(default) |
163+
|-t4 -V2.0 -X small | 3373 | 145.5| 47.8|bzip2 |
164+
|-t4 -V2.0 -X archive | 3377 | 166.3| 49.7|bzip2 |
165+
|-t4 -V2.0 -X archive -9| 3125 |1900.6| 76.9|bzip2 |
166+
||||||
167+
|-t4 -V3.0 -X fast | 3620 | 88.3| 29.3|(default, level 1) |
168+
|-t4 -V3.0 (default) | 3287 | 90.5| 29.5|(default) |
169+
|-t4 -V3.0 -X small | 3238 | 128.5| 40.3|bzip2 |
170+
|-t4 -V3.0 -X archive | 3220 | 164.9| 50.0|bzip2, lzma |
171+
|-t4 -V3.0 -X archive -9| 3115 |1866.6| 75.2|bzip2, lzma |
172+
||||||
173+
|-t4 -V3.1 -X fast | 3611 | 87.9| 29.2|rANS++ |
174+
|-t4 -V3.1 (default) | 3161 | 88.8| 29.7|rANS++,tok3 |
175+
|-t4 -V3.1 -X small | 2249 | 192.2| 146.1|rANS++,tok3,fqz |
176+
|-t4 -V3.1 -X archive | 2157 | 235.2| 127.5|rANS++,tok3,fqz,bzip2,arith |
177+
|-t4 -V3.1 -X archive -9| 2145 | 480.3| 128.9|rANS++,tok3,fqz,bzip2,arith,lzma|
178+
||||||
179+
|-t4 -V4.0 -X fast | 3551 | 87.8| 29.5|rANS++ |
180+
|-t4 -V4.0 (default) | 3148 | 88.9| 30.0|rANS++,tok3 |
181+
|-t4 -V4.0 -X small | 2236 | 189.7| 142.6|rANS++,tok3,fqz |
182+
|-t4 -V4.0 -X archive | 2139 | 226.7| 127.5|rANS++,tok3,fqz,bzip2,arith |
183+
|-t4 -V4.0 -X archive -9| 2132 | 453.5| 128.2|rANS++,tok3,fqz,bzip2,arith,lzma|
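The README text states that ERR317482 shows modest gains for 4.0 over 3.0 in the default profile but major gains in the small profile. As a rough check, the short Python sketch below recomputes relative size savings from the figures in the table above. The sizes are copied from the table; the helper name `saving_pct` is hypothetical.

```python
# Sizes in MB copied from the ERR317482 table above (scramble -t4).
BAM_MB = 6526
SIZES_MB = {
    ("3.0", "default"): 3287,
    ("3.0", "small"): 3238,
    ("4.0", "default"): 3148,
    ("4.0", "small"): 2236,
}


def saving_pct(baseline_mb, new_mb):
    """Percentage of space saved by new_mb relative to baseline_mb."""
    return 100.0 * (1.0 - new_mb / baseline_mb)


for profile in ("default", "small"):
    v30 = SIZES_MB[("3.0", profile)]
    v40 = SIZES_MB[("4.0", profile)]
    print(f"{profile:8s} 4.0 vs 3.0: {saving_pct(v30, v40):5.1f}%  "
          f"4.0 vs BAM: {saving_pct(BAM_MB, v40):5.1f}%")
```

The default-profile saving of 4.0 over 3.0 comes out at a few percent, while the small profile (where fqzcomp quality encoding is enabled) saves roughly a third, which is consistent with the description in the text.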
117184

118185

119186
Building

configure.ac

+3-3
@@ -1,5 +1,5 @@
11
dnl Process this file with autoconf to produce a configure script.
2-
AC_INIT(io_lib, 1.14.11)
2+
AC_INIT(io_lib, 1.14.12)
33
AC_CONFIG_HEADERS([io_lib_config.h])
44
AC_CONFIG_MACRO_DIR([m4])
55
AM_INIT_AUTOMAKE([serial-tests])
@@ -63,8 +63,8 @@ AX_SUBDIRS_CONFIGURE([htscodecs],[[--disable-shared],[--with-pic]])
6363
# libstaden-read.so.1 -> libstaden-read.so.1.1.0
6464
# libstaden-read.so.1.1.0
6565

66-
VERS_CURRENT=13
67-
VERS_REVISION=1
66+
VERS_CURRENT=14
67+
VERS_REVISION=0
6868
VERS_AGE=0
6969
AC_SUBST(VERS_CURRENT)
7070
AC_SUBST(VERS_REVISION)
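The hunk above changes VERS_CURRENT from 13 to 14 and resets VERS_REVISION to 0, which follows the usual libtool convention for a release that removes public interfaces. The Python sketch below shows how a current:revision:age triple maps to a shared library file name on Linux. It assumes these three variables are passed to libtool's `-version-info` (that wiring is not shown in this diff), and the function name `linux_so_names` is hypothetical.

```python
def linux_so_names(lib, current, revision, age):
    """Return (soname, real file name) as libtool derives them on Linux ELF
    systems from -version-info current:revision:age."""
    major = current - age            # oldest interface still supported
    soname = f"lib{lib}.so.{major}"
    realname = f"lib{lib}.so.{major}.{age}.{revision}"
    return soname, realname


# Before and after this commit, using the values from configure.ac.
print(linux_so_names("staden-read", 13, 1, 0))
print(linux_so_names("staden-read", 14, 0, 0))
```

Because age stays at 0, the soname major number changes with VERS_CURRENT, so programs linked against the previous library would not silently load the new one.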
