- Io_lib: Version 1.14.11
+ Io_lib: Version 1.14.12
========================

Io_lib is a library of file reading and writing code to provide a general
@@ -33,19 +33,35 @@ See the CHANGES for a summary of older updates or git logs for the
full details.


- This branch (as of 13th May 2019)
- -----------
+ Version 1.14.12 (30th January 2020)
+ ---------------
+
+ This is primarily a change to CRAM, focusing mainly on the unofficial
+ CRAM 3.1 and 4.0 file formats. Note these newer experimental formats
+ are INCOMPATIBLE with the 1.14.11 output!
+
+ Some changes also affect CRAM 3.0 (current) though. Main updates are:
+
+ * Added compression profiles to scramble: fast, normal (default),
+   small and archive. Specify using scramble -X profile-name. These
+   change the compression codecs permitted as well as the granularity
+   of random access (the "fast" profile uses blocks 1/10th the size of
+   those in normal).
+
+ * NM and MD tags are now checked during encode to validate
+   auto-generation during decode. If they differ they are stored
+   verbatim.
+
+ * CRAM behaves better when many small chromosomes occur in the middle
+   of larger ones (as it can switch out of multi-ref mode again).
+
+ * Numerous improvements to CRAM 4.0 compression ratios.

- * CRAM: Added compression profiles to scramble. Specify with -X
-   profile where "profile" is one of fast, normal (default), small or
-   archive.
+ * Some speed improvements to CRAM 3.1 and 4.0 decoding.

- * Improved CRAM v3.1/4.0 codec compression ratios and speed. See below
-   for a small benchmark.
+ * Fixes to github issues/bugs #12, #14-15, #17-22.

- * CRAM (EXPERIMENTAL): scramble -E permits use of a consensus as the
-   embedded reference instead of real reference. Note this breaks some
-   CRAM decoders, so will probably be reserved for CRAM v4.0.
+ See CHANGES for more details.


Version 1.14.11 (16th October 2018)
@@ -76,7 +92,9 @@ The current official GA4GH CRAM version is 3.0.
For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
version 3.1, with new compression codecs (but is otherwise identical
file layout to 3.0), and 4.0 with a few additional format
- modifications, such as 64-bit sizes.
+ modifications, such as 64-bit sizes, deduplication of read names,
+ orientation changes of quality strings and a revised variable sized
+ integer encoding.

They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.
It is likely CRAM v4.0 will be official significantly later, but we
@@ -98,22 +116,71 @@ on an Intel i5-4570 processor at 3.2GHz.
| Scramble opts.      | Size(MB) | Enc(s)| Dec(s)| Codecs used                |
| --------------------| --------:| -----:| -----:| ---------------------------|
| -O bam              |     531.9|   92.3|    7.5| bgzf(zlib)                 |
- | -O bam              |     539.5|   48.5|    3.7| bgzf(libdeflate)           |
+ | -O bam -1           |     611.4|   26.4|    5.4| bgzf(libdeflate)           |
+ | -O bam (default)    |     539.5|   45.0|    4.9| bgzf(libdeflate)           |
+ | -O bam -9           |     499.5|  920.2|    4.9| bgzf(libdeflate)           |
||||||
- | -V2.0               |     257.0|   43.5|   10.9| (default)                  |
- | -V2.0 -X fast       |     302.6|   37.0|   12.1| (default, level 1)         |
- | -V2.0 -X small      |     216.3|  126.9|   31.2| bzip2                      |
+ | -V2.0 -X fast       |     302.6|   33.5|   12.7| (default, level 1)         |
+ | -V2.0 (default)     |     257.0|   39.7|   11.5| (default)                  |
+ | -V2.0 -X small      |     216.3|  123.8|   32.0| bzip2                      |
||||||
- | -V3.0               |     223.7|   39.9|    9.8| (default)                  |
- | -V3.0 -X fast       |     274.0|   35.6|   10.6| (default, level 1)         |
- | -V3.0 -X small      |     212.2|   94.3|   18.0| bzip2                      |
- | -V3.0 -X archive    |     209.3|  106.6|   17.6| bzip2, lzma                |
+ | -V3.0 -X fast       |     274.0|   30.8|   11.0| (default, level 1)         |
+ | -V3.0 (default)     |     223.7|   36.7|   10.4| (default)                  |
+ | -V3.0 -X small      |     212.2|   90.3|   18.2| bzip2                      |
+ | -V3.0 -X archive    |     209.3|  103.5|   18.2| bzip2, lzma                |
||||||
- | -V3.1               |     186.5|   38.3|    8.9| rANS++,tok3                |
- | -V3.1 -X fast       |     282.7|   29.5|    9.2| rANS++                     |
- | -V3.1 -X small      |     177.0|   78.7|   33.3| rANS++,tok3,fqz            |
- | -V3.1 -X archive    |     172.1|  137.2|   34.9| rANS++,tok3,fqz,bzip2,arith|
-
+ | -V3.1 -X fast       |     275.1|   28.6|   11.3| rANS++                     |
+ | -V3.1 (default)     |     186.2|   36.4|    8.5| rANS++,tok3                |
+ | -V3.1 -X small      |     176.8|   77.9|   34.9| rANS++,tok3,fqz            |
+ | -V3.1 -X archive    |     172.0|  134.7|   34.0| rANS++,tok3,fqz,bzip2,arith|
+ ||||||
+ | -V4.0 -X fast       |     258.4|   29.9|   11.2| rANS++                     |
+ | -V4.0 (default)     |     181.9|   34.3|    8.3| rANS++,tok3                |
+ | -V4.0 -X small      |     170.8|   74.7|   34.4| rANS++,tok3,fqz            |
+ | -V4.0 -X archive    |     166.8|  122.0|   33.7| rANS++,tok3,fqz,bzip2,arith|
+
+ We also tested on a small human aligned HiSeq run (ERR317482)
+ representing older Illumina data with pre-binning era quality values.
+ This dataset shows less impressive gains with 4.0 over 3.0 in the
+ default profile, but major gains in small profile once fqzcomp quality
+ encoding is enabled.
+
+ Note for this file, the file sizes are larger meaning less disk
+ caching is possible (the test machine wasn't a memory stressed
+ desktop). Threading was also enabled, albeit with just 4 threads,
+ which further exacerbates I/O bottlenecks. The previous test
+ demonstrated BAM being faster to read than CRAM, but with large files
+ in a more I/O stressed situation this test demonstrates the default
+ profile of CRAM is faster to read than BAM, due to the smaller I/O
+ footprint.
+
+ | Scramble opts.          | Size(MB) | Enc(s)| Dec(s)| Codecs used                     |
+ | ------------------------| --------:| -----:| -----:| --------------------------------|
+ | -t4 -O bam (default)    |     6526 |  115.4|   44.7| bgzf(libdeflate)                |
+ ||||||
+ | -t4 -V2.0 -X fast       |     3674 |   87.4|   31.4| (default, level 1)              |
+ | -t4 -V2.0 (default)     |     3435 |   91.4|   30.7| (default)                       |
+ | -t4 -V2.0 -X small      |     3373 |  145.5|   47.8| bzip2                           |
+ | -t4 -V2.0 -X archive    |     3377 |  166.3|   49.7| bzip2                           |
+ | -t4 -V2.0 -X archive -9 |     3125 | 1900.6|   76.9| bzip2                           |
+ ||||||
+ | -t4 -V3.0 -X fast       |     3620 |   88.3|   29.3| (default, level 1)              |
+ | -t4 -V3.0 (default)     |     3287 |   90.5|   29.5| (default)                       |
+ | -t4 -V3.0 -X small      |     3238 |  128.5|   40.3| bzip2                           |
+ | -t4 -V3.0 -X archive    |     3220 |  164.9|   50.0| bzip2, lzma                     |
+ | -t4 -V3.0 -X archive -9 |     3115 | 1866.6|   75.2| bzip2, lzma                     |
+ ||||||
+ | -t4 -V3.1 -X fast       |     3611 |   87.9|   29.2| rANS++                          |
+ | -t4 -V3.1 (default)     |     3161 |   88.8|   29.7| rANS++,tok3                     |
+ | -t4 -V3.1 -X small      |     2249 |  192.2|  146.1| rANS++,tok3,fqz                 |
+ | -t4 -V3.1 -X archive    |     2157 |  235.2|  127.5| rANS++,tok3,fqz,bzip2,arith     |
+ | -t4 -V3.1 -X archive    |     2145 |  480.3|  128.9| rANS++,tok3,fqz,bzip2,arith,lzma|
+ ||||||
+ | -t4 -V4.0 -X fast       |     3551 |   87.8|   29.5| rANS++                          |
+ | -t4 -V4.0 (default)     |     3148 |   88.9|   30.0| rANS++,tok3                     |
+ | -t4 -V4.0 -X small      |     2236 |  189.7|  142.6| rANS++,tok3,fqz                 |
+ | -t4 -V4.0 -X archive    |     2139 |  226.7|  127.5| rANS++,tok3,fqz,bzip2,arith     |
+ | -t4 -V4.0 -X archive -9 |     2132 |  453.5|  128.2| rANS++,tok3,fqz,bzip2,arith,lzma|
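
As a rough illustration of the I/O argument made for the ERR317482 test, the sizes and decode times in the table above can be compared directly against the BAM baseline. The following is a minimal Python sketch using four rows copied from that table; the names `rows` and `relative_to_bam` are illustrative only and are not part of io_lib or scramble.

```python
# Rows copied from the ERR317482 table: (size in MB, decode time in seconds).
rows = {
    "bam (default)":  (6526, 44.7),
    "V3.0 (default)": (3287, 29.5),
    "V3.1 (default)": (3161, 29.7),
    "V4.0 (default)": (3148, 30.0),
}


def relative_to_bam(rows):
    """Return {name: (size ratio, decode-time ratio)} against the BAM row."""
    bam_size, bam_dec = rows["bam (default)"]
    return {
        name: (size / bam_size, dec / bam_dec)
        for name, (size, dec) in rows.items()
    }


ratios = relative_to_bam(rows)
for name, (size_ratio, dec_ratio) in ratios.items():
    print(f"{name:16s} size x{size_ratio:.2f}  decode x{dec_ratio:.2f}")
```

With these figures the default CRAM profiles come out at roughly half the size of the BAM file and about two thirds of its decode time, which is consistent with the smaller I/O footprint being the deciding factor in this test.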

Building