Description
Describe the bug
I've stumbled on this before, and it seems the same issue happens here. .z and .Z files are not always equivalent, but TrecDocs treats them as if they were by calling .lower() on the suffix of the Path object:
ir_datasets/ir_datasets/formats/trec.py
Lines 127 to 137 in 27317b2
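To illustrate the problem, here is a minimal sketch of a case-insensitive suffix check. The helper name and file names are hypothetical and this is not the actual ir_datasets code; it only shows that lowercasing the suffix makes the two extensions indistinguishable.

```python
from pathlib import Path


def suffix_matches_z(name: str) -> bool:
    """Hypothetical helper: case-insensitive check on the path suffix."""
    return Path(name).suffix.lower() == ".z"


# Both names take the same branch, although one may be a compress (LZW)
# file and the other a gzip file written with a custom suffix.
for name in ("example.Z", "example.z"):
    print(name, suffix_matches_z(name))
```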
.Z files are created by the Unix command compress. From its man page:
Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (...)
.z files, on the other hand, can be created by gzip. From the gunzip man page:
gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z (...)
Note that gunzip can, in theory, decompress both formats, but unlzw3 seems to be able to read only the first one (.Z).
Some Disks45 distributions (mine, for instance) are compressed with the .z suffix (i.e. using gzip with option -S .z):
-S .suf --suffix .suf
When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.
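One possible way to tell the two formats apart without relying on the suffix is to look at the first two bytes of the file: gzip streams start with 0x1f 0x8b, while compress (LZW) streams start with 0x1f 0x9d. The sketch below is only an illustration of that idea, not the ir_datasets implementation. The function names are made up, and it assumes that unlzw3.unlzw accepts the raw bytes of a compress file.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # gzip header
COMPRESS_MAGIC = b"\x1f\x9d"  # Unix compress (LZW) header


def sniff_compression(data: bytes) -> str:
    """Return 'gzip', 'compress' or 'unknown' based on the leading magic bytes."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if data[:2] == COMPRESS_MAGIC:
        return "compress"
    return "unknown"


def decompress_z(data: bytes) -> bytes:
    """Decompress the contents of a .z/.Z file by content, ignoring the suffix."""
    kind = sniff_compression(data)
    if kind == "gzip":
        return gzip.decompress(data)
    if kind == "compress":
        # Third-party dependency, imported lazily.
        # Assumption: unlzw3.unlzw takes the compressed bytes and returns bytes.
        import unlzw3
        return unlzw3.unlzw(data)
    raise ValueError("not a gzip or compress stream")
```

With something like this, a gzip file named with a .z suffix would be routed to gzip instead of unlzw3.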
Affected dataset(s)
All datasets that use TrecDocs, but Disks45 is the most likely to be affected.
To Reproduce
Trying to read documents from a .z compressed file results in this:
TypeError: string argument without an encoding
Additional context
The error is triggered on this line:
ir_datasets/ir_datasets/formats/trec.py
Line 136 in 27317b2