TrecDocs: .Z and .z files are different. #189

Open
@ArthurCamara

Describe the bug
I've stumbled on this before, and the same issue seems to happen here. .z and .Z files are not always equivalent, but TrecDocs treats them as if they were by calling .lower() on the suffix of the Path object:

def _docs_iter(self, path):
    if Path(path).is_file():
        path_suffix = Path(path).suffix.lower()
        if path_suffix == '.gz':
            with gzip.open(path, 'rb') as f:
                yield from self._parser(f)
        elif path_suffix in ['.z', '.0z', '.1z', '.2z']:
            # unix "compress" command encoding
            unlzw3 = ir_datasets.lazy_libs.unlzw3()
            with io.BytesIO(unlzw3.unlzw(path)) as f:
                yield from self._parser(f)

.Z files are created by calling the Unix command compress (from the man page):

Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (...)

while .z files are created by using gzip:

gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z (...)

Note that gunzip can, in theory, decompress BOTH formats, but it seems that unlzw3 can only read the first (.Z).
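Since the suffix alone does not identify the format, one option is to look at the first two bytes of the file instead. A minimal sketch, assuming the usual magic numbers (0x1f 0x8b for gzip, 0x1f 0x9d for compress/LZW); `sniff_compression` is a hypothetical helper and not part of ir_datasets:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # gzip streams start with these two bytes
COMPRESS_MAGIC = b"\x1f\x9d"  # Unix "compress" (LZW) streams start with these


def sniff_compression(header: bytes) -> str:
    """Classify a file by its leading bytes, ignoring the file name."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(COMPRESS_MAGIC):
        return "compress"
    return "unknown"


# A gzip stream stays a gzip stream whatever suffix it is saved under.
sample = gzip.compress(b"<DOC>...</DOC>")
print(sniff_compression(sample[:2]))  # gzip
```

With a check like this, a .z file produced by gzip and a .Z file produced by compress can be told apart even on a case-insensitive comparison of suffixes.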

There are some Disks45 distributions (mine, for instance) that are compressed with .z (i.e. using gzip with option -S .z):

-S .suf --suffix .suf
When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.

Affected dataset(s)

All datasets that use TrecDocs, but most likely Disks45.

To Reproduce
Trying to read documents from .z-compressed files results in this:

TypeError: string argument without an encoding

Additional context
The error is triggered on this line:

with io.BytesIO(unlzw3.unlzw(path)) as f:
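One possible direction for a fix, sketched here rather than proposed as the actual patch, is to pick the decompressor from the file content and fall back to unlzw3 only for real compress streams. `open_by_content` is a hypothetical name; passing raw bytes to `unlzw3.unlzw` and importing `unlzw3` directly (instead of through `ir_datasets.lazy_libs`) are assumptions made to keep the example self-contained:

```python
import gzip
import io
import os
import tempfile
from pathlib import Path


def open_by_content(path):
    """Return a binary file object over the decompressed content of `path`.

    The format is chosen from the magic bytes, not from the suffix.
    """
    path = Path(path)
    with path.open("rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":  # gzip, even when the suffix is .z
        return gzip.open(path, "rb")
    if magic == b"\x1f\x9d":  # Unix compress (.Z)
        import unlzw3  # third-party; imported lazily, assumed to be installed
        return io.BytesIO(unlzw3.unlzw(path.read_bytes()))
    return path.open("rb")  # not compressed: read as-is


# Usage: a gzip stream stored under a .z suffix, as in the Disks45 copy above.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.z")
    with gzip.open(sample, "wb") as out:
        out.write(b"<DOC>demo</DOC>")
    with open_by_content(sample) as f:
        print(f.read())
```

Dispatching on content means the suffix list in `_docs_iter` would only be needed to decide which files to look at, not how to decode them.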
