TrecDocs: .Z and .z files are different.

**Describe the bug**
I've stumbled on this before, and it seems like the same issue happens here. `.z` and `.Z` files are not always equivalent, but `TrecDocs` treats them as equivalent by calling `.lower()` on the suffix of the `Path` object:

https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L127-L137

`.Z` files are created by calling the Unix command [compress](https://linux.die.net/man/1/compress). From the man page:
> Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (...)

while .z files are created by using [gzip](https://linux.die.net/man/1/gzip):
> gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z  (...)

Note that gunzip can, in theory, decompress BOTH formats, but it seems like [unlzw3](https://github.com/scivision/unlzw3) can only read the first (`.Z`).

There are some Disks45 distributions (mine, for instance) that are compressed with a `.z` suffix (i.e. using gzip with the option `-S .z`):
> -S .suf --suffix .suf
    When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.
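
Since the suffix alone does not say which tool produced a file, the leading bytes are a more reliable signal: gzip streams start with `1f 8b`, while files written by `compress` start with `1f 9d`. Below is a minimal sketch of that check. The function name `sniff_compression` and the sample file name are hypothetical and are not part of `ir_datasets`.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of a gzip stream
COMPRESS_MAGIC = b"\x1f\x9d"  # first two bytes of a Unix "compress" (LZW) file


def sniff_compression(path):
    """Return 'gzip', 'compress', or 'unknown' based on the first two bytes."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return "gzip"
    if magic == COMPRESS_MAGIC:
        return "compress"
    return "unknown"


# Demo: a gzip stream stored under a lowercase .z suffix,
# which is how `gzip -S .z` would name its output.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.z"  # hypothetical file name
    sample.write_bytes(gzip.compress(b"<DOC>example</DOC>"))
    detected = sniff_compression(sample)
    print(detected)
```

Running this check on one of the affected files would confirm whether it is really gzip data behind a `.z` suffix.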

**Affected dataset(s)**

All datasets that use `TrecDocs`, though Disks45 is the most likely to be affected.

**To Reproduce**
Trying to read documents from `.z` compressed files results in this:

```python
TypeError: string argument without an encoding
```
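
For context, Python raises this exact message whenever a `str` is handed to `bytes()` or `bytearray()` without an encoding. The snippet below only reproduces the message with the standard library. It is an assumption that something similar happens with the path on the way into the decoder, and I have not traced where in `ir_datasets` or `unlzw3` the conversion occurs. The path is hypothetical.

```python
# A str (for example a file path) passed where bytes are expected.
path = "disk4/sample.z"  # hypothetical path, used only for illustration

try:
    bytearray(path)  # no encoding given, so Python refuses to convert the str
except TypeError as exc:
    message = str(exc)
    print(message)
```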

**Additional context**
The error is triggered on this line:
https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L136


    def _docs_iter(self, path):
        if Path(path).is_file():
            path_suffix = Path(path).suffix.lower()
            if path_suffix == '.gz':
                with gzip.open(path, 'rb') as f:
                    yield from self._parser(f)
            elif path_suffix in ['.z', '.0z', '.1z', '.2z']:
                # unix "compress" command encoding
                unlzw3 = ir_datasets.lazy_libs.unlzw3()
                with io.BytesIO(unlzw3.unlzw(path)) as f:
                    yield from self._parser(f)
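
One possible direction, sketched under assumptions rather than proposed as the actual fix: stop trusting the lowercased suffix, try gzip first, and fall back to an LZW decoder only when the data is not a gzip stream. The helper `open_z_stream` and its `lzw_decode` parameter are hypothetical. In practice `lzw_decode` could be something like the `unlzw3` decoder, assuming it accepts raw bytes.

```python
import gzip
import io


def open_z_stream(raw, lzw_decode):
    """Return a binary file object over the decompressed contents of `raw`.

    `raw` is the whole file content as bytes. `lzw_decode` is a caller-supplied
    callable (hypothetical parameter) that decodes Unix "compress" data to bytes.
    """
    try:
        return io.BytesIO(gzip.decompress(raw))
    except OSError:
        # gzip.BadGzipFile subclasses OSError: not a gzip stream, so try LZW.
        return io.BytesIO(lzw_decode(raw))


def fake_lzw(data):
    """Stand-in decoder for the demo; it only tags its input, it does not decode."""
    return b"LZW:" + data[:2]


from_gzip = open_z_stream(gzip.compress(b"<DOC>1</DOC>"), fake_lzw).read()
from_lzw = open_z_stream(b"\x1f\x9d\x90rest", fake_lzw).read()
print(from_gzip, from_lzw)
```

The trade-off is that the whole file is read into memory, which the existing `.z` branch already does through `io.BytesIO`.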
