This is for both groove_full.tsv.gz and zephir_ingested_items.tsv.gz.
The format appears to be:
aeu.ark:/13960/t30305j3b AEU AEU ia ia.cihm_76481
I think that's an HTID, a campus(?) id, a collection id, a digitization source, and an optional IA ID, but we should double-check that.
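To make that guess concrete, here is a rough sketch of a parser for the layout described above. It is not taken from any existing code: the field names and the `ZephirItem`, `parse_line`, and `iter_items` helpers are hypothetical. It assumes the columns are tab-separated (going by the .tsv extension) and that the fifth column may be missing or empty.

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class ZephirItem(NamedTuple):
    # Field names are guesses based on the description above, not confirmed.
    htid: str
    campus_id: str
    collection_id: str
    digitization_source: str
    ia_id: Optional[str]


def parse_line(line: str) -> ZephirItem:
    """Split one line into its columns (assumes tab-separated values)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}: {line!r}")
    # Treat a missing or empty fifth column as "no IA ID".
    ia_id = fields[4] if len(fields) == 5 and fields[4] else None
    return ZephirItem(fields[0], fields[1], fields[2], fields[3], ia_id)


def iter_items(path: str) -> Iterator[ZephirItem]:
    """Yield parsed rows from a gzipped file, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

Running `iter_items` over both files and counting the rows that raise `ValueError` would be a cheap way to check whether the five-column guess actually holds.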
I don't recall what the difference between the two files is. Maybe groove_full.tsv.gz lists things that Zephir has bib data for but that aren't ingested? @cscollett could confirm.
In any case, it would probably make sense to reconsider where we handle some of these things (https://github.com/hathitrust/feed_internal/blob/main/bin/jobs/feed.monthly/zephir_diff.pl#L32) and if/where we do verification.
That applies especially to groove_full.tsv.gz, given that post-zephir processing is just passing it along and ingest already knows how to connect to the zephir ftps (https://github.com/hathitrust/feed_internal/blob/main/bin/jobs/feed.daily/GetZephirFiles.pm). This may have to be rethought when we use S3, but it should be easy enough to add an event-driven workflow in ingest as well for the files it cares about.