Things for ingest: Verify format of zephir_ingested_items.tsv.gz; reconsider handling for groove_full.tsv.gz #53

This is for both `groove_full.tsv.gz` and `zephir_ingested_items.tsv.gz`.

The format appears to be:

```
aeu.ark:/13960/t30305j3b        AEU     AEU     ia      ia.cihm_76481
```

I think that's an HTID, a campus(?) id, a collection id, a digitization source, and an optional IA ID, but we should double-check that.
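As a starting point for that double-check, here is a minimal sketch of a format check in Python. The column names in `FIELDS` are just the guesses above, not a confirmed schema, and `check_row` / `check_file` are hypothetical helpers rather than anything that exists in `feed_internal`. It only verifies that each row has four or five tab-separated columns, that the first column looks like `namespace.id`, and that the three required middle columns are non-empty.

```python
import csv
import gzip

# Assumed column names, taken from the guess above; not a confirmed schema.
FIELDS = ("htid", "campus_id", "collection_id", "digitization_source", "ia_id")


def check_row(row):
    """Return a list of problems found in one parsed TSV row (empty if none)."""
    problems = []
    # The fifth column (IA ID) is assumed to be optional.
    if len(row) not in (4, 5):
        problems.append(f"expected 4 or 5 columns, got {len(row)}")
        return problems
    htid = row[0]
    # Split on the first "." only; the object id may itself contain dots.
    namespace, sep, object_id = htid.partition(".")
    if not sep or not namespace or not object_id:
        problems.append(f"HTID {htid!r} is not of the form namespace.id")
    for name, value in zip(FIELDS[1:4], row[1:4]):
        if not value:
            problems.append(f"{name} is empty")
    return problems


def check_file(path):
    """Yield (line_number, problems) for each malformed row in a gzipped TSV."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_number, row in enumerate(reader, start=1):
            problems = check_row(row)
            if problems:
                yield line_number, problems
```

Running `check_file` over both `groove_full.tsv.gz` and `zephir_ingested_items.tsv.gz` and looking at what it flags would at least tell us whether the two files share this shape.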

I don't recall what the difference between the two files is -- maybe `groove_full.tsv.gz` lists things that Zephir has bib data for but that aren't ingested? @cscollett could confirm.

In any case, it would probably make sense to reconsider where we handle some of these things (https://github.com/hathitrust/feed_internal/blob/main/bin/jobs/feed.monthly/zephir_diff.pl#L32) and if/where we do verification.

This applies especially to `groove_full.tsv.gz`, given that post-zephir processing just passes it along and ingest already knows how to connect to the Zephir FTPS server (https://github.com/hathitrust/feed_internal/blob/main/bin/jobs/feed.daily/GetZephirFiles.pm). This may have to be rethought when we use S3, but it should be easy enough to add an event-driven workflow in ingest as well for the files it cares about.
