Things for ingest: Verify format of zephir_ingested_items.tsv.gz; reconsider handling for groove_full.tsv.gz #53

@aelkiss

This is for both groove_full.tsv.gz and zephir_ingested_items.tsv.gz.

The format appears to be:

aeu.ark:/13960/t30305j3b        AEU     AEU     ia      ia.cihm_76481

I think that's an HTID, a campus(?) id, a collection id, a digitization source, and an optional IA ID, but we should double-check that.
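As a starting point for that double-check, here is a minimal sketch of a format check in Python. It is not existing ingest code. The column names, the tab separator, the rule that the fifth column is optional, and the HTID pattern (namespace, dot, identifier) are all assumptions taken from the guess above and the one sample line, and `check_lines` and `check_file` are hypothetical helper names.

```python
import gzip
import re
from typing import Iterable, Iterator, List, Tuple

# Assumed column meanings, taken from the guess in this issue; not confirmed.
FIELDS = ("htid", "campus_id", "collection_id", "digitization_source", "ia_id")

# Assumed HTID shape: a short namespace, a dot, then a non-empty identifier.
HTID_RE = re.compile(r"^[a-z0-9]+\.\S+$")


def check_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, problem) for each line that does not fit the assumed format."""
    for lineno, line in enumerate(lines, start=1):
        cols = line.rstrip("\n").split("\t")
        # Assumption: the first four columns are required, the IA ID is optional.
        if len(cols) not in (len(FIELDS) - 1, len(FIELDS)):
            yield lineno, f"expected 4 or 5 columns, got {len(cols)}"
            continue
        if not HTID_RE.match(cols[0]):
            yield lineno, f"first column does not look like an HTID: {cols[0]!r}"
        if any(col == "" for col in cols[1:4]):
            yield lineno, "empty required column"


def check_file(path: str) -> List[Tuple[int, str]]:
    """Collect all problems found in a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return list(check_lines(fh))
```

Running `check_file("zephir_ingested_items.tsv.gz")` and `check_file("groove_full.tsv.gz")` against real copies would show quickly whether the guessed layout holds for both files, and which lines break it.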

I don't recall what the difference between the two files is. Maybe groove_full.tsv.gz lists items that Zephir has bib data for but that aren't ingested? @cscollett could confirm.

In any case, it would probably make sense to reconsider where we handle some of these things (https://github.com/hathitrust/feed_internal/blob/main/bin/jobs/feed.monthly/zephir_diff.pl#L32) and if/where we do verification.

This applies especially to groove_full.tsv.gz, since post-zephir processing just passes it along and ingest already knows how to connect to the Zephir FTPS server (https://github.com/hathitrust/feed_internal/blob/main/bin/jobs/feed.daily/GetZephirFiles.pm). This may have to be rethought when we use S3, but it should be easy enough to add an event-driven workflow in ingest as well for the files it cares about.
