Optimize remote input handling with list-then-download-per-file #127

@malon64

Summary

Remote runs currently list inputs and download every file up front, before precheck and processing. This wastes downloads and temp space when a run aborts early or needs only a subset of the files.

Proposed change

  • Keep the initial list step to compute the input set.
  • Download inputs just-in-time per file before precheck/read.
  • Cache downloaded paths to avoid double downloads across stages.
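
The three steps above could be sketched roughly as follows. This is a minimal illustration in Python, not the project's code: `LazyDownloader`, `run_pipeline`, and the `fetch`, `precheck`, and `read` callables are hypothetical names, and stopping at the first failed precheck is an assumption made only to show the early-abort case.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


class LazyDownloader:
    """Downloads each remote input at most once, on first request."""

    def __init__(self, fetch: Callable[[str, Path], None], temp_dir: Path) -> None:
        # Hypothetical callable: copies one remote object to a local path.
        self._fetch = fetch
        self._temp_dir = temp_dir
        self._cache: Dict[str, Path] = {}

    def local_path(self, uri: str) -> Path:
        cached = self._cache.get(uri)
        if cached is not None:
            return cached  # already downloaded by an earlier stage
        # Prefix with a counter so equal file names from different prefixes do not collide.
        name = uri.rsplit("/", 1)[-1]
        dest = self._temp_dir / f"{len(self._cache):06d}_{name}"
        self._fetch(uri, dest)
        self._cache[uri] = dest
        return dest


def run_pipeline(
    uris: Iterable[str],
    downloader: LazyDownloader,
    precheck: Callable[[Path], bool],
    read: Callable[[Path], str],
) -> List[str]:
    """Process the listed inputs one by one, downloading just in time."""
    results: List[str] = []
    for uri in uris:
        path = downloader.local_path(uri)  # first request: triggers the download
        if not precheck(path):
            break  # assumed abort policy: later inputs are never downloaded
        results.append(read(downloader.local_path(uri)))  # cache hit, no second download
    return results
```

With three listed inputs where the second fails precheck, only the first two are fetched, and each is fetched once even though both precheck and read ask for its path.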

Notes

  • This likely touches the run -> precheck -> read pipeline, because InputFile currently expects source_local_path to exist.
  • Should preserve dry-run list-only behavior.
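
One possible shape for the data-model side of these notes, sketched in Python under stated assumptions: `InputFile` and `source_local_path` are the names from this issue, but the real type's fields are not shown here, so `source_uri`, `ensure_local`, `plan_inputs`, and `dry_run` are hypothetical. The idea is that the local path becomes optional until a stage actually needs the file, while dry-run stays list-only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class InputFile:
    source_uri: str  # hypothetical field: where the input lives remotely
    # Assumption: optional, so an entry can exist before its download.
    source_local_path: Optional[Path] = None

    def ensure_local(self, fetch: Callable[[str], Path]) -> Path:
        """Download on first use; later stages reuse the stored path."""
        if self.source_local_path is None:
            self.source_local_path = fetch(self.source_uri)
        return self.source_local_path


def plan_inputs(list_remote: Callable[[], List[str]]) -> List[InputFile]:
    """List step only: build the input set without downloading anything."""
    return [InputFile(source_uri=uri) for uri in list_remote()]


def dry_run(list_remote: Callable[[], List[str]]) -> List[str]:
    """Dry-run reports what would be processed and never calls a fetcher."""
    return [f.source_uri for f in plan_inputs(list_remote)]
```

Precheck and read would each call `ensure_local` instead of reading `source_local_path` directly, which keeps the change local to the places that touch the file.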

Acceptance

  • No behavior change in results.
  • Reduced temp usage and unnecessary downloads for remote sources.

Metadata

Labels

  • area:io: IO formats/read/write
  • area:storage: Storages (local/s3/adls/gcs)
  • help wanted: Extra attention is needed
