
[snapshot-runtime] Harden Snapshot Download Pipeline for Fork Deploy Reliability #98

@kangeunchan

Enhancement Description

  • One-line description (release-note style): Harden snapshot download pipeline with resume, parallel range fetch, and clearer progress/failure semantics for Cosmos fork deploy reliability.
  • Scope type: Snapshot download and extraction reliability hardening

Background

Snapshot download has become a critical bottleneck and failure point for deploy --fork on Cosmos mainnet and testnet.

This issue tracks:

  • Download-path reliability improvements that were implemented outside the scope of the baseline plugin completion work

Scope

  • Add parallel range download with single-stream fallback
    • File: internal/infrastructure/snapshot/download.go
    • Includes range capability probe, fallback behavior, and safe finalization
  • Add resume support and temporary file continuation behavior
    • File: internal/infrastructure/snapshot/download.go
    • Includes .tmp continuation and server response handling
  • Improve progress reporting and timeout handling for long-running snapshot downloads
    • Files: internal/infrastructure/snapshot/download.go, internal/application/devnet/progress_bar.go
  • Refactor snapshot download flow for clearer execution contract and tests
    • Files: internal/infrastructure/snapshot/download.go, internal/infrastructure/snapshot/download_test.go
  • Expose snapshot timeout on deploy path
    • Files: cmd/devnet-builder/commands/manage/deploy.go, internal/application/dto/devnet_dto.go, internal/application/devnet/provision.go, internal/application/devnet/run.go

Non-Goals

  • Replacing external snapshot providers
  • Changing Cosmos plugin business logic (genesis mutation/RPC semantics)
  • Introducing a generalized multi-chain snapshot framework

Risks and Open Questions

  • Provider-side per-connection throttling may still dominate throughput
  • Resume behavior may differ across providers/CDNs, since some do not honor Range requests consistently
  • Locking/concurrency policy for same cache key in concurrent deploy invocations may need stricter guarantees

Validation Plan

Unit and Integration Checks

  • ASDF_GOLANG_VERSION=1.24.0 go test ./internal/infrastructure/snapshot/...
  • ASDF_GOLANG_VERSION=1.24.0 go test ./internal/application/devnet/...
  • Add explicit tests for concurrent same-cache-key download contention

End-to-End Checks

  • devnet-builder deploy --blockchain cosmos --network mainnet --fork --snapshot-timeout 2h
  • devnet-builder deploy --blockchain cosmos --network testnet --fork --snapshot-timeout 2h
  • Validate resume behavior by interrupting download and rerunning deploy

Evidence Required in Issue Updates

  • Download throughput/progress evidence for mainnet snapshot
  • Resume continuation evidence (.tmp reused) after interruption
  • Fallback path evidence when range/parallel is not available

Acceptance Criteria

  • Snapshot download completes reliably over long-running transfers
  • Interrupted downloads can resume without full restart when provider supports Range
  • Parallel mode improves time-to-download where provider/network allows
  • Fallback mode remains correct and deterministic when parallel is unavailable
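
The resume criterion above depends on how the server answers a `Range: bytes=<offset>-` request. As a sketch under stated assumptions (the names `resumeAction` and `decideResume` are hypothetical and not taken from download.go), the response handling could be reduced to a small decision function:

```go
package main

import (
	"fmt"
	"net/http"
)

// resumeAction describes what to do with the existing .tmp file.
type resumeAction int

const (
	appendToTmp     resumeAction = iota // server honored Range: keep .tmp and append
	restartFromZero                     // server ignored Range: truncate .tmp and rewrite
	alreadyComplete                     // .tmp already holds the full body
)

// decideResume interprets the response to a "Range: bytes=<offset>-"
// request, given the status code and the Content-Range header value.
func decideResume(status int, contentRange string, offset int64) (resumeAction, error) {
	switch status {
	case http.StatusPartialContent:
		// Only the start offset is checked; the total may be "*" (unknown).
		var start int64
		if _, err := fmt.Sscanf(contentRange, "bytes %d-", &start); err != nil {
			return 0, fmt.Errorf("unparseable Content-Range %q: %w", contentRange, err)
		}
		if start != offset {
			return 0, fmt.Errorf("server resumed at byte %d, want %d", start, offset)
		}
		return appendToTmp, nil
	case http.StatusOK:
		// Full body returned: the Range header was ignored.
		return restartFromZero, nil
	case http.StatusRequestedRangeNotSatisfiable:
		// "bytes */<total>" with total == offset means nothing is left to fetch.
		var total int64
		if _, err := fmt.Sscanf(contentRange, "bytes */%d", &total); err == nil && total == offset {
			return alreadyComplete, nil
		}
		return restartFromZero, nil
	default:
		return 0, fmt.Errorf("unexpected status %d for resume request", status)
	}
}
```

Treating a 200 response as "restart" rather than as an error is what keeps the fallback path deterministic in this sketch: a provider that ignores Range still yields a correct file, only without the time savings.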

Deliverables

  • Branch with the snapshot download pipeline improvements, pushed for review
  • Unit tests covering range/resume/fallback behaviors, pushed for review
  • Follow-up notes documenting provider-specific throughput/fallback observations
