fix: normalize torch version to >=2.1.0 across all dependency files by spoturno · Pull Request #37 · kyegomez/OpenMythos

spoturno · 2026-04-21T20:00:29Z

Problem

The torch version requirement is inconsistent across the three dependency files:

| File | Was |
| --- | --- |
| `pyproject.toml` | `torch = "2.11.0"` (exact pin to a non-existent version) |
| `requirements.txt` | `torch>=2.1.0` |
| `training/requirements.txt` | `torch>=2.11.0` (also non-existent) |

torch 2.11.0 does not exist on PyPI; the 2.x releases are numbered 2.1.x, 2.2.x, 2.3.x, and so on. As a result, installation fails for anyone installing from pyproject.toml or the training requirements.

Fix

Aligned all three files to torch>=2.1.0, matching the existing requirements.txt convention.

pyproject.toml pinned torch==2.11.0 (an invalid version), requirements.txt used >=2.1.0, and training/requirements.txt used >=2.11.0. Aligned all three to >=2.1.0 for consistency.
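A mismatch like this could be caught by a small consistency check. The following is a minimal sketch, not part of this PR: the helper names `torch_spec` and `find_mismatches` are hypothetical, it assumes a Poetry-style `torch = "..."` entry in `pyproject.toml` (where a bare version is an exact pin), and it only compares the specifier strings without checking whether a version exists on PyPI.

```python
import re

# Matches a torch requirement in requirements.txt style (torch>=2.1.0)
# or in Poetry-style pyproject.toml (torch = "2.11.0").
TORCH_RE = re.compile(
    r'^\s*torch\s*(?:=\s*"([^"]+)"|([<>=!~]=?[^\s#;]+))',
    re.MULTILINE,
)


def torch_spec(text):
    """Return the torch version specifier found in `text`, or None."""
    match = TORCH_RE.search(text)
    if match is None:
        return None
    spec = match.group(1) or match.group(2)
    # Assumption: a bare version such as "2.11.0" is an exact pin.
    if spec[0].isdigit():
        spec = "==" + spec
    return spec


def find_mismatches(files):
    """Map each file name to its torch specifier and flag any disagreement."""
    specs = {name: torch_spec(text) for name, text in files.items()}
    return specs, len(set(specs.values())) > 1


# File contents as described in the Problem section (before the fix).
before = {
    "pyproject.toml": '[tool.poetry.dependencies]\ntorch = "2.11.0"\n',
    "requirements.txt": "torch>=2.1.0\n",
    "training/requirements.txt": "torch>=2.11.0\n",
}
specs, mismatch = find_mismatches(before)
print(specs, mismatch)
```

Run against the pre-fix contents, the check reports three different specifiers; once all three files carry `>=2.1.0` it reports none.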

… + sft_lora (PR kyegomez#37) Three of the four GB10 cluster nodes have been sitting idle while 11 backfill runs (inference_benchmark + sft_lora) wait on Blackwell. The new Phase A auto-place feature from gpufarm PR kyegomez#37 substitutes resource_class: [auto] with the configured priority [blackwell, gb10, rtx5090, strix_halo] at load time so these jobs can fall back to GB10 when Blackwell is busy. inference_benchmark runs 5 min and was previously blackwell-only; sft_lora needs ~24 GB VRAM so any modern GPU fits. The other Blackwell-only jobs (consolidate_single_host, training_round_joint, etc.) keep their explicit declarations because they depend on Blackwell-specific behaviour or VRAM.
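The load-time substitution described in this commit message can be pictured roughly as follows. This is only an illustrative sketch: the priority list is the one quoted above, but the function name `expand_resource_class` and the list-of-strings representation are assumptions, not gpufarm's actual code.

```python
# Priority order quoted in the commit message above.
AUTO_PRIORITY = ["blackwell", "gb10", "rtx5090", "strix_halo"]


def expand_resource_class(declared, priority=AUTO_PRIORITY):
    """Replace the placeholder ["auto"] with the configured priority list.

    Explicit declarations (for example ["blackwell"]) are returned unchanged,
    so Blackwell-only jobs keep their placement.
    """
    if declared == ["auto"]:
        return list(priority)
    return list(declared)


print(expand_resource_class(["auto"]))
print(expand_resource_class(["blackwell"]))
```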

The 11 queued backfill runs (inference_benchmark + sft_lora) declare resource_class: [auto] which substitutes to [blackwell, gb10, rtx5090, strix_halo] per PR kyegomez#37. With Blackwell occupied by the Qwen campaign, they should fall through to GB10. But cluster-4node's can_run list only includes training/eval_4node jobs -- single-node compute jobs couldn't dispatch there. Adds individual node resources gb10-spark, gb10-gx10, gb10-gx10-3 with can_run lists that include the standard single-node eval bundle. They overlap cluster-4node on hardware -- operator-mutex pattern, same as blackwell-host vs blackwell-gpu0/1. gx10-2 is deliberately omitted because it currently hosts the Qwen 35B-A3B-FP8 BCB endpoint (~92 GB unified mem); add it back once that completes.
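The dispatch problem described here can be sketched as a lookup over resources and their can_run lists. The sketch below is hypothetical throughout: `pick_resource`, the dictionary layout, and the exact can_run entries are assumptions chosen to mirror the commit message, and it does not model the operator-mutex overlap between `cluster-4node` and the individual nodes.

```python
def pick_resource(job, classes, resources, busy):
    """Return the first free resource that may run `job`.

    Classes are tried in priority order; a resource qualifies only if
    `job` appears in its can_run list and it is not currently busy.
    """
    for cls in classes:
        for name, res in resources.items():
            if res["class"] == cls and job in res["can_run"] and name not in busy:
                return name
    return None


classes = ["blackwell", "gb10", "rtx5090", "strix_halo"]
busy = {"blackwell-host"}  # occupied by another campaign

# Before the change: the only GB10 resource accepts 4-node jobs only.
resources = {
    "blackwell-host": {"class": "blackwell", "can_run": {"inference_benchmark", "sft_lora"}},
    "cluster-4node": {"class": "gb10", "can_run": {"eval_4node"}},
}
print(pick_resource("sft_lora", classes, resources, busy))

# After the change: an individual GB10 node accepts single-node jobs.
resources["gb10-spark"] = {"class": "gb10", "can_run": {"inference_benchmark", "sft_lora"}}
print(pick_resource("sft_lora", classes, resources, busy))
```

With only `cluster-4node` registered for the GB10 class, the job finds no eligible resource; adding a per-node resource with a matching can_run list lets it fall through to GB10.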

fix: normalize torch version to >=2.1.0 across all dependency files

b0a5b1a


spoturno wants to merge 1 commit into kyegomez:main from spoturno:fix/torch-version-consistency
