
fix: normalize torch version to >=2.1.0 across all dependency files #37

Open
spoturno wants to merge 1 commit into kyegomez:main from spoturno:fix/torch-version-consistency


@spoturno


Problem

The torch version requirement is inconsistent across the three dependency files:

| File | Was |
| --- | --- |
| pyproject.toml | torch = "2.11.0" (exact pin to a non-existent version) |
| requirements.txt | torch>=2.1.0 |
| training/requirements.txt | torch>=2.11.0 (also non-existent) |

torch 2.11.0 does not exist on PyPI; the 2.x releases are numbered 2.1.x, 2.2.x, 2.3.x, and so on. The bad version causes install failures for anyone installing from pyproject.toml or the training requirements.

Fix

Aligned all three files to torch>=2.1.0, matching the existing requirements.txt convention.
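
One way to keep the three files from drifting again is a small consistency check. The following is only a sketch: the helper names are hypothetical, the file contents are reduced to the torch line, and the post-fix `pyproject.toml` syntax (`torch = ">=2.1.0"`) is an assumption, since the PR text does not show the exact line.

```python
import re

# Poetry-style pyproject.toml entry, e.g. torch = ">=2.1.0"
PYPROJECT_FORM = re.compile(r'^\s*torch\s*=\s*"([^"]+)"')
# requirements.txt entry, e.g. torch>=2.1.0
REQUIREMENTS_FORM = re.compile(r'^\s*torch\s*([<>=!~]{1,2}\s*[\d.*]+)')


def torch_constraint(text):
    """Return the torch version constraint found in a dependency file's text, or None."""
    for line in text.splitlines():
        match = PYPROJECT_FORM.match(line) or REQUIREMENTS_FORM.match(line)
        if match:
            return match.group(1).replace(" ", "")
    return None


def distinct_constraints(files):
    """Take a mapping of path -> file text, return the set of distinct torch constraints."""
    return {torch_constraint(text) for text in files.values()}


# File contents as described in the PR, abbreviated to the torch line.
before = {
    "pyproject.toml": 'torch = "2.11.0"\n',
    "requirements.txt": "torch>=2.1.0\n",
    "training/requirements.txt": "torch>=2.11.0\n",
}
# Assumed state after the fix (exact pyproject.toml syntax is a guess).
after = {
    "pyproject.toml": 'torch = ">=2.1.0"\n',
    "requirements.txt": "torch>=2.1.0\n",
    "training/requirements.txt": "torch>=2.1.0\n",
}

print(sorted(distinct_constraints(before)))  # three different constraints
print(sorted(distinct_constraints(after)))   # a single shared constraint
```

A check like this could run in CI and fail whenever the set contains more than one constraint.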

pyproject.toml pinned torch==2.11.0 (an invalid version), requirements.txt
used >=2.1.0, and training/requirements.txt used >=2.11.0. Aligned all
three to >=2.1.0 for consistency.
amittell added a commit to amittell/OpenMythos that referenced this pull request Jun 6, 2026
… + sft_lora (PR kyegomez#37)

Three of the four GB10 cluster nodes have been sitting idle while 11 backfill
runs (inference_benchmark + sft_lora) wait on Blackwell. The new Phase A
auto-place feature from gpufarm PR kyegomez#37 substitutes resource_class: [auto]
with the configured priority [blackwell, gb10, rtx5090, strix_halo] at load
time so these jobs can fall back to GB10 when Blackwell is busy.

inference_benchmark runs 5 min and was previously blackwell-only; sft_lora
needs ~24 GB VRAM so any modern GPU fits. The other Blackwell-only jobs
(consolidate_single_host, training_round_joint, etc.) keep their explicit
declarations because they depend on Blackwell-specific behaviour or VRAM.
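
The substitution described in this commit message can be sketched roughly as follows. This is not gpufarm's actual code: the function names and the `busy` set are hypothetical, and only the priority list and the `[auto]` placeholder come from the message above.

```python
# Priority order quoted in the commit message.
AUTO_PRIORITY = ["blackwell", "gb10", "rtx5090", "strix_halo"]


def expand_resource_class(declared, priority=AUTO_PRIORITY):
    """Replace the [auto] placeholder with the priority list; keep explicit lists as-is."""
    if declared == ["auto"]:
        return list(priority)
    return list(declared)


def first_free(classes, busy):
    """Return the first resource class that is not busy, or None if all are busy."""
    for resource_class in classes:
        if resource_class not in busy:
            return resource_class
    return None


# Hypothetical situation: Blackwell is occupied.
busy = {"blackwell"}

# A job declaring [auto] falls through to the next class in the list.
print(first_free(expand_resource_class(["auto"]), busy))

# A job with an explicit Blackwell-only declaration has nowhere else to go.
print(first_free(expand_resource_class(["blackwell"]), busy))
```

Expanding at load time means the scheduler itself only ever sees explicit lists, which is presumably why the Blackwell-only jobs can keep their declarations unchanged.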
amittell added a commit to amittell/OpenMythos that referenced this pull request Jun 6, 2026
The 11 queued backfill runs (inference_benchmark + sft_lora) declare
resource_class: [auto] which substitutes to [blackwell, gb10, rtx5090,
strix_halo] per PR kyegomez#37. With Blackwell occupied by the Qwen campaign,
they should fall through to GB10. But cluster-4node's can_run list only
includes training/eval_4node jobs -- single-node compute jobs couldn't
dispatch there.

Adds individual node resources gb10-spark, gb10-gx10, gb10-gx10-3 with
can_run lists that include the standard single-node eval bundle. They
overlap cluster-4node on hardware -- operator-mutex pattern, same as
blackwell-host vs blackwell-gpu0/1.

gx10-2 is deliberately omitted because it currently hosts the Qwen
35B-A3B-FP8 BCB endpoint (~92 GB unified mem); add it back once that
completes.
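
A rough illustration of why the jobs could not dispatch before this change, and why they can after it. The data layout, the job names inside each `can_run` list, and the helper are all assumptions made for the example; only the resource names are taken from the commit message.

```python
def eligible_resources(job_name, resources):
    """Return the names of resources whose can_run list includes the job."""
    return [name for name, spec in resources.items() if job_name in spec["can_run"]]


# Hypothetical config before the change: only the 4-node cluster resource exists,
# and its can_run list covers 4-node jobs only.
before = {
    "cluster-4node": {"can_run": ["training_4node", "eval_4node"]},
}

# Hypothetical config after the change: per-node resources are added with a
# single-node bundle. gx10-2 is left out, as in the commit message.
single_node_bundle = ["inference_benchmark", "sft_lora"]
after = {
    "cluster-4node": {"can_run": ["training_4node", "eval_4node"]},
    "gb10-spark": {"can_run": single_node_bundle},
    "gb10-gx10": {"can_run": single_node_bundle},
    "gb10-gx10-3": {"can_run": single_node_bundle},
}

print(eligible_resources("sft_lora", before))  # no resource accepts the job
print(eligible_resources("sft_lora", after))   # the three per-node resources
```

The sketch ignores the hardware overlap between `cluster-4node` and the per-node resources, which the commit message leaves to the operator-mutex pattern.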