
Conversation

@Chao1Han
Owner

@Chao1Han Chao1Han commented Aug 4, 2025

Fixes #ISSUE_NUMBER

# shard_consumer(shards[remote_rank], remote_rank)

stage_size = 2
for stage in range(1, group_size, stage_size):
Collaborator


for stage in range(1, stage_size):

pytorchmergebot and others added 29 commits August 21, 2025 08:13
…rdering collectives (pytorch#160113)"

This reverts commit 9d18bf0.

Reverted pytorch#160113 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of failures showing up after this lands ([comment](pytorch#160113 (comment)))
…torch#161084)"

This reverts commit cfdaaaa.

Reverted pytorch#161084 on behalf of https://github.com/huydhn due to My mistake in not checking for nvidia-smi availability ([comment](pytorch#161084 (comment)))
…0957)

**Summary**
When the output dtype is fp8, oneDNN does not ensure intermediate results are in the range [-448, 448] before converting to fp8, so we may get NaN in the output, which is a disaster for inference. This PR fixes the issue by clamping the intermediate results with oneDNN's post-op clip.
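A minimal sketch of the clamp-before-convert idea in plain PyTorch (illustrative only; the actual fix uses oneDNN's post-op clip inside the fused kernel):

```python
import torch

FP8_E4M3_MAX = 448.0  # float8_e4m3fn can represent values roughly in [-448, 448]

def convert_to_fp8_safely(intermediate: torch.Tensor) -> torch.Tensor:
    # Clamp the higher-precision intermediate into the representable fp8 range
    # before the downcast, mirroring what the post-op clip does in oneDNN.
    clamped = intermediate.clamp(min=-FP8_E4M3_MAX, max=FP8_E4M3_MAX)
    return clamped.to(torch.float8_e4m3fn)
```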

**Test plan**
```
pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8"
```

Pull Request resolved: pytorch#160957
Approved by: https://github.com/Valentine233, https://github.com/CaoE
…torch#161136)

Expose the pointer so that we can create the `ncclConfig_t` object from PyTorch and use it elsewhere. This is useful for controlling the NCCL communicator parameters for multiple NCCL communicators.

Pull Request resolved: pytorch#161136
Approved by: https://github.com/kwen2501
Fixes an issue that exhibited a device-side memory access fault due to incorrect tensor lifetime management.

Pull Request resolved: pytorch#154864
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
…collectives (pytorch#160113)

1. Applying @eellison's idea from pytorch#146562 (comment) for estimate_peak_memory (a minimal sketch follows this list):
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in the `reorder_communication_preserving_peak_memory` pass.

2. During reordering, a buffer's deallocation point can change if both the candidate and the group being swapped are users of the f_input_buf and the group contains its last_use_snode.

- This is addressed by tracking the last_use_snode for each buffer and recomputing current memory to account for the change in size_free (after reordering, the group node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user and candidate.size_free += buffer_size).

3. Adding the env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation, to limit the number of collectives to reorder.
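A minimal sketch of the phase-aware estimate from (1), using hypothetical node attributes (`size_alloc`, `size_free`) rather than the actual Inductor scheduler types:

```python
def estimate_peak_memory_phased(snodes, initial_live_bytes=0):
    """Peak is sampled after each node's allocations but before its deallocations."""
    cur = peak = initial_live_bytes
    for snode in snodes:
        cur += snode.size_alloc      # phase 1: allocate outputs
        peak = max(peak, cur)        # phase 2: kernel runs while inputs and outputs are live
        cur -= snode.size_free       # phase 3: free last-use buffers
    return peak
```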

Status after this PR:

Iterative recomputation of memory estimates matches the full memory estimation.

Active memory does not regress much, but reserved memory regresses significantly.

Investigation and a fix for the "reserved" memory regression will follow in subsequent PRs.

BASELINE (bucketing AG and RS): active: 32 GiB, reserved: 34 GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32 GiB, reserved: 36 GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35 GiB, reserved: 41 GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535)
Pull Request resolved: pytorch#160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <[email protected]>
…rdering collectives (pytorch#160113)"

This reverts commit 517d38d.

Reverted pytorch#160113 on behalf of https://github.com/IvanKobzarev due to Segment tree starts failing on trunk even ciflows/trunk passed on PR ([comment](pytorch#160113 (comment)))
…)"

This reverts commit febfc3e.

Reverted pytorch#160794 on behalf of https://github.com/seemethere due to This if failing internal tests, see D80671241 ([comment](pytorch#160794 (comment)))
Summary: Added a `_set_node_metadata_hook` which automatically adds node.meta["val"] to every new node that gets created under this context.
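A hedged sketch of the hook-context idea (the mechanism shown here is illustrative, not the actual FX implementation):

```python
import contextlib

@contextlib.contextmanager
def _set_node_metadata_hook(gm, hook):
    """While active, every node created on gm's graph is passed through hook."""
    original_create_node = gm.graph.create_node

    def create_node(*args, **kwargs):
        node = original_create_node(*args, **kwargs)
        hook(node)  # e.g. populate node.meta["val"] via fake-tensor propagation
        return node

    gm.graph.create_node = create_node
    try:
        yield
    finally:
        gm.graph.create_node = original_create_node
```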

Test Plan:
` buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_advanced_ops`
https://www.internalfb.com/buck2/866439a2-2ba6-42d1-8e43-508d60456e2e

`buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_basic_ops`
https://www.internalfb.com/intern/testinfra/testrun/11540474149662857

Rollback Plan:

Differential Revision: D80579336

Pull Request resolved: pytorch#161019
Approved by: https://github.com/blaine-rister
…1084)

Fixes pytorch#160988. The root cause can be found in the same issue. This fix ensures that when wheel reuse is enabled and the `torchaudio` wheel is not there, the inductor test job can still rebuild the wheel it needs.
Pull Request resolved: pytorch#161084
Approved by: https://github.com/malfet, https://github.com/zou3519
…ytorch#160902)

This adds a new function `bypass_package` and `CompilePackage.bypass_current_entry()`. These allow us to safely bypass caching when a model has unserializable or incompatible parts. When we encounter something incompatible, we raise a bypass and ignore that particular code in DynamoCodeEntry.
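A rough sketch of the bypass flow (helper names other than `bypass_current_entry()` are hypothetical):

```python
class PackageBypass(Exception):
    """Raised when a code entry cannot be serialized into the compile package."""

def record_code_entry(package, code, artifact, is_serializable):
    # `package` is assumed to expose add_entry() and bypass_current_entry();
    # only the latter name comes from this PR's description.
    try:
        if not is_serializable(artifact):
            raise PackageBypass("unserializable artifact")
        package.add_entry(code, artifact)
    except PackageBypass:
        package.bypass_current_entry()  # skip caching for just this code entry
```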

Pull Request resolved: pytorch#160902
Approved by: https://github.com/zhxchen17
Summary: att - should be no-op

Test Plan:
buck2 test //caffe2/test/cpp/nativert:static_kernel_ops_tests
Tests finished: Pass 24. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D80216488

Pull Request resolved: pytorch#161087
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
…ach operations (pytorch#161128)

Summary:
The `reserve()` method is used to pre-allocate memory for the result vector before adding elements to it. This is an optimization that makes sense for several reasons:

1. Performance improvement: By pre-allocating memory for the exact number of elements needed, it avoids multiple reallocations and memory copies that would occur as the vector grows dynamically.

2. Memory efficiency: It ensures that the vector allocates exactly the amount of memory needed, no more and no less, which is efficient when we know the final size in advance.

3. Reduced overhead: Each reallocation typically involves:
- Allocating a new, larger block of memory
- Copying all existing elements to the new location
- Destroying the old elements
- Deallocating the old memory block

4. Consistent performance: Without reservation, vector growth typically follows a geometric progression (like 1, 2, 4, 8, 16...), which can lead to unpredictable performance spikes when reallocation occurs.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80674453

Pull Request resolved: pytorch#161128
Approved by: https://github.com/Skylion007
…h#161039)

Summary: C++ polymorphism and component reuse help us reduce the amount of boilerplate code here.

Test Plan:
CI & tests

Rollback Plan:

Differential Revision: D80594353

Pull Request resolved: pytorch#161039
Approved by: https://github.com/janeyx99
Disable the min/max macros on Windows.

Pull Request resolved: pytorch#161133
Approved by: https://github.com/angelayi
Changes:
1. Change `include_dirs` from being overwritten to being appended to.
2. Append `libraries_dirs` for Level Zero.

Pull Request resolved: pytorch#161146
Approved by: https://github.com/angelayi
Fixes pytorch#161066

There is a size check here, which causes the error.
https://github.com/pytorch/pytorch/blob/8ce81bcee1da294a34af0a90dc16483055e8c5a4/aten/src/ATen/native/mps/operations/Pad.mm#L39-L40

If the argument `pad` is empty, it will return the cloned tensor on CPU.

https://github.com/pytorch/pytorch/blob/8ce81bcee1da294a34af0a90dc16483055e8c5a4/aten/src/ATen/native/PadNd.cpp#L43-L64

Therefore, this PR fixes the empty padding argument error by checking the size first and returning a cloned tensor immediately if the padding size is 0.
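A minimal repro-style sketch of the expected behavior (assumes an MPS-capable build; before this fix the MPS path hit the size check instead of returning early):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, device="mps")
y = F.pad(x, [])                 # empty pad: should just return a clone of x
assert torch.equal(y.cpu(), x.cpu())
```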

Pull Request resolved: pytorch#161149
Approved by: https://github.com/malfet
torch._inductor.virtualized.OpsValue instances do not have a shape attribute. This breaks the fp8 test on ROCm. Add the OpsValue class to the todo list.

Pull Request resolved: pytorch#161166
Approved by: https://github.com/jeffdaily
…ytorch#160660)

Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, then some ranks don't have any work to do. When I tested, this case wasn't caught, and there was an extra barrier call, causing issues for the ranks that had no work to do. They should wait at the end, as the ranks with work do.
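A hedged sketch of the intended control flow (the per-file helper is passed in and hypothetical):

```python
import torch.distributed as dist

def consolidate_on_every_rank(sharded_files, consolidate_one_file, rank, world_size):
    my_files = sharded_files[rank::world_size]   # ranks beyond len(files) get an empty slice
    for f in my_files:                           # idle ranks skip straight to the barrier
        consolidate_one_file(f)                  # hypothetical per-file consolidation step
    dist.barrier()                               # every rank, busy or idle, hits exactly one barrier
```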

Test Plan:
tested this case on a job e2e
added a unit test

Rollback Plan:

Differential Revision: D80273616

Pull Request resolved: pytorch#160660
Approved by: https://github.com/sibuachu
…orch#161145)

Combine multiple profiles into one:
```
python profile_analysis.py --combine <file1> <file2> ... <out>
```
This only works well if they have different pids, like from different programs in a distributed run.

<img width="1521" height="465" alt="combining_multiple_profiles" src="https://github.com/user-attachments/assets/aba7112b-e9a9-4075-b82b-a4e4408384da" />

Pull Request resolved: pytorch#161145
Approved by: https://github.com/xmfan
Set up the vLLM test logic:
1. Install the wheels generated from the previous build stage.
2. Generate and install the vLLM test package list at run time, based on the torch wheels on the instance.
3. Run tests based on the pre-defined test plan.

Note that the test-plan format is temporary and intended for some basic vLLM testing.
Pull Request resolved: pytorch#160361
Approved by: https://github.com/atalman, https://github.com/huydhn
…orch#160482)

Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on pytorch#159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.
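A hedged usage sketch of the new semantics (import paths, shapes, and devices are assumptions; the `generator=` kwarg is the one discussed above):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor

torch.manual_seed(1234)                          # must use the same seed on every rank
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
dt = distribute_tensor(torch.empty(1024, 1024), mesh)

dt.uniform_()                                    # advances the default generator's state

g = torch.Generator(device="cuda").manual_seed(42)
dt.uniform_(generator=g)                         # a user-passed generator is advanced too
```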

Fixes pytorch#159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: pytorch#160482
Approved by: https://github.com/wanchaol
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel-by-kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel-by-kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Differential Revision: [D80630416](https://our.internmc.facebook.com/intern/diff/D80630416)
Pull Request resolved: pytorch#159093
Approved by: https://github.com/jansel
…orch#161201)

By changing the dtype to float if the device is MPS.

Note: for some reason, the test runs much longer on MPS than on CPU:
```
% python ../test/test_indexing.py -v -k test_index_put_accumulate_duplicate_indices_mps
test_index_put_accumulate_duplicate_indices_mps (__main__.TestIndexingMPS.test_index_put_accumulate_duplicate_indices_mps) ... ok

----------------------------------------------------------------------
Ran 1 test in 9.139s

OK
```
Pull Request resolved: pytorch#161201
Approved by: https://github.com/dcci
…addmm (pytorch#155357)"

This reverts commit ce048de.

Reverted pytorch#155357 on behalf of https://github.com/seemethere due to This is causing buck builds to fail since we didn't add the definition of AT_USE_EIGEN_SPARSE in the buckbuild.bzl file, will follow-up and re-land this. ([comment](pytorch#155357 (comment)))
Bumps [uv](https://github.com/astral-sh/uv) from 0.8.4 to 0.8.6.
- [Release notes](https://github.com/astral-sh/uv/releases)
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md)
- [Commits](astral-sh/uv@0.8.4...0.8.6)

---
updated-dependencies:
- dependency-name: uv
  dependency-version: 0.8.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ch#160205)

Parallelize reading of data behind a thread_count argument to HFStorageReader.
Test plan: ensure existing tests pass and run a job successfully with these changes.
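A generic sketch of the pattern (not the actual HFStorageReader code; `read_one_file` is a hypothetical helper passed in by the caller):

```python
from concurrent.futures import ThreadPoolExecutor

def read_all(paths, read_one_file, thread_count=1):
    if thread_count <= 1:
        return [read_one_file(p) for p in paths]        # previous, serial behavior
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(read_one_file, paths))     # reads proceed in parallel
```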

Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/)

Pull Request resolved: pytorch#160205
Approved by: https://github.com/meetv18
Summary: att - changed one of the tests to get rid of torcharrow dep.

Test Plan:
```
buck2 test //caffe2/test/cpp/nativert:layout_planner_tests
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D80108549

Pull Request resolved: pytorch#160942
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
justinchuby and others added 27 commits August 23, 2025 02:42
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#161227
Approved by: https://github.com/pytorchbot
After we switched to constructing the registry with the specified opset version in dynamo=True, support for opset<18 was broken because there would be no torchlib ops registered for these opsets. I updated the registry creation logic to always use opset 18 if the requested opset is lower, and use the version converter (as designed) to target those opsets.
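A hedged usage sketch of requesting an older opset with the dynamo exporter:

```python
import torch

model = torch.nn.Linear(4, 4)
onnx_program = torch.onnx.export(
    model,
    (torch.randn(2, 4),),
    dynamo=True,
    opset_version=17,   # < 18: built against the opset-18 registry, then down-converted
)
```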

This requires onnxscript>=0.4 (pytorch#161312)

Fixes onnx/onnx#7235

Pull Request resolved: pytorch#161056
Approved by: https://github.com/titaiwangms
Enable more vLLM tests against PyTorch main, and add a schedule to run the tests every 12 hours.

Pull Request resolved: pytorch#161192
Approved by: https://github.com/huydhn
…torch#161170)

Users encountered unexpected behaviour when using FlexAttention with learnable biases, including assertion errors (pytorch#157677).

We traced the root cause to the registration of subgraph buffers: this caused inconsistencies in naming and ultimately incorrect retrieval later on. The problem only arose if the model was compiled as a whole (i.e., using @torch.compile), since only then were there naming conflicts.

In this PR, we register the buffers with the base graph to solve this issue.
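A hedged repro-style sketch of the whole-model-compile case (shapes, device, and the simple additive bias are assumptions):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

bias = torch.nn.Parameter(torch.zeros(128, 128, device="cuda"))   # learnable bias

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[q_idx, kv_idx]       # the bias ends up as a subgraph buffer

@torch.compile                                # compiling the whole model triggered the naming clash
def attn(q, k, v):
    return flex_attention(q, k, v, score_mod=score_mod)

q = k = v = torch.randn(1, 4, 128, 64, device="cuda", requires_grad=True)
out = attn(q, k, v)
out.sum().backward()                          # gradients flow back to the learnable bias
```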

Pull Request resolved: pytorch#161170
Approved by: https://github.com/drisspg
…ytorch#160919)

Summary:
[The diff was reverted due to CLA error, in the process of retrieving account]
Previous error message
```
RuntimeError: Expected input at *args.<unknown location>.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
New error message
```
RuntimeError: Expected input at *args.[0].supervision_input.weight.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```

Test Plan:
```
buck test mode/opt apf/rec/ir/tests:ir_export_deserialize_test
```
https://www.internalfb.com/intern/testinfra/testrun/4785074906254375

```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```

```
Ran 413 tests in 208.414s

OK (skipped=1, expected failures=13)
```

Rollback Plan:

Differential Revision: D80487367

Pull Request resolved: pytorch#160919
Approved by: https://github.com/angelayi
Summary: AMD-specific kwargs need to be removed from the guard, otherwise a KeyError will be raised when executing the kernel.

Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.

Rollback Plan:

Differential Revision: D80285441

Pull Request resolved: pytorch#160671
Approved by: https://github.com/muchulee8
# Context
In pytorch#160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See pytorch#160006 for a full description of the challenges.

Although pytorch#160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](pytorch#160006 (comment))

Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the most straightforward one that works for *both* `str` and `Callable` entrypoints.

This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both the `str` and `Callable` paths (a minimal sketch follows below).
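A minimal sketch of the context manager described above (names are hypothetical, not the actual torch.distributed.elastic code):

```python
import os
from contextlib import contextmanager

@contextmanager
def bind_parent_cpus(cpus):
    """Temporarily set the parent's CPU affinity so the next spawned child inherits it."""
    original = os.sched_getaffinity(0)     # 0 = this (parent) process
    os.sched_setaffinity(0, cpus)          # 1. bind to local rank i's CPU set
    try:
        yield                              # 2. spawn the subprocess for local rank i here
    finally:
        os.sched_setaffinity(0, original)  # 3. restore the parent's original affinity
```

Whatever spawning mechanism runs inside the `with bind_parent_cpus({...}):` block (subprocess, multiprocessing spawn, etc.) inherits the temporary affinity, which is why the same context manager works for both `str` and `Callable` entrypoints.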

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: pytorch#161183
Approved by: https://github.com/d4l3k
…1182)"

This reverts commit e20f6d7.

Reverted pytorch#161182 on behalf of https://github.com/zou3519 due to broke dynamo_wrapped tests, those are a bit finicky to fix (there is probably more than one failure!) ([comment](pytorch#161182 (comment)))
By passing strides to the strided variant of the tensor.

Fixes pytorch#160993
Pull Request resolved: pytorch#161333
Approved by: https://github.com/huydhn, https://github.com/wdvr
ghstack dependencies: pytorch#161206, pytorch#161267
…ytorch#155357)

This pull request adds the following ops for sparse matrices using Eigen library:
```python
    add(a_csr, b_csr)
    add(a_csc, b_csc)

    addmm(c_csr, a_csr, b_csr)
    addmm(c_csr, a_csr, b_csc)
    addmm(c_csr, a_csc, b_csc)
    addmm(c_csr, a_csc, b_csr)

    addmm(c_csc, a_csr, b_csr)
    addmm(c_csc, a_csr, b_csc)
    addmm(c_csc, a_csc, b_csc)
    addmm(c_csc, a_csc, b_csr)
```

Currently, these sparse-matrix operations on CPU are available only through MKL. Because MKL does not exist on `aarch64`, these ops are unavailable on machines with ARM-based CPUs, including Apple Silicon, AWS Graviton, and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.

This is a refactored version of my previous PR pytorch#101814. The main difference from the old one is that this version does not enable Eigen by default.
Pull Request resolved: pytorch#155357
Approved by: https://github.com/pearu, https://github.com/eqy

Co-authored-by: Eli Uriegas <[email protected]>
…ytorch#161316)

pytorch#159779

CUDA 13 added support for the --compress-mode flag for nvcc across all drivers of the CUDA 13.x toolkits, making it possible to use --compress-mode=size for a significant size reduction (for example, ~71% less for the CUDA Math APIs). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/

Why we add this for CUDA 13 only, quoting @ptrblck: Any usage of --compress-mode=size/balance will drop the support of older CUDA drivers and will bump the min. driver requirement to CUDA 12.4. pytorch#157791 (comment)

The default for CUDA 13 will be --compress-mode=balance, which gives smaller binaries than the LZ4 speed mode used in previous CUDA versions.

Related - pytorch#157791

Pull Request resolved: pytorch#161316
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
A single-device version of Muon. The algorithm follows Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/) and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.

This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.

**Usage**

```python
    model = MyModelForCausalLM
    # filter out your params manually
    muon_params = [...]
    adamw_params = [...]
    muon = Muon(
        params=muon_params,
        lr=lr,
        weight_decay=wd,
    )
    adamw = AdamW(
        params=adamw_params,
        lr=lr,
        weight_decay=wd,
    )

    # in training loop
    loss = model(input)
    loss.backward()
    muon.step()
    adamw.step()
    muon.zero_grad()
    adamw.zero_grad()
```

~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~

~~`AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]`~~
~~`MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]`~~

As discussed with the team and in the comments, we prefer to keep the interface simpler and cleaner, so we removed the callback interface and canonicalized the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.

By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.

**Testing**

~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~

As discussed, in order not to complicate the codebase, we prefer not to include the reference implementation in PyTorch. We also updated the interface so we don't need to test the FQN-based filtering. Muon is covered by the existing `test_optim.py` unit tests.

2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />

**Numerics**
We evaluated our implementation against the existing implementation to confirm numerical consistency.

As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.

As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below.

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = KellySingleDeviceMuon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

As explained above, including this `adjust_lr` is preferable. This is validated by an e2e training run of a Qwen-2-like 0.5B model, where the curves show that training with `adjust_lr` converges more effectively than without it.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />

**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:

- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min

Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.

AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />

**Note**
We restrict the implementation to accept only 2D parameters.

An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.

Since Muon is designed to perform orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound.

**Next Steps**

1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.


Pull Request resolved: pytorch#160213
Approved by: https://github.com/janeyx99
…orch#160482)

Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on pytorch#159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.

Fixes pytorch#159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: pytorch#160482
Approved by: https://github.com/wanchaol
In this PR we port all the distributed pipeline test files.
We enable Intel GPU with the following methods, while trying our best to keep the original code style (a minimal sketch follows the list):

1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get backend
5. enable XPU for some test paths
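A hedged sketch of the device-generic selection described above (import paths are assumptions; `get_default_backend_for_device` is the helper named in this list):

```python
import torch
import torch.distributed as dist

device_type = torch.accelerator.current_accelerator().type    # "cuda", "xpu", ...
backend = dist.get_default_backend_for_device(device_type)    # e.g. "nccl" on CUDA
dist.init_process_group(backend=backend)                      # assumes the usual env:// rendezvous
```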

Pull Request resolved: pytorch#159033
Approved by: https://github.com/guangyey, https://github.com/kwen2501