@pragupta pragupta commented Nov 6, 2025

rocm_base: 3d74218

slayton58 and others added 30 commits November 4, 2025 15:53
Summary:

* Pull in `f4f4bf16` from FBGemm to provide MXFP4 support for CUDA
* Add testing

Signed-off-by: Simon Layton <[email protected]>
Pull Request resolved: pytorch#166526
Approved by: https://github.com/drisspg, https://github.com/ngimel
Results from CI:
No failures, but runs generally take longer, maybe a ~20% increase in time?
But the smaller runner is ~25% of the cost of the current runner, so in terms of cost this is a decrease.

If the 20% is too much, we can try the 4x larger runners, which are about half the cost of the current runner, so it would probably still result in cost savings with hopefully less impact on time.

Pull Request resolved: pytorch#164989
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn
You can just subtract timestamps, but this makes it easier
Pull Request resolved: pytorch#166447
Approved by: https://github.com/Skylion007
We store a mapping between the generated fx graph code and the original model code stack trace in `fx.traceback._FX_METADATA_REGISTRY`, and we post-process the memory snapshot to append the original model stack trace information.

To achieve this, the biggest change in `aot_eager` mode is giving each generated fx graph a unique stack trace, i.e. it cannot just be `<eval_with_key>`. We set co_filename to **pretend** that the code comes from the `co_filename` file. Now, instead of `<eval_with_key>` in the stack trace, we get something like `fx_generated_3a4b5c6d7e8f9a0.py`.

An `augment_with_fx_traces` arg is added to `torch.cuda.memory._snapshot` and `_dump_snapshot`. When the arg is set to True, a post-processing pass populates the snapshot frames with the original model stack trace.

The new behavior of GraphModule can be controlled by `TORCH_ENRICH_RPOFILER_STACK_TRACE` or `_dynamo.config.enrich_profiler_metadata=True`.

Alternative:

Instead of setting co_filename, we could also do it as shown below.
Note that if we do it this way, we will need to dump the file to make the graph module torch-scriptable: TorchScript requires source access in order to carry out compilation, so we need to make sure the original .py files are available.
```
        # `src` is the generated module source, `filename` the synthetic file name,
        # and `globals` the namespace the code is executed in.
        key = filename
        globals_copy = globals.copy()
        globals_copy["__file__"] = key
        globals_copy["__name__"] = key
        # Register the source with linecache so tracebacks and profilers can resolve it.
        linecache.lazycache(key, globals_copy)
        exec(compile(src, key, "exec"), globals)
```

Other changes:

- Update `MemoryViz.js` to display fx node information and the original model code if they exist

```
python test/test_fx.py -k test_lineno_map
python test/test_fx.py -k test_custom_traceback_raised
python test/test_public_bindings.py
python test/test_cuda.py -k test_fx_memory
python test/test_fx.py -k test_informative_co_filename
python test/test_fx.py -k test_autowrap_functions
python test/dynamo/test_utils.py -k test_inductor_provenance
```

```python
# Profile with memory snapshot
torch.cuda.memory._record_memory_history()

with torch._dynamo.config.patch("enrich_profiler_stack_trace", True):
    compiled = torch.compile(mod, backend="aot_eager", fullgraph=True)
    result = compiled(torch.randn(10, 10, device="cuda:0"))

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle", augment_with_fx_traces=True)
torch.cuda.memory._record_memory_history(enabled=None)
```

<img width="913" height="711" alt="Screenshot 2025-10-30 at 10 40 44 AM" src="https://github.com/user-attachments/assets/8d7a1833-f98d-4756-b666-1d63ab57b27b" />

Pull Request resolved: pytorch#166676
Approved by: https://github.com/albanD, https://github.com/ezyang
Use `[[maybe_unused]]` instead of the `(void) foo; // Unused parameter` trick, as it is a C++17 standard feature.

Will replace further repetitions of the same pattern soon after
Pull Request resolved: pytorch#166865
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007, https://github.com/janeyx99
…uction consumer (pytorch#166165)

Prefer unfused addmm when there is at least a single elemwise/reduction consumer.
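A minimal sketch of the kind of pattern this heuristic targets (shapes and names below are illustrative, not from the PR): an addmm whose output feeds an elementwise consumer, where leaving the mm and the bias add unfused lets Inductor fuse the add with the ReLU.

```python
import torch

# Illustrative only: the bias add and ReLU are elementwise consumers that
# Inductor can fuse together when the addmm is kept unfused.
def fn(x, w, b):
    return torch.relu(torch.addmm(b, x, w))

compiled = torch.compile(fn)
out = compiled(torch.randn(8, 16), torch.randn(16, 32), torch.randn(32))
```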

Pull Request resolved: pytorch#166165
Approved by: https://github.com/eellison
…66467)

This adds the capability to subproc pool to enable quiesce via a timer

Pull Request resolved: pytorch#166467
Approved by: https://github.com/masnesral
The deprecation warning led to warning spamming in PyTorch APIs, like
torch.compile. This is not how a deprecation warning should go: if we
add a deprecation warning, we'd better update our built-in APIs to
prevent warning spam.
Pull Request resolved: pytorch#166956
Approved by: https://github.com/albanD
Fixed some errors in the SECURITY.md file, including PyTorch capitalization problems, grammatical inconsistencies, etc.
Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#166718
Approved by: https://github.com/mikaylagawarecki
…#166961)

This is a PR to temporarily relieve the queueing caused by an MI250 node outage. See this ticket for more information:
pytorch#166866

It relaxes the GPU count check to allow distributed jobs to run on 2-GPU runners.

Pull Request resolved: pytorch#166961
Approved by: https://github.com/jeffdaily
…#166768)

And simplify the entire function to just assert and return

Pull Request resolved: pytorch#166768
Approved by: https://github.com/cyyever, https://github.com/atalman
Draft to expose the compiled saved tensor hook context so that hooks can be applied selectively.
Exposes node, fw_graph, and bw_graph.

Pull Request resolved: pytorch#166887
Approved by: https://github.com/bdhirsh
…166669)

For mix-order reduction, we currently force XBLOCK to be 1 to simplify codegen. Don't tune it in CDT.

Differential Revision: [D86224689](https://our.internmc.facebook.com/intern/diff/D86224689)
Pull Request resolved: pytorch#166669
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/eellison, https://github.com/v0i0
Noticed that workflow runs for `trunk/{sha}` tags (issued by autorevert) don't populate the test_run_s3 ClickHouse table.

This PR addresses this by changing the gate condition for uploading test stats.

See https://github.com/pytorch/pytorch/actions/runs/19054297956/job/54421254448#step:8:23
as evidence that HEAD_BRANCH is correctly populated for trunk tags.
Pull Request resolved: pytorch#166916
Approved by: https://github.com/huydhn, https://github.com/clee2000
So many times I build PyTorch only to notice that Chef nuked my nvcc and I wasted 30 minutes building a CPU-only version; let's hard-error fast.

Pull Request resolved: pytorch#166982
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#166976
…166581)

The major change is switching to a timer-based implementation. Additionally,
we get rid of the context manager for turning off the compile pool. We
still have the warmup calls.

Note that this only modifies the async_compile methods; the fx pool is
left running.

Pull Request resolved: pytorch#166581
Approved by: https://github.com/masnesral
ghstack dependencies: pytorch#166467
Fixes pytorch#159445

### Summary

- Fixed a stride layout issue in the `torch.linalg.eig` meta kernel that prevented successful compilation with the inductor backend. The meta kernel was producing incorrect row-major strides.

- LAPACK/BLAS libraries (underlying implementation) expect column-major layout
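A minimal repro sketch of the scenario from pytorch#159445 (shapes are arbitrary): compiling `torch.linalg.eig` with the inductor backend, which previously tripped over the meta kernel's row-major strides.

```python
import torch

def eig_fn(x):
    return torch.linalg.eig(x)

# Previously this compile path failed; the meta kernel now reports the
# column-major strides that the LAPACK-backed kernel actually produces.
compiled = torch.compile(eig_fn, backend="inductor")
eigenvalues, eigenvectors = compiled(torch.randn(4, 4))
```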

Pull Request resolved: pytorch#162484
Approved by: https://github.com/isuruf
…ects (pytorch#166917)

Fixes pytorch#166900

Implementation notes:
- I tried to disallow guard generation before side effect application in order to future-proof against improper guard generation. However, this was not feasible, since it is possible to realize lazy VTs while generating side effects (e.g. realizing a constant variable that is used in a deque update).
- `codegen_save_tempvars` now generates `TempLocalSource` for created temporary variables, so that they won't get confused with `LocalSource`; we should error out when we attempt to create guards for a `TempLocalSource`. I considered using `SyntheticLocalSource`, but that has additional `subguards_allowed` behavior that we may not want for temp variables.
- We moved the guard installation for constant user-defined pytree objects from `as_python_constant` to `__init__`. Objects created outside the compile-region will be guarded, while objects created inside the compile-region will not be guarded.

Pull Request resolved: pytorch#166917
Approved by: https://github.com/anijain2305
Slice knows how to handle an unbacked start, so we do not need to offset start before calling slice; we can leave that to slice.
The only edge case is when start < 0 and start + length == 0: in that case slice and narrow would deviate,
so for that case we pass dim_size instead of start + length.
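A small sketch of the edge case described above, using the public `narrow`/slicing semantics for illustration:

```python
import torch

x = torch.arange(10)

# narrow(dim, start, length) normally matches the slice [start : start + length].
assert torch.equal(x.narrow(0, 2, 3), x[2:5])

# Edge case: start < 0 and start + length == 0. The naive slice end would be 0
# (an empty result), so the end must be the dimension size instead.
assert torch.equal(x.narrow(0, -3, 3), x[-3:x.size(0)])
```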

Pull Request resolved: pytorch#166361
Approved by: https://github.com/aorenste
Adds optional "node" ids for tensors and output info annotations to DebugMode, enabled with `DebugMode(record_output=True, record_ids=True)`.

Example output for `test_debug_mode_mm`, with both enabled:
```
  torch.mm(dt$0: f32[8, 8]| S(0), dt$1: f32[8, 32]| S(0))  ->  dt$12: f32[8, 32]| S(0)
    aten::mm(dt$2: f32[8, 8]| S(0), dt$3: f32[8, 32]| S(0))
      redistribute_input(1, S(0) -> R)
        redistribute_input(t$4: f32[1, 32], trace: S(0)->R)
          _c10d_functional::all_gather_into_tensor(t$5: f32[1, 32], 8, 0)  ->  t$6: f32[8, 32]
          _c10d_functional::wait_tensor(t$7: f32[8, 32])  ->  t$8: f32[8, 32]
      aten::mm(t$9: f32[1, 8], t$10: f32[8, 32])  ->  t$11: f32[1, 32]
  <method 'sum' of 'torch._C.TensorBase' objects>(dt$13: f32[8, 32]| S(0))  ->  dt$17: f32[]| P
    aten::sum(dt$14: f32[8, 32]| S(0))
      aten::sum(t$15: f32[1, 32])  ->  t$16: f32[]"""
```

Sadly the only way to get DTensor op outputs is to set `record_torchfunction=True`, as dispatch calls just defer to DTensor's dispatch logic.
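A hedged usage sketch with plain tensors, assuming `DebugMode` is importable from `torch.utils._debug_mode` and exposes `debug_string()` (both internal and subject to change):

```python
import torch
from torch.utils._debug_mode import DebugMode  # assumed module path (internal API)

x, y = torch.randn(8, 8), torch.randn(8, 32)
with DebugMode(record_torchfunction=True, record_output=True, record_ids=True) as debug_mode:
    torch.mm(x, y).sum()

# With record_ids=True each tensor gets a $-suffixed id; with record_output=True
# each recorded call is annotated with its output.
print(debug_mode.debug_string())
```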

Pull Request resolved: pytorch#165076
Approved by: https://github.com/zpcore
```python
python test/test_fx.py -k profiler
```

Inserts `torch._C._profiler._RecordFunctionFast` calls into fx graph codegen.

We post-process the profiler dump using `map_recorded_events_to_aten_ops_with_stack_trace` to add the stack trace to the dumped trace.

`map_recorded_events_to_aten_ops_with_stack_trace` queries `fx.traceback._FX_METADATA_REGISTRY` for node metadata. Each graph module has a hashed fake file name (e.g. `fx_generated__iv4zodvbcmdkhx77jrg7h2f2opebujhfmc6tf6nx7vioq244baw.py`), which is the key into the registry.

One can call `fx_g.enrich_profiler_metadata()` to add debugging info, or `fx_g.enrich_profiler_metadata(enable=False)` to remove it.

`aot_eager` calls `fx_g.enrich_profiler_metadata()` if `TORCH_ENRICH_RPOFILER_STACK_TRACE` is set or `_dynamo.config.enrich_profiler_metadata=True`.
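A hedged sketch of the manual flow described above (the profiler post-processing helper is omitted, since its exact signature is not shown here):

```python
import torch
from torch.profiler import ProfilerActivity, profile

def fn(a, b):
    return (a * b).sin() + 5

gm = torch.fx.symbolic_trace(fn)
gm.enrich_profiler_metadata()  # regenerate codegen with _RecordFunctionFast markers

with profile(activities=[ProfilerActivity.CPU]) as prof:
    gm(torch.randn(4), torch.randn(4))
prof.export_chrome_trace("fx_trace.json")  # post-process this dump as described above
```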

<img width="1188" height="565" alt="Screenshot 2025-10-31 at 4 40 52 PM" src="https://github.com/user-attachments/assets/41e8113f-3e6d-439b-bffd-cfbf0c03a47a" />

Example of the generated code:
```
def forward(self, args_list):
    args_iter = iter(args_list)
    arg0_1 = next(args_iter)
    arg1_1 = next(args_iter)
    args_list.clear()
    _rf = torch._C._profiler._RecordFunctionFast('## fx_generated__iv4zodvbcmdkhx77jrg7h2f2opebujhfmc6tf6nx7vioq244baw.py ##'); _rf.__enter__()
    repeated_subgraph0 = self.repeated_subgraph0
    _rf_invoke_subgraph = torch._C._profiler._RecordFunctionFast('## 3 ##'); _rf_invoke_subgraph.__enter__()
    invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', arg0_1, arg1_1);  repeated_subgraph0 = arg0_1 = arg1_1 = None
    _rf_invoke_subgraph.__exit__(None, None, None)
    _rf_getitem = torch._C._profiler._RecordFunctionFast('## 4 ##'); _rf_getitem.__enter__()
    getitem = invoke_subgraph[0];  invoke_subgraph = None
    _rf_getitem.__exit__(None, None, None)
    return (getitem,)
    _rf.__exit__(None, None, None)

def forward(self, arg0_1, arg1_1):
    _rf = torch._C._profiler._RecordFunctionFast('## fx_generated__ozpadpj5cxoalxeyopej33g2vvtvhxg4xsk7bhx7ldmcibtybyn.py ##'); _rf.__enter__()
    _rf_mul = torch._C._profiler._RecordFunctionFast('## 2 ##'); _rf_mul.__enter__()
    mul = torch.ops.aten.mul.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    _rf_mul.__exit__(None, None, None)
    _rf_sin = torch._C._profiler._RecordFunctionFast('## 3 ##'); _rf_sin.__enter__()
    sin = torch.ops.aten.sin.default(mul);  mul = None
    _rf_sin.__exit__(None, None, None)
    _rf_add = torch._C._profiler._RecordFunctionFast('## 4 ##'); _rf_add.__enter__()
    add = torch.ops.aten.add.Tensor(sin, 5);  sin = None
    _rf_add.__exit__(None, None, None)
    return (add,)
    _rf.__exit__(None, None, None)

```

Pull Request resolved: pytorch#166677
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#166676
…ytorch#166993)

The combo kernel pass warns for long reductions and large pointwise ops. This becomes too spammy for users such as vLLM.

This PR moves these logs from warn to debug. I validated that the spammy log is removed on Llama-3.1-8B.

Pull Request resolved: pytorch#166993
Approved by: https://github.com/zou3519, https://github.com/eellison
This reverts commit e8052f2.

Reverted pytorch#166677 on behalf of https://github.com/malfet due to Broke lint, please rebase, we've moved from mypy to pyrefly ([comment](pytorch#166677 (comment)))
cyyever and others added 25 commits November 6, 2025 03:36
This PR continues to fix or remove unused loop variables in tests.
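A generic illustration of the kind of cleanup (not the actual test changes):

```python
def do_work() -> None:
    pass

# Before: the loop variable is bound but never read, which linters flag.
for i in range(3):
    do_work()

# After: `_` makes the intent explicit.
for _ in range(3):
    do_work()
```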

Pull Request resolved: pytorch#167043
Approved by: https://github.com/Lucaskabela
…rgs (pytorch#166368)

Intended to make it easier to reuse this logic for processing operator arguments as IValues in following PR(s).

Testing: python test/test_python_dispatch.py (broke during development, seems to work now)
Pull Request resolved: pytorch#166368
Approved by: https://github.com/albanD
Previously the log only printed if the default implementation for an
action was used; now it prints before dispatching to custom registered
actions.

Tested by running on autoparallel graph runner and observing forward
pass action logged

Pull Request resolved: pytorch#167113
Approved by: https://github.com/sanketpurandare, https://github.com/Skylion007
This PR adds return types of some Python functions. Most of them return `None`. The types were added automatically by ruff `ANN` rules.
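A generic before/after illustration of what the ruff `ANN` rules add (the function names are made up):

```python
from typing import Any

# Before: the return type is implicit.
def log_event(name: str, payload: dict[str, Any]):
    print(name, payload)

# After: the (usually `None`) return type is spelled out.
def log_event_annotated(name: str, payload: dict[str, Any]) -> None:
    print(name, payload)
```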

Pull Request resolved: pytorch#167162
Approved by: https://github.com/Lucaskabela
…7351) (pytorch#166622)

### Summary
Adds a debug-level logging statement to torch.fx.Interpreter.run_node, as proposed in [pytorch#117351](pytorch#117351), to make FX graph execution traceable when debugging or instrumenting model transformations.

When debug logging is enabled, each executed node emits a single structured log line formatted via `LazyString(lambda: n.format_node())`, deferring string construction unless logging is active.
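A hedged sketch of enabling the new log line; the logger name below is an assumption (the module-level logger for `torch.fx.interpreter`):

```python
import logging
import torch

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("torch.fx.interpreter").setLevel(logging.DEBUG)  # assumed logger name

def f(x):
    return torch.clamp(x + 1, min=0.0, max=5.0)

gm = torch.fx.symbolic_trace(f)
torch.fx.Interpreter(gm).run(torch.randn(4))  # each node now emits a debug log line
```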

### Example Output
With `logging.DEBUG` enabled:

```
run_node x = x()
run_node add = _operator.add(x, 1)
run_node clamp = torch.clamp(add, min=0.0, max=5.0)
run_node output = output(clamp)
```

With `logging.DEBUG` disabled, no additional output is produced (unchanged default behavior).

### Test Plan

Verified locally with Python 3.11 on macOS using a PyTorch build from source.

- With `logging.DEBUG` enabled: each node emits a debug log via LazyString.
- With `logging.DEBUG` disabled: no additional output.
- Confirmed all `Interpreter` tests pass locally:
`pytest test/test_fx.py -k "Interpreter"`

Updated the example output to reflect the new `_format_fx_node` helper and inclusion of `kwargs`.

Pull Request resolved: pytorch#166622
Approved by: https://github.com/aorenste
…167000)

The previous PR was not enough to prevent errors caused by cpython dynamo tests in 3.14
Pull Request resolved: pytorch#167000
Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas
…torch#166699)

Exposes this flag, as some upstream frameworks (like vLLM) could benefit from knowing whether torch.compile caches are enabled in order to adjust their own caching behavior.

Pull Request resolved: pytorch#166699
Approved by: https://github.com/oulgen, https://github.com/mlazos
To pick up a single change, pytorch/tensorpipe@2b4cd91, which should fix compilation errors with clang-21.
Pull Request resolved: pytorch#167108
Approved by: https://github.com/Skylion007
Add Inline Fusion Support for Custom Op Autotuning
--------------------------------------------------

This PR extends PyTorch Inductor's custom op autotuning with inline fusion capabilities, enabling the winning decomposition to be inlined directly into the computation graph for fusion with surrounding operations.

### Usage

```python

def decompose_k_implementation(
    a: torch.Tensor, b: torch.Tensor, k_splits: int = 4
) -> torch.Tensor:
    """Matrix multiply with k-way decomposition."""
    ...

@torch.library.custom_op("my_lib::matmul_relu", mutates_args={})
def custom_matmul_relu_dk(
    a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
    return torch.relu(decompose_k_implementation(a, b, k_splits))

register_custom_op_autotuning(
    custom_op=custom_matmul_relu_dk,
    configs=[
        CustomOpConfig(k_splits=2),
        CustomOpConfig(k_splits=4),
        CustomOpConfig(k_splits=8),
        CustomOpConfig(k_splits=32),
        CustomOpConfig(k_splits=64),
    ],
    name="decompose_k_autotuned",
    input_gen_fns={
        "a": lambda fake: torch.randn_like(fake, device='cuda'),
        "b": lambda fake: torch.randn_like(fake, device='cuda'),
    }
)
```

### How It Works
Enables optimizations from Inductor by inlining the best decomposition, allowing fusion with surrounding elementwise operations and other graph-level optimizations. This potentially provides better performance and memory efficiency.
During the custom op autotuning phase, we still benchmark all CustomOpConfigs to find the fastest implementation. Then, during inline fusion, Inductor inlines the winning decomposition into the main graph, converting it into individual ComputedBuffer IR nodes (fusable). At the end, Inductor automatically fuses the inlined operations with surrounding elementwise ops (e.g., bias add, ReLU, scaling). Note that the winning choice must be a SubgraphChoiceCaller (decomposition-based) rather than an ExternKernelChoice for inlining to work; if an ExternKernelChoice is returned, no inlining happens.

### Performance Results
Benchmarked on a matmul+relu workload with decompose-k fusion (H100 GPU, 15 test shapes):
<img width="782" height="377" alt="Screenshot 2025-11-04 at 12 43 11 AM" src="https://github.com/user-attachments/assets/22131d4c-a8ce-4f55-bdcd-ac758ddad8cd" />

Metric | Result
-- | --
Average Speedup vs ATen | 1.28x
Max Speedup vs ATen | 1.41x


The performance comparisons are detailed in the plots below. In most cases, inline fusion achieves better performance than the ATen baseline and the current torch.compile.
<img width="4874" height="3545" alt="image" src="https://github.com/user-attachments/assets/190a1233-412f-4f34-84cd-9b7cb582f504" />

**Test**: `test_decompose_k_with_fusion` demonstrates decompose-k with inline fusion enabled.

--------------

### Integration into mm.py decomposeK with a config flag enable_inline_subgraph_fusion=True (deprecated to avoid breaking async compilation; already removed from this PR)
FP32:
<img width="738" height="357" alt="Screenshot 2025-11-04 at 12 05 08 AM" src="https://github.com/user-attachments/assets/ee421d22-c426-42f2-8dcd-4dcc547d6219" />
FP16:
<img width="769" height="403" alt="Screenshot 2025-11-04 at 12 13 49 AM" src="https://github.com/user-attachments/assets/346d1ffc-15af-40b0-9378-cf9b297711c2" />

The TCF column represents torch compile fusion, which is close to custom_op decomposek. The difference might be due to different candidate k values.

#### Usage:
Note: this only applies when `benchmark_epilogue_fusion` is disabled, i.e., when not using multi-template buffers.

```python
# Assumes `a`, `b` are CUDA tensors and `config` is torch._inductor.config.

# Define the matmul+relu function
def matmul_relu(x, y):
    return torch.nn.functional.relu(torch.matmul(x, y))

# Compile with inline subgraph fusion enabled
@torch.compile
def compiled_matmul_relu(x, y):
    return matmul_relu(x, y)

# Reset dynamo to ensure clean compilation
torch._dynamo.reset()

with config.patch(
    {
        "max_autotune": True,
        # CRITICAL: These two flags enable inline subgraph fusion
        "benchmark_epilogue_fusion": False,  # Must be False for inline fusion!
        "enable_inline_subgraph_fusion": True,  # Enable inline fusion
    }
):
    # Compile and run
    result = compiled_matmul_relu(a, b)
    torch.cuda.synchronize()
```

Pull Request resolved: pytorch#165952
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Error in Helion CI's AMD job: https://github.com/pytorch/helion/actions/runs/19118581048/job/54633730633
```
>                   (binary.metadata.num_ctas, *binary.metadata.cluster_dims)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    if hasattr(binary, "metadata")
                    else ()
                )
            ),
            "function": get_first_attr(binary, "function", "cu_function"),
            "runner": get_first_attr(binary, "run", "c_wrapper"),
            "math": math_lib,
            "torch": torch_lib,
            "triton": triton_lib,
        }
E       torch._inductor.exc.InductorError: AttributeError: 'KernelMetadata' object has no attribute 'cluster_dims'
```

Pull Request resolved: pytorch#167187
Approved by: https://github.com/oulgen
This PR uses `key in dict` expressions for existence checks of dict elements in Python code. This operation is more efficient than `key in dict.keys()`.
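For illustration, the two forms being compared:

```python
d = {"a": 1, "b": 2}

# Preferred: the membership test goes straight to the dict's hash table.
if "a" in d:
    print(d["a"])

# Equivalent but noisier: builds a keys view object before the same lookup.
if "a" in d.keys():
    print(d["a"])
```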

Pull Request resolved: pytorch#167174
Approved by: https://github.com/mlazos
This PR continues to fix or remove unused loop variables in tests.

Pull Request resolved: pytorch#166921
Approved by: https://github.com/mlazos
…ass (pytorch#167080)

Split of pytorch#162469 to be under 2K
reorder iterative part

Pull Request resolved: pytorch#167080
Approved by: https://github.com/eellison
…167089)

Summary: Add MTIA as a native device type in PyTorch.

Test Plan: CI

Reviewed By: PatriceVignola

Differential Revision: D80111801

Pull Request resolved: pytorch#167089
Approved by: https://github.com/andyanwang, https://github.com/nautsimon, https://github.com/albanD
…61035)

torch.cuda.memory.set_per_process_memory_fraction allows setting
an upper bound on how much device memory is allocated. This PR
exposes this setting via an environment variable.

For example, PYTORCH_CUDA_ALLOC_CONF="per_process_memory_fraction:0.5"
will limit the device memory to half of the available memory.
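A minimal sketch of using the new knob; the variable must be set before CUDA is initialized in the process:

```python
import os

# Cap allocations at 50% of device memory, equivalent to calling
# torch.cuda.memory.set_per_process_memory_fraction(0.5).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "per_process_memory_fraction:0.5"

import torch

if torch.cuda.is_available():
    x = torch.empty(1024, device="cuda")  # allocations beyond the cap will raise OOM
```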

Pull Request resolved: pytorch#161035
Approved by: https://github.com/ngimel, https://github.com/eqy
…rch#167081)

Split of pytorch#162469 to be under 2K
reorder iterative part

Pull Request resolved: pytorch#167081
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#167080
I.e. remove the distinction between the two cases and always preload the full set of libraries.
For some reason, when one uses `virtualenv` instead of `venv`,
preloading `cudart` works, but it fails to find cudnn or cublasLT later on.

Fix it by getting rid of the partial preload logic for one of the cases and always preloading the full set of libraries.

Test plan on stock Ubuntu:
```
pip install virtualenv
virtualenv --symlinks -p python3.11 --prompt virtv venv-virt
source venv-virt/bin/activate
pip install torch
python -c 'import torch'
```

Fixes pytorch#165812
Pull Request resolved: pytorch#167046
Approved by: https://github.com/atalman
…n weight is sparse (pytorch#166071)

As per title.

It seems safe to generalize to arbitrary contiguous inputs, since `at::matmul` is likely to do the flattening to avoid `baddmm`.

Additionally, we guard for the bias to be 1D and contiguous, which guarantees it is fused with no copies.

Pull Request resolved: pytorch#166071
Approved by: https://github.com/ngimel
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
rocm-repo-management-api bot commented Nov 6, 2025

Jenkins build for d81ea9c2635396625aef23cbae9ff6f6e373df0c commit finished as FAILURE
