forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251106 #2789
Open: pragupta wants to merge 131 commits into develop from develop_IFU_20251106
+11,686
−3,731
Conversation
Summary: Pull in `f4f4bf16` from FBGemm to provide MXFP4 support for CUDA, and add testing. Signed-off-by: Simon Layton <[email protected]> Pull Request resolved: pytorch#166526 Approved by: https://github.com/drisspg, https://github.com/ngimel
Results from CI: no failures, but jobs generally take longer, maybe a ~20% increase in time. However, the smaller runner costs ~25% as much as the current runner, so in terms of cost this is a decrease. If the 20% is too much, we can try the 4x larger runners, which are about half the cost of the current runner, so that would probably still result in cost savings with less impact on time. Pull Request resolved: pytorch#164989 Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn
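The cost trade-off above works out roughly as follows (a back-of-the-envelope sketch using the approximate figures quoted in the message; exact runner prices are not stated here):
```python
# Rough relative per-job cost on the smaller runner (illustrative numbers only).
baseline_cost = 1.0                 # current runner: baseline time x baseline price
smaller_runner_cost = 1.20 * 0.25   # ~20% more time, priced at ~25% of the current runner
print(smaller_runner_cost)          # ~0.30 -> roughly a 70% per-job cost reduction
```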
You can just subtract timestamps, but this makes it easier Pull Request resolved: pytorch#166447 Approved by: https://github.com/Skylion007
We store a mapping between generated fx graph code and the original model code stack trace in `fx.traceback._FX_METADATA_REGISTRY`, and we post-process the memory snapshot to append the original model stack trace information.
To achieve this, the biggest change we had to make in `aot_eager` mode is to give each generated fx graph a unique stack trace, i.e. it cannot just be `<eval_with_key>`. We set `co_filename` to **pretend** that the code comes from that file. Now, instead of `<eval_with_key>` in the stack trace, we get something like `fx_generated_3a4b5c6d7e8f9a0.py`.
`augment_with_fx_traces` arg is added to `torch.cuda.memory._snapshot` and `_dump_snapshot`. When the arg is set to True, a post-processing will run to populate the original model stack trace to the snapshot frames.
The new behavior of GraphModule can be controlled by `TORCH_ENRICH_RPOFILER_STACK_TRACE` or `_dynamo.config.enrich_profiler_metadata=True`.
Alternative:
Instead of setting co_filename, we can also do it like below:
Note that if we do it this way, we will need to dump the file to make the graph module torch-scriptable. TorchScript requires source access in order to carry out compilation, so we need to make sure original .py files are available.
```
# Register the generated source with linecache under a synthetic filename
# so the source stays retrievable (e.g. for TorchScript and tracebacks).
key = filename
globals_copy = globals.copy()
globals_copy["__file__"] = key
globals_copy["__name__"] = key
linecache.lazycache(key, globals_copy)
# Compile the generated source against the synthetic filename and execute it.
exec(compile(src, key, "exec"), globals)
```
Other changes:
- Update `MemoryViz.js` to display fx node information and the original model code if they exist
```
python test/test_fx.py -k test_lineno_map
python test/test_fx.py -k test_custom_traceback_raised
python test/test_public_bindings.py
python test/test_cuda.py -k test_fx_memory
python test/test_fx.py -k test_informative_co_filename
python test/test_fx.py -k test_autowrap_functions
python test/dynamo/test_utils.py -k test_inductor_provenance
```
```python
# Profile with memory snapshot
torch.cuda.memory._record_memory_history()
with torch._dynamo.config.patch("enrich_profiler_stack_trace", True):
compiled = torch.compile(mod, backend="aot_eager", fullgraph=True)
result = compiled(torch.randn(10, 10, device="cuda:0"))
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle", augment_with_fx_traces=True)
torch.cuda.memory._record_memory_history(enabled=None)
```
<img width="913" height="711" alt="Screenshot 2025-10-30 at 10 40 44 AM" src="https://github.com/user-attachments/assets/8d7a1833-f98d-4756-b666-1d63ab57b27b" />
Pull Request resolved: pytorch#166676
Approved by: https://github.com/albanD, https://github.com/ezyang
Instead of the `(void) foo; // Unused parameter` trick, as this is a C++17 standard feature. Will replace further repetitions of the same pattern soon after. Pull Request resolved: pytorch#166865 Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007, https://github.com/janeyx99
…uction consumer (pytorch#166165) Prefer unfused addmm when there is at least one elementwise/reduction consumer. Pull Request resolved: pytorch#166165 Approved by: https://github.com/eellison
…66467) This adds the capability to the subproc pool to enable quiesce via a timer. Pull Request resolved: pytorch#166467 Approved by: https://github.com/masnesral
The deprecation warning led to warning spamming in PyTorch APIs, like torch.compile. This is not how a deprecation warning should go: if we add a deprecation warning, we'd better update our built-in APIs to prevent warning spam. Pull Request resolved: pytorch#166956 Approved by: https://github.com/albanD
Fixed some syntax errors in the SECURITY.md file, including PyTorch capitalization problems, grammatical inconsistencies, etc. Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#166718 Approved by: https://github.com/mikaylagawarecki
…#166961) This is a PR to temporarily relieve the queueing that is caused by an mi250 node outage. See this ticket for more information: pytorch#166866 It relaxes the GPU count check to allow distributed jobs to run on 2-GPU runners Pull Request resolved: pytorch#166961 Approved by: https://github.com/jeffdaily
… in CI (pytorch#165922) Fix and regression test for pytorch#165801 Pull Request resolved: pytorch#165922 Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/Skylion007, https://github.com/drisspg Co-authored-by: Nikita Shulga <[email protected]> Co-authored-by: Andrey Talman <[email protected]>
Pull Request resolved: pytorch#166976 Approved by: https://github.com/maggiemoss, https://github.com/Skylion007
…#166768) And simplify the entire function to just assert and return Pull Request resolved: pytorch#166768 Approved by: https://github.com/cyyever, https://github.com/atalman
Draft to expose the compiled saved-tensor-hook context so hooks can be applied selectively. Exposes node, fw_graph, bw_graph. Pull Request resolved: pytorch#166887 Approved by: https://github.com/bdhirsh
…h#165036)" This reverts commit 0e1a889. Reverted pytorch#165036 on behalf of https://github.com/atalman due to regressed vllm signal: [GH job link](https://github.com/pytorch/pytorch/actions/runs/19059329909/job/54439919668) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/0e1a88904f4a5e30634b196678b56e1d6ec074f5) ([comment](pytorch#165036 (comment)))
…166669) For mix-order reduction, we currently force XBLOCK to be 1 to simplify codegen. Don't tune it in CDT. Differential Revision: [D86224689](https://our.internmc.facebook.com/intern/diff/D86224689) Pull Request resolved: pytorch#166669 Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/eellison, https://github.com/v0i0
Noticed that workflow runs for `trunk/{sha}` tags (issued by autorevert) don't populate the test_run_s3 ClickHouse table.
This PR addresses that by changing the gate condition for uploading test stats.
See https://github.com/pytorch/pytorch/actions/runs/19054297956/job/54421254448#step:8:23
as evidence that HEAD_BRANCH is correctly populated for trunk tags.
Pull Request resolved: pytorch#166916
Approved by: https://github.com/huydhn, https://github.com/clee2000
This includes the sm103 fix triton-lang/triton#8485. Pull Request resolved: pytorch#166968 Approved by: https://github.com/Lucaskabela, https://github.com/njriasan
…ytorch#166973) Pull Request resolved: pytorch#166973 Approved by: https://github.com/eellison, https://github.com/jathu
So many times I build PyTorch only to notice Chef nuked my nvcc and I wasted 30 minutes building a CPU version; let's hard-error fast. Pull Request resolved: pytorch#166982 Approved by: https://github.com/malfet ghstack dependencies: pytorch#166976
…166581) The major change is to switch to a timer-based implementation. Additionally, we get rid of the context manager for turning off the compile pool. We still have the warmup calls. Note that this only modifies the async_compile methods; the fx pool is left running. Pull Request resolved: pytorch#166581 Approved by: https://github.com/masnesral ghstack dependencies: pytorch#166467
Fixes pytorch#159445
### Summary
- Fixed a stride layout issue in the `torch.linalg.eig` meta kernel that prevented successful compilation with the inductor backend. The meta kernel was producing incorrect row-major strides.
- LAPACK/BLAS libraries (the underlying implementation) expect column-major layout.

Pull Request resolved: pytorch#162484 Approved by: https://github.com/isuruf
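As a rough illustration of the layout distinction above (a minimal sketch, not the actual meta-kernel code): for an n x n matrix, column-major output means strides of (1, n) rather than the default row-major (n, 1).
```python
import torch

n = 4
# Default (row-major) layout: strides (n, 1)
row_major = torch.empty(n, n)
# Column-major (Fortran-order) layout, as LAPACK expects: strides (1, n)
col_major = torch.empty_strided((n, n), (1, n))

print(row_major.stride())  # (4, 1)
print(col_major.stride())  # (1, 4)
```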
…ects (pytorch#166917) Fixes pytorch#166900 Implementation notes:
- I tried to disallow guard generation before side effect application in order to future-proof against improper guard generation. However, this was not feasible, since it is possible to realize lazy VTs while generating side effects (e.g. realizing a constant variable that is used in a deque update).
- `codegen_save_tempvars` now generates `TempLocalSource` for temporary variables, so that they won't get confused with `LocalSource` - we should error out when we attempt to create guards for `TempLocalSource`. I considered using `SyntheticLocalSource`, but that has additional `subguards_allowed` behavior that we may not want to have for temp variables.
- We moved the guard installation for constant user-defined pytree objects from `as_python_constant` to `__init__`. Objects created outside the compile region will be guarded, while objects created inside the compile region will not be guarded.

Pull Request resolved: pytorch#166917 Approved by: https://github.com/anijain2305
Slice knows how to handle an unbacked start, so we do not need to offset start before calling slice; we can leave that to slice. The only edge case is when start < 0 and start + length == 0: in that case slice and narrow would deviate, so we pass dim_size instead of start + length. Pull Request resolved: pytorch#166361 Approved by: https://github.com/aorenste
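A minimal sketch of the decomposition described above (illustrative only; the helper name and exact call are assumptions, not the code in this PR):
```python
import torch

def narrow_via_slice(x: torch.Tensor, dim: int, start: int, length: int) -> torch.Tensor:
    # Pass start through to slice unchanged; slice handles negative/unbacked starts.
    dim_size = x.size(dim)
    # Edge case: start < 0 and start + length == 0 means "through the end of the dim",
    # but slice would treat end == 0 as an empty range, so use dim_size instead.
    end = dim_size if (start < 0 and start + length == 0) else start + length
    return torch.ops.aten.slice.Tensor(x, dim, start, end)

x = torch.arange(10)
assert torch.equal(narrow_via_slice(x, 0, -3, 3), torch.narrow(x, 0, -3, 3))
assert torch.equal(narrow_via_slice(x, 0, 2, 4), torch.narrow(x, 0, 2, 4))
```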
In particular, the job identifier can contain spaces, so it needs to be quoted. Fixes e.g. https://github.com/pytorch/pytorch/actions/runs/19063797853/job/54449422160#step:15:52 Pull Request resolved: pytorch#166955 Approved by: https://github.com/Skylion007
Adds optional "node" ids for tensors and output info annotations to DebugMode, via `DebugMode(record_output=True, record_ids=True)`
Example output for `test_debug_mode_mm`, with both enabled:
```
torch.mm(dt$0: f32[8, 8]| S(0), dt$1: f32[8, 32]| S(0)) -> dt$12: f32[8, 32]| S(0)
aten::mm(dt$2: f32[8, 8]| S(0), dt$3: f32[8, 32]| S(0))
redistribute_input(1, S(0) -> R)
redistribute_input(t$4: f32[1, 32], trace: S(0)->R)
_c10d_functional::all_gather_into_tensor(t$5: f32[1, 32], 8, 0) -> t$6: f32[8, 32]
_c10d_functional::wait_tensor(t$7: f32[8, 32]) -> t$8: f32[8, 32]
aten::mm(t$9: f32[1, 8], t$10: f32[8, 32]) -> t$11: f32[1, 32]
<method 'sum' of 'torch._C.TensorBase' objects>(dt$13: f32[8, 32]| S(0)) -> dt$17: f32[]| P
aten::sum(dt$14: f32[8, 32]| S(0))
aten::sum(t$15: f32[1, 32]) -> t$16: f32[]
```
Sadly the only way to get DTensor op outputs is to set `record_torchfunction=True`, as dispatch calls just defer to DTensor's dispatch logic.
Pull Request resolved: pytorch#165076
Approved by: https://github.com/zpcore
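A minimal usage sketch for the flags described above (the constructor arguments come from this description; the module path and the `debug_string()` accessor are assumptions about where DebugMode currently lives):
```python
import torch
from torch.utils._debug_mode import DebugMode  # assumed location of DebugMode

# Record tensor ids and op outputs while running a simple matmul.
with DebugMode(record_output=True, record_ids=True) as dm:
    a = torch.randn(8, 8)
    b = torch.randn(8, 32)
    torch.mm(a, b)

print(dm.debug_string())  # assumed accessor for the recorded log
```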
```
python test/test_fx.py -k profiler
```
Insert `torch._C._profiler._RecordFunctionFast` calls into fx graph codegen. We post-process the profiler dump using `map_recorded_events_to_aten_ops_with_stack_trace` to add the stack traces to the dumped trace. `map_recorded_events_to_aten_ops_with_stack_trace` queries `fx.traceback._FX_METADATA_REGISTRY` for node metadata. Each graph module has a hashed fake file name (e.g. `fx_generated__iv4zodvbcmdkhx77jrg7h2f2opebujhfmc6tf6nx7vioq244baw.py`), which is the key into the registry.

One can call `fx_g.enrich_profiler_metadata()` to add debugging info, or `fx_g.enrich_profiler_metadata(enable=False)` to remove it. `aot_eager` calls `fx_g.enrich_profiler_metadata()` if TORCH_ENRICH_RPOFILER_STACK_TRACE is set or _dynamo.config.enrich_profiler_metadata=True.
<img width="1188" height="565" alt="Screenshot 2025-10-31 at 4 40 52 PM" src="https://github.com/user-attachments/assets/41e8113f-3e6d-439b-bffd-cfbf0c03a47a" />
Example of the generated code:
```
def forward(self, args_list):
    args_iter = iter(args_list)
    arg0_1 = next(args_iter)
    arg1_1 = next(args_iter)
    args_list.clear()
    _rf = torch._C._profiler._RecordFunctionFast('## fx_generated__iv4zodvbcmdkhx77jrg7h2f2opebujhfmc6tf6nx7vioq244baw.py ##'); _rf.__enter__()
    repeated_subgraph0 = self.repeated_subgraph0
    _rf_invoke_subgraph = torch._C._profiler._RecordFunctionFast('## 3 ##'); _rf_invoke_subgraph.__enter__()
    invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', arg0_1, arg1_1); repeated_subgraph0 = arg0_1 = arg1_1 = None
    _rf_invoke_subgraph.__exit__(None, None, None)
    _rf_getitem = torch._C._profiler._RecordFunctionFast('## 4 ##'); _rf_getitem.__enter__()
    getitem = invoke_subgraph[0]; invoke_subgraph = None
    _rf_getitem.__exit__(None, None, None)
    return (getitem,)
    _rf.__exit__(None, None, None)

def forward(self, arg0_1, arg1_1):
    _rf = torch._C._profiler._RecordFunctionFast('## fx_generated__ozpadpj5cxoalxeyopej33g2vvtvhxg4xsk7bhx7ldmcibtybyn.py ##'); _rf.__enter__()
    _rf_mul = torch._C._profiler._RecordFunctionFast('## 2 ##'); _rf_mul.__enter__()
    mul = torch.ops.aten.mul.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
    _rf_mul.__exit__(None, None, None)
    _rf_sin = torch._C._profiler._RecordFunctionFast('## 3 ##'); _rf_sin.__enter__()
    sin = torch.ops.aten.sin.default(mul); mul = None
    _rf_sin.__exit__(None, None, None)
    _rf_add = torch._C._profiler._RecordFunctionFast('## 4 ##'); _rf_add.__enter__()
    add = torch.ops.aten.add.Tensor(sin, 5); sin = None
    _rf_add.__exit__(None, None, None)
    return (add,)
    _rf.__exit__(None, None, None)
```
Pull Request resolved: pytorch#166677 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#166676
…ytorch#166993) Combo kernels warn for long reductions and large pointwise ops. This becomes too spammy for users such as vLLM. This PR moves these logs from warn to debug. I validated that the spammy log is removed on llama-3.1-8B. Pull Request resolved: pytorch#166993 Approved by: https://github.com/zou3519, https://github.com/eellison
This reverts commit e8052f2. Reverted pytorch#166677 on behalf of https://github.com/malfet due to Broke lint, please rebase, we've moved from mypy to pyrefly ([comment](pytorch#166677 (comment)))
pytorch#158081 Pull Request resolved: pytorch#166379 Approved by: https://github.com/Lucaskabela ghstack dependencies: pytorch#166361
This PR continues to fix or remove unused loop variables in tests. Pull Request resolved: pytorch#167043 Approved by: https://github.com/Lucaskabela
…rgs (pytorch#166368) Intended to make it easier to reuse this logic for processing operator arguments as IValues in following PR(s). Testing: python test/test_python_dispatch.py (broke during development, seems to work now) Pull Request resolved: pytorch#166368 Approved by: https://github.com/albanD
Previously the log only printed if the default implementation for an action was used; now it prints before dispatching to custom registered actions. Tested by running the autoparallel graph runner and observing the forward-pass action logged. Pull Request resolved: pytorch#167113 Approved by: https://github.com/sanketpurandare, https://github.com/Skylion007
This PR adds return types of some Python functions. Most of them return `None`. The types were added automatically by ruff `ANN` rules. Pull Request resolved: pytorch#167162 Approved by: https://github.com/Lucaskabela
…7351) (pytorch#166622)
### Summary
Adds a debug-level logging statement to torch.fx.Interpreter.run_node, as proposed in [pytorch#117351](pytorch#117351), to make FX graph execution traceable when debugging or instrumenting model transformations. When debug logging is enabled, each executed node emits a single structured log line formatted via `LazyString(lambda: n.format_node())`, deferring string construction unless logging is active.
### Example Output
With `logging.DEBUG` enabled:
```
run_node x = x()
run_node add = _operator.add(x, 1)
run_node clamp = torch.clamp(add, min=0.0, max=5.0)
run_node output = output(clamp)
```
With `logging.DEBUG` disabled, no additional output is produced (unchanged default behavior).
### Test Plan
Verified locally with Python 3.11 on macOS using a PyTorch build from source.
- With `logging.DEBUG` enabled: each node emits a debug log via LazyString.
- With `logging.DEBUG` disabled: no additional output.
- Confirmed all `Interpreter` tests pass locally: `pytest test/test_fx.py -k "Interpreter"`

Updated the example output to reflect the new `_format_fx_node` helper and inclusion of `kwargs`. Pull Request resolved: pytorch#166622 Approved by: https://github.com/aorenste
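To reproduce the run_node lines locally, something like the following should work (a minimal sketch; the exact logger name used by torch.fx is an assumption here, since the statement lives in the Interpreter module):
```python
import logging
import torch
from torch.fx import symbolic_trace, Interpreter

# Assumed logger hierarchy: the Interpreter's module logger should sit under "torch.fx".
logging.basicConfig()
logging.getLogger("torch.fx").setLevel(logging.DEBUG)

def f(x):
    return torch.clamp(x + 1, min=0.0, max=5.0)

gm = symbolic_trace(f)
Interpreter(gm).run(torch.zeros(3))  # each executed node should be logged at DEBUG level
```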
…167000) The previous PR was not enough to prevent errors caused by cpython dynamo tests in 3.14 Pull Request resolved: pytorch#167000 Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#167031 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#167141 Approved by: https://github.com/Lucaskabela
Pull Request resolved: pytorch#167151 Approved by: https://github.com/Lucaskabela ghstack dependencies: pytorch#167141
Pull Request resolved: pytorch#167152 Approved by: https://github.com/Lucaskabela ghstack dependencies: pytorch#167141, pytorch#167151
…torch#166699) Expose this flag, as some upstream frameworks (like vLLM) could benefit from knowing whether torch.compile caches are enabled in order to adjust their own caching behavior. Pull Request resolved: pytorch#166699 Approved by: https://github.com/oulgen, https://github.com/mlazos
To pick up a single change, pytorch/tensorpipe@2b4cd91, which should fix compilation errors with clang-21. Pull Request resolved: pytorch#167108 Approved by: https://github.com/Skylion007
pytorch#158081 Pull Request resolved: pytorch#166379 Approved by: https://github.com/Lucaskabela ghstack dependencies: pytorch#166361
Add Inline Fusion Support for Custom Op Autotuning
--------------------------------------------------
This PR extends PyTorch Inductor's custom op autotuning with inline fusion capabilities, enabling the winning decomposition to be inlined directly into the computation graph for fusion with surrounding operations.
### Usage
```python
def decompose_k_implementation(
    a: torch.Tensor, b: torch.Tensor, k_splits: int = 4
) -> torch.Tensor:
    """Matrix multiply with k-way decomposition."""
    ...


@torch.library.custom_op("my_lib::matmul_relu", mutates_args={})
def custom_matmul_relu_dk(
    a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
    return torch.relu(decompose_k_implementation(a, b, k_splits))


register_custom_op_autotuning(
    custom_op=custom_matmul_relu_dk,
    configs=[
        CustomOpConfig(k_splits=2),
        CustomOpConfig(k_splits=4),
        CustomOpConfig(k_splits=8),
        CustomOpConfig(k_splits=32),
        CustomOpConfig(k_splits=64),
    ],
    name="decompose_k_autotuned",
    input_gen_fns={
        "a": lambda fake: torch.randn_like(fake, device='cuda'),
        "b": lambda fake: torch.randn_like(fake, device='cuda'),
    }
)
```
### How It Works
Enable optimizations from Inductor by inlining the best decomposition, allowing fusion with surrounding elementwise operations and other graph-level optimizations. This provides potentially better performance and memory efficiency.
During the custom op autotuning phase, we still benchmark all CustomOpConfigs to find the fastest implementation. Then, during inline fusion, Inductor inlines the winning decomposition into the main graph, converting it to individual ComputedBuffer IR nodes (fusable). Finally, Inductor automatically fuses the inlined operations with surrounding elementwise ops (e.g., bias add, ReLU, scaling). Note that the winning choice must be a SubgraphChoiceCaller (decomposition-based) rather than an ExternKernelChoice for inlining to work; if an ExternKernelChoice is returned, no inlining happens.
### Performance Results
Benchmarked on matmul+relu workload with decompose-k fusion (H100 GPU, 15 test shapes):
<img width="782" height="377" alt="Screenshot 2025-11-04 at 12 43 11 AM" src="https://github.com/user-attachments/assets/22131d4c-a8ce-4f55-bdcd-ac758ddad8cd" />
Metric | Result
-- | --
Average Speedup vs ATen | 1.28x
Max Speedup vs ATen | 1.41x
The performance comparison is detailed in the plots below. In most use cases, inline fusion achieves better performance than the ATen baseline and the current torch.compile.
<img width="4874" height="3545" alt="image" src="https://github.com/user-attachments/assets/190a1233-412f-4f34-84cd-9b7cb582f504" />
**Test**: `test_decompose_k_with_fusion` demonstrates decompose-k with inline fusion enabled.
--------------
### Integration into mm.py decomposeK with a config flag enable_inline_subgraph_fusion=True (deprecated to avoid breaking async compilation; already removed from this PR)
FP32:
<img width="738" height="357" alt="Screenshot 2025-11-04 at 12 05 08 AM" src="https://github.com/user-attachments/assets/ee421d22-c426-42f2-8dcd-4dcc547d6219" />
FP16:
<img width="769" height="403" alt="Screenshot 2025-11-04 at 12 13 49 AM" src="https://github.com/user-attachments/assets/346d1ffc-15af-40b0-9378-cf9b297711c2" />
The TCF column represents torch.compile fusion, which is close to custom_op decompose-k. The difference might be due to different candidate k values.
#### Usage:
Note: this only applies when benchmark_epilogue_fusion is off, i.e., when not using multi_template_buffer.
```python
# Define the matmul+relu function
def matmul_relu(x, y):
    return torch.nn.functional.relu(torch.matmul(x, y))


# Compile with inline subgraph fusion enabled
@torch.compile
def compiled_matmul_relu(x, y):
    return matmul_relu(x, y)


# Reset dynamo to ensure clean compilation
torch._dynamo.reset()

with config.patch(
    {
        "max_autotune": True,
        # CRITICAL: These two flags enable inline subgraph fusion
        "benchmark_epilogue_fusion": False,  # Must be False for inline fusion!
        "enable_inline_subgraph_fusion": True,  # Enable inline fusion
    }
):
    # Compile and run
    result = compiled_matmul_relu(a, b)
    torch.cuda.synchronize()
```
Pull Request resolved: pytorch#165952
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Error in Helion CI's AMD job: https://github.com/pytorch/helion/actions/runs/19118581048/job/54633730633
```
>                   (binary.metadata.num_ctas, *binary.metadata.cluster_dims)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    if hasattr(binary, "metadata")
                    else ()
                )
            ),
            "function": get_first_attr(binary, "function", "cu_function"),
            "runner": get_first_attr(binary, "run", "c_wrapper"),
            "math": math_lib,
            "torch": torch_lib,
            "triton": triton_lib,
        }
E   torch._inductor.exc.InductorError: AttributeError: 'KernelMetadata' object has no attribute 'cluster_dims'
```
Pull Request resolved: pytorch#167187 Approved by: https://github.com/oulgen
This PR uses `key in dict` expressions for existence checks of dict elements in Python code. This operation is more efficient than `key in dict.keys()`. Pull Request resolved: pytorch#167174 Approved by: https://github.com/mlazos
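For illustration, the two membership forms being compared (the efficiency claim is the PR's; exact timings will vary):
```python
d = {"a": 1, "b": 2}

# Preferred: membership test directly on the dict (hash lookup).
if "a" in d:
    pass

# Equivalent result, but goes through an intermediate keys view for no benefit.
if "a" in d.keys():
    pass
```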
This PR continues to fix or remove unused loop variables in tests. Pull Request resolved: pytorch#166921 Approved by: https://github.com/mlazos
…6951)" This reverts commit a74fe75. Reverted pytorch#166951 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#166951 (comment)))
…ass (pytorch#167080) Split of pytorch#162469 to be under 2K reorder iterative part Pull Request resolved: pytorch#167080 Approved by: https://github.com/eellison
…167089) Summary: Add MTIA as a native device type in PyTorch. Test Plan: CI Reviewed By: PatriceVignola Differential Revision: D80111801 Pull Request resolved: pytorch#167089 Approved by: https://github.com/andyanwang, https://github.com/nautsimon, https://github.com/albanD
…61035) torch.cuda.memory.set_per_process_memory_fraction allows setting an upper bound on how much device memory is allocated. This PR exposes this setting via an environment variable. For example, PYTORCH_CUDA_ALLOC_CONF="per_process_memory_fraction:0.5" will limit the device memory to half of the available memory. Pull Request resolved: pytorch#161035 Approved by: https://github.com/ngimel, https://github.com/eqy
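A minimal sketch of the two equivalent ways to apply the cap (the env-var route assumes it is set before CUDA is initialized in the process):
```python
import os

# Environment-variable route described above; set it before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "per_process_memory_fraction:0.5"

import torch

# Programmatic equivalent mentioned in the commit message.
if torch.cuda.is_available():
    torch.cuda.memory.set_per_process_memory_fraction(0.5)
```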
…rch#167081) Split of pytorch#162469 to be under 2K reorder iterative part Pull Request resolved: pytorch#167081 Approved by: https://github.com/eellison ghstack dependencies: pytorch#167080
I.e., remove the distinction between the two cases and always preload the full set of libraries. For some reason, when one uses `virtualenv` instead of `venv`, preloading `cudart` works, but it fails to find cudnn or cublasLT later on. Fix it by getting rid of the partial-preload logic for one of the cases and always preloading the full set of libraries.
Test plan on stock Ubuntu:
```
pip install virtualenv
virtualenv --symlinks -p python3.11 --prompt virtv venv-virt
source venv-virt/bin/activate
pip install torch
python -c 'import torch'
```
Fixes pytorch#165812 Pull Request resolved: pytorch#167046 Approved by: https://github.com/atalman
…n weight is sparse (pytorch#166071) As per the title. It seems safe to generalize to arbitrary contiguous inputs, since `at::matmul` is likely to do the flattening to avoid `baddmm`. Additionally, we check that the bias is 1D and contiguous, which guarantees it is fused with no copies. Pull Request resolved: pytorch#166071 Approved by: https://github.com/ngimel
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
Jenkins build for d81ea9c2635396625aef23cbae9ff6f6e373df0c commit finished as FAILURE
rocm_base: 3d74218