Releases · pytorch/helion
v0.1.1
What's Changed
- [Benchmark] Avoid using _run in TritonBench integration by @yf225 in #444
- Add H100 CI by @oulgen in #435
- Add B200 CI by @oulgen in #436
- Skip illegal memory access for autotuning by @oulgen in #453
- Re-enable associative_scan tests in ref eager mode by @yf225 in #443
- Fix tritonbench integration issue by @yf225 in #463
- [Benchmark] Allow passing kwargs; Set static_shape = True for better benchmark perf by @yf225 in #465
- [Example] One shot all reduce by @joydddd in #245
- Fix lint by @oulgen in #469
- Improve signal/wait doc by @joydddd in #478
- Cleanup ci by @oulgen in #449
- Run CI on mi325x by @oulgen in #441
- Improve Stacktensor Doc by @joydddd in #479
- Require tests to be faster than 30s by @oulgen in #471
- Improve error message when no good config is found by @oulgen in #455
- Add SequenceType Eq comparison by @oulgen in #482
- [Benchmark] Add try-catch for tritonbench import path by @yf225 in #487
- Add helion prefix to Triton kernel name by @yf225 in #486
- Support GraphModule inputs by @jansel in #488
- Improve stack trace for #457 by @jansel in #489
- [EZ] Replace `pytorch-labs` with `meta-pytorch` by @ZainRizvi in #490
- [generate_ast] Provide AST args, and fall back to `api._codegen` when output is a tuple by @HanGuo97 in #481
- Support reshape with block_size expressions by @yf225 in #495
- [example] add jagged_softmax example by @pianpwk in #480
- Fix handling of fixed size reductions by @jansel in #499
- Improve error message for rank mismatch in control flow by @jansel in #502
- Fix reshape + sum case by @yf225 in #504
- Sort config keys alphabetically in `__str__` by @yf225 in #505
- Fix issue with fp64 constants by @jansel in #506
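The `__str__` entry above (#505) makes config printouts deterministic. A minimal stdlib-only sketch of the idea, using a hypothetical `Config` class (not Helion's actual implementation):

```python
# Hypothetical sketch, not Helion's real Config class: printing config keys
# in sorted order gives a stable, diff-friendly string representation.
class Config:
    def __init__(self, **kwargs):
        self.values = dict(kwargs)

    def __str__(self):
        # Sort keys alphabetically so two equal configs always print identically,
        # regardless of the order the keyword arguments were passed in.
        items = ", ".join(f"{k}={self.values[k]!r}" for k in sorted(self.values))
        return f"Config({items})"

print(Config(num_warps=4, block_size=64))  # → Config(block_size=64, num_warps=4)
```

Stable string output matters when configs are hashed, cached, or diffed in test journals.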
New Contributors
- @ZainRizvi made their first contribution in #490
- @HanGuo97 made their first contribution in #481
- @pianpwk made their first contribution in #480
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- Ref-eager and normal modes can share cache by @oulgen in #421
- Helion examples by @sekyondaMeta in #401
- Add extensive setter/getter unit tests for indexed tensor; fix bugs discovered by new tests by @yf225 in #422
- Always set triton allocator by @jansel in #416
- Add stacked tensor by @joydddd in #346
- Change references to pytorch-labs to pytorch by @oulgen in #430
- [BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) by @joydddd in #261
- Fix: test/test_signal_pad by @joydddd in #432
- Fix test/test_stack_tensor.py by @oulgen in #431
- Skip associative_scan tests in ref eager mode by @yf225 in #433
- Fix scalar value assignment to tensor slices by @yf225 in #424
- Fix scalar tensor broadcasting in type propagation by @yf225 in #425
- Fix strided slice support for static slices (e.g., buf[::2]) by @yf225 in #426
- Better fix for triton allocator error by @jansel in #427
- Make bullet points clickable by @sekyondaMeta in #428
New Contributors
- @sekyondaMeta made their first contribution in #401
Full Changelog: v0.0.12...v0.1.0
v0.0.12
What's Changed
- Use autotuner's BoundKernel in caching by @oulgen in #388
- Temporarily move triton_key import to inner function to unblock older torch versions by @oulgen in #395
- [Examples] Add matmul variants with bias support and tests by @yf225 in #379
- [Benchmark] Support kernel variants; setup matmul tritonbench integration by @yf225 in #380
- Relax tolerance for test_input_float16_acc_float16_dynamic_shape by @yf225 in #399
- [Benchmark] Enable CSV output; clean up benchmark hot path by @yf225 in #398
- [Benchmark] Move per-operator settings from example file to benchmarks/run.py by @yf225 in #403
- [Ref Mode] PyTorch reference mode (eager only) by @yf225 in #339
- Clean up CI & fix caching by @oulgen in #408
- Add autotuner_fn argument to @helion.kernel for custom autotuners by @oulgen in #394
- Fix non-tuple indexing warning by @jansel in #411
- Add support for listcomp by @jansel in #412
- Fix allow_tf32 warning by @jansel in #413
- Update URL to helionlang.com by @jansel in #414
- Add metaclass [] syntax for cache classes by @jansel in #415
- [Ref Mode] Expand ref eager mode support to more hl.* APIs (e.g. load / store / scan / reduce) by @yf225 in #410
- [Benchmark] Fix arg parsing issue in tritonbench integration by @yf225 in #417
Full Changelog: v0.0.11...v0.0.12
v0.0.11
What's Changed
- Add tl._experimental_make_tensor_descriptor restrictions by @oulgen in #331
- Skip accuracy check for test_moe_matmul_ogs by @yf225 in #333
- Do not create a new variable for tile assignments since tiles are immutable by @oulgen in #334
- Clean pyright warning by @oulgen in #335
- Run lint with nightly (match test CI) by @oulgen in #349
- Refactor BoundKernel in memory caching by @oulgen in #351
- Allow string literal args by @jansel in #353
- Fix issue with integer in rolled reduction by @jansel in #354
- Fix test_fp8_attention expected by @jansel in #355
- Write test workflow without pt deps by @oulgen in #352
- Refresh the action cache once a month by @oulgen in #362
- Use bare nvidia cuda docker image by @oulgen in #363
- Inline install triton by @oulgen in #364
- Remove triton's conda deps by @oulgen in #365
- Make lint workflow leaner by @oulgen in #366
- Swap from conda to uv for lint workflow by @oulgen in #367
- Swap from conda to uv on test workflow by @oulgen in #368
- Fix fp16 var_mean multi-output issue by @jansel in #357
- Add fallbacks for unary ops that don't support fp16 by @jansel in #361
- Name the cache step so we can check its outputs by @oulgen in #369
- Fix pyright errors by @oulgen in #370
- Only use Tensor Descriptor indexing with appropriate shapes by @PaulZhang12 in #360
- Remerge LayerNorm (#348) by @PaulZhang12 in #373
- Do not crash autotuner on more triton/llvm/cuda errors seen on B200 by @oulgen in #374
- Set MAX_JOBS=4 for tritonbench build to avoid OOM by @yf225 in #376
- [Benchmark] Allow running a specific shard of input via --input-shard M/N cli arg by @yf225 in #377
- Use venv for pip install on lint by @oulgen in #381
- [RFC] Implement basic on disk caching by @oulgen in #336
- Add hl.dot() API; Use hl.dot instead of torch.matmul for FP8 GEMM ops in Helion kernel by @yf225 in #356
- Fix test_inline_asm_packed expected output due to upstream PyTorch change by @yf225 in #385
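The on-disk caching entry above (#336) persists autotuning results across runs. A hedged stdlib-only sketch of the general technique (key names, paths, and functions here are illustrative assumptions, not Helion's actual implementation):

```python
# Hypothetical sketch of on-disk config caching (not Helion's real code):
# cache the best autotuned config keyed by a hash of the kernel source
# plus a signature of the input shapes/dtypes.
import hashlib
import json
import os
import tempfile

CACHE_DIR = os.path.join(tempfile.gettempdir(), "helion_cache_demo")

def cache_key(kernel_source: str, input_signature: str) -> str:
    # Any change to the kernel body or its input signature invalidates the entry.
    return hashlib.sha256((kernel_source + input_signature).encode()).hexdigest()

def load_config(key: str):
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None  # cache miss: caller falls back to autotuning

def store_config(key: str, config: dict) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(os.path.join(CACHE_DIR, key + ".json"), "w") as f:
        json.dump(config, f)

key = cache_key("def add(x, y): ...", "f32[1024],f32[1024]")
store_config(key, {"block_size": 64, "num_warps": 4})
print(load_config(key))  # → {'block_size': 64, 'num_warps': 4}
```

The same hashing scheme explains why ref-eager and normal modes can share a cache (#421): the key depends only on the kernel and inputs, not the execution mode.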
Full Changelog: v0.0.10...v0.0.11
v0.0.10
What's Changed
- [Benchmark] Add initial TritonBench integration and vector_add benchmark example by @yf225 in #247
- Add static_range by @joydddd in #235
- Cleanup/improve docstrings by @jansel in #250
- [Benchmark] Add embedding benchmark by @yf225 in #248
- [Benchmark] Add vector_exp benchmark by @yf225 in #249
- Add rms_norm example and test by @yf225 in #252
- [Benchmark] Add rms_norm benchmark by @yf225 in #253
- Strip extra newlines from *.expected files by @jansel in #255
- Fix issue with BLOCK_SIZE0.to(torch.int32) by @jansel in #254
- Add hl.wait & AllGather Matmul example (via hl_ext helper). by @joydddd in #189
- Add sum example and test by @yf225 in #256
- [Benchmark] Add sum to TritonBench integration by @yf225 in #257
- Rename benchmark folder by @yf225 in #258
- Add hl.signal by @joydddd in #233
- Add hl.wait for simultaneous waiting on multiple gmem barriers by @joydddd in #243
- Swap to using pyright by @oulgen in #259
- Fix pyright errors in type_propagation.py by @yf225 in #266
- [BE] Add spellchecker by @oulgen in #265
- Remove pyre-ignore/pyre-fixme calls by @jansel in #274
- Improve typing for helion.kernel by @jansel in #270
- Add jagged_mean example by @yf225 in #263
- [Benchmark] Add jagged_mean tritonbench integration by @yf225 in #264
- Add fp8_gemm example and test by @yf225 in #267
- [Benchmark] Add fp8_gemm to TritonBench integration by @yf225 in #268
- Fix some pyright errors by @jansel in #276
- Remove unused exception types by @jansel in #271
- Fix docstring see also lists by @jansel in #272
- [benchmarks] Change tritonbench api by @xuzhao9 in #260
- Initial version of documentation by @jansel in #273
- Deploy docs to github pages by @jansel in #277
- Fix lint error on main by @jansel in #281
- Add a link to the documentation by @jansel in #282
- [Benchmark] Fix tritonbench integration due to upstream changes by @yf225 in #278
- [Benchmark] Allow using 'python benchmarks/run.py' to run all kernels by @yf225 in #280
- Add implicit broadcasting tests by @jansel in #285
- Add additional tl.range choices to persistent loop by @jansel in #287
- Update autotuning example in docs by @jansel in #288
- Add host side dead code elimination by @oulgen in #289
- [Benchmark] Add attention tritonbench integration by @yf225 in #284
- Add helion.exc.CannotModifyHostVariableOnDevice and helion.exc.CannotReadDeviceVariableOnHost by @jansel in #290
- Fix unstable CI by @jansel in #299
- Make to_triton_code config arg optional by @jansel in #291
- Add helion.exc.DeviceTensorSubscriptAssignmentNotAllowed by @jansel in #292
- Remove default configs from examples by @jansel in #295
- Fix bug with tensor descriptor and small block size by @jansel in #296
- Relax typing for CombineFunction by @jansel in #297
- Add examples/segment_reduction.py by @jansel in #300
- Add error for using a host tensor directly by @jansel in #306
- Improve Tensor.item() handling by @jansel in #307
- Fix type_info null errors by @oulgen in #294
- Improve DCE by marking math functions as pure by @oulgen in #312
- [Benchmark] Add softmax tritonbench integration by @yf225 in #286
- Make imports relative by @jansel in #310
- Generalize l2_grouping to support 3+ dimensions by @jansel in #313
- Remove make_precompiler generated wrapper by @jansel in #314
- Enforce ANN/PGH lints by @jansel in #315
- Support dynamic fill value to hl.full by @jansel in #316
- Use tensor device reference in persistent kernels by @jansel in #317
- Add tl._experimental_make_tensor_descriptor support by @oulgen in #322
- Fix variable scoping in nested loops for multi-pass kernels by @yf225 in #324
- Add HELION_DEV_LOW_VRAM env var for low GPU memory machines by @yf225 in #325
- Add cross_entropy example and unit test by @yf225 in #320
- [Benchmark] Add cross_entropy to tritonbench integration by @yf225 in #321
- Add literal index into tuple by @joydddd in #327
- Improve naming for generated helper functions by @jansel in #323
- Add hl.inline_asm_elementwise by @jansel in #328
- Implement static tuple unrolling and hl.static_range by @jansel in #329
- Add fp8_attention example and unit test by @yf225 in #318
- [Benchmark] Add fp8_attention to tritonbench integration by @yf225 in #319
Full Changelog: v0.0.9...v0.0.10
v0.0.9
What's Changed
- Add tl.range warp_specialize to autotuner by @jansel in #230
- Switch from TensorDescriptor to tl.make_tensor_descriptor by @jansel in #232
- Enable tests fixed by #195 by @joydddd in #236
- Implement persistent kernels by @jansel in #238
- Add hl.associative_scan by @jansel in #239
- Fix failing tests on main by @jansel in #244
- Add hl.reduce by @jansel in #240
- Switch from expecttest/assertExpectedInline to assertExpectedJournal by @jansel in #241
Full Changelog: v0.0.8...v0.0.9
v0.0.8
What's Changed
- Improve loop end bound optimization for nested tiling by @jansel in #192
- Set default dot_precision to TRITON_F32_DEFAULT by @jansel in #197
- Use _disable_flatten_get_tile helper in tile_id by @jansel in #200
- Throw type errors immediately by @jansel in #202
- Fix typo in LiteralType.merge by @jansel in #201
- Add support for global statements in type propagation by @jansel in #203
- Remove ErrorReporting class and simplify warning handling by @jansel in #204
- Add InvalidDeviceForLoop exception type by @jansel in #205
- Fix bug with renamed variable flowing into phi() node by @jansel in #206
- Move hl.grid tests to their own file by @jansel in #208
- Remove NDGridTileStrategy by @jansel in #209
- Simplify codegen for hl.grid by @jansel in #210
- Add support for hl.grid(begin, end, step) by @jansel in #211
- Support range() loops (alias for hl.grid) by @jansel in #212
- Move yz_grid disabling logic to ConfigSpec by @jansel in #213
- Relax chebyshev kernel test tolerance by @jansel in #214
- [RFC] Add static loop unrolling by @oulgen in #216
- Add support for torch.arange by @jansel in #215
- Fix a performance issue with Helion-emitted Flash Attention by @manman-ren in #181
- Fix issue with phi nodes and aliasing by @jansel in #220
- Fix duplicate argument handling in inductor lowering by @jansel in #222
- x[i] returns scalar when i=scalar by @joydddd in #223
- Fix config flatten spec for tile.id by @joydddd in #224
- Fix failing tests on main by @jansel in #231
- Refactor examples to use run_example helper by @jansel in #225
- Add tl.range loop_unroll_factor to autotuner by @jansel in #226
- Add tl.range num_stages to autotuner by @jansel in #227
- Add tl.range disallow_acc_multi_buffer to autotuner by @jansel in #228
- Add tl.range flatten to autotuner by @jansel in #229
New Contributors
- @manman-ren made their first contribution in #181
Full Changelog: v0.0.7...v0.0.8
v0.0.7
What's Changed
- Fix bug with computations based on hl.register_block_size by @jansel in #157
- Generalize workaround for unbacked size hints by @jansel in #159
- Don't hardcode cuda in test files by @jansel in #160
- Move register_block_size/register_reduction_dim to tunable_ops.py by @jansel in #161
- Unskip some previously failing tests by @jansel in #162
- Use workflow matrix to deduplicate code by @oulgen in #168
- Rename TileIndexProxy to hl.Tile by @jansel in #171
- Fix block size variable handling and atomic operations with symints by @jansel in #177
- Codegen `if tl.sum(one_elem_tensor):` instead of `if one_elem_tensor:` by @yf225 in #158
- Fix visitCall in deviceIR. Always visit argument nodes by @joydddd in #180
- Relax bounds on test_mask_dot by @oulgen in #182
- Add lowering for Constant assignment by @joydddd in #187
- Expose tile.id by @joydddd in #188
- Do not precompile set configs by @oulgen in #183
- Add option to ban/disallow autotuning by @oulgen in #184
- Recommend PyTorch nightly build in readme by @jansel in #193
- Fix issue with ConfigSpec mutation in codegen by @jansel in #195
- enable_python_dispatcher() in propagate_types by @laithsakka in #191
New Contributors
- @laithsakka made their first contribution in #191
Full Changelog: v0.0.6...v0.0.7
v0.0.6
What's Changed
- Fix ast read writes by @oulgen in #148
- Update pre-commit by @oulgen in #149
- Try enable test_moe_matmul_ogs on CI by @yf225 in #147
- [Ready for review] Add support for print(prefix_str, *tensors) by @yf225 in #140
- Support hl.tile_{begin,end,block_size} by @jansel in #150
- Rename TileStrategy.get_block_index to CompileEnvironment.get_block_id by @jansel in #151
- Fix bug in merging sequence types by @jansel in #152
- Increase atol for test_matmul_split_k by @jansel in #155
- Fix bug in test_matmul_split_k by @jansel in #156
- Add hl.register_tunable by @jansel in #154
Full Changelog: v0.0.5...v0.0.6
v0.0.5
What's Changed
- Rename linter/check_main.py -> scripts/lint_examples_main.py by @jansel in #124
- Improve error message for unpacking a tile by @jansel in #125
- Improve error message for overpacked tiles by @jansel in #126
- [BC breaking] Simplify block size configs by @jansel in #127
- Refactor reduction loop config spec by @jansel in #128
- Move BlockIdSequence to its own file by @jansel in #129
- Do not print output code during autotuning by @jansel in #130
- Make helion.exc.TensorOperationInWrapper not fire on non-torch ops by @jansel in #131
- Add HELION_FORCE_AUTOTUNE=1 and update readme by @jansel in #132
- Correct units for time printouts by @jansel in #133
- Rename block_size_idx to block_id by @jansel in #134
- Rename block_indices to block_ids by @jansel in #135
- Add Pyre Pre-Commit Hook by @lolpack in #136
- Update .pre-commit-config.yaml by @oulgen in #137
- [Ready for review] Add hl.register_reduction_dim(); add support for matmul+layernorm example by @yf225 in #80
- Fix bug with errors on unreachable if branch by @jansel in #138
- [Error Message] Update block config size length mismatch by @drisspg in #139
- Increase atol/rtol for test_error_in_non_taken_branch by @jansel in #142
- Fix some typos by @jansel in #141
- More fair comparison by @drisspg in #146
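The `HELION_FORCE_AUTOTUNE` entry above (#132) adds an environment-variable switch. A minimal usage sketch, assuming the semantics described in that PR title (re-run autotuning even when a config is already available):

```shell
# Force autotuning rather than reusing an existing config (#132).
export HELION_FORCE_AUTOTUNE=1
```

Set it in the shell before launching the script that invokes the kernel, so the setting is visible to the process environment.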
Full Changelog: v0.0.4...v0.0.5