Releases: pytorch/helion

v0.1.1

21 Aug 20:40
2e1ea33

What's Changed

  • [Benchmark] Avoid using _run in TritonBench integration by @yf225 in #444
  • Add H100 CI by @oulgen in #435
  • Add B200 CI by @oulgen in #436
  • Skip illegal memory access for autotuning by @oulgen in #453
  • Re-enable associative_scan tests in ref eager mode by @yf225 in #443
  • Fix tritonbench integration issue by @yf225 in #463
  • [Benchmark] Allow passing kwargs; Set static_shape = True for better benchmark perf by @yf225 in #465
  • [Example] One shot all reduce by @joydddd in #245
  • Fix lint by @oulgen in #469
  • Improve signal/wait doc by @joydddd in #478
  • Cleanup ci by @oulgen in #449
  • Run CI on mi325x by @oulgen in #441
  • Improve StackTensor doc by @joydddd in #479
  • Require tests to be faster than 30s by @oulgen in #471
  • Improve error message when no good config is found by @oulgen in #455
  • Add SequenceType Eq comparison by @oulgen in #482
  • [Benchmark] Add try-catch for tritonbench import path by @yf225 in #487
  • Add helion prefix to Triton kernel name by @yf225 in #486
  • Support GraphModule inputs by @jansel in #488
  • Improve stack trace for #457 by @jansel in #489
  • [EZ] Replace pytorch-labs with meta-pytorch by @ZainRizvi in #490
  • [generate_ast] providing AST args, and fall back to api._codegen when output is a tuple by @HanGuo97 in #481
  • Support reshape with block_size expressions by @yf225 in #495
  • [example] add jagged_softmax example by @pianpwk in #480
  • Fix handling of fixed size reductions by @jansel in #499
  • Improve error message for rank mismatch in control flow by @jansel in #502
  • Fix reshape + sum case by @yf225 in #504
  • Sort config keys alphabetically in __str__ by @yf225 in #505
  • Fix issue with fp64 constants by @jansel in #506

Full Changelog: v0.1.0...v0.1.1

v0.1.0

06 Aug 20:37
f105b05

What's Changed

  • Ref-eager and normal modes can share cache by @oulgen in #421
  • Helion examples by @sekyondaMeta in #401
  • Add extensive setter/getter unit tests for indexed tensor; fix bugs discovered by new tests by @yf225 in #422
  • Always set triton allocator by @jansel in #416
  • Add stacked tensor by @joydddd in #346
  • Change references to pytorch-labs to pytorch by @oulgen in #430
  • [BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) by @joydddd in #261
  • Fix: test/test_signal_pad by @joydddd in #432
  • Fix test/test_stack_tensor.py by @oulgen in #431
  • Skip associative_scan tests in ref eager mode by @yf225 in #433
  • Fix scalar value assignment to tensor slices by @yf225 in #424
  • Fix scalar tensor broadcasting in type propagation by @yf225 in #425
  • Fix strided slice support for static slices (e.g., buf[::2]) by @yf225 in #426
  • Better fix for triton allocator error by @jansel in #427
  • Make bullet points clickable by @sekyondaMeta in #428

Full Changelog: v0.0.12...v0.1.0

v0.0.12

04 Aug 21:31
642836c

What's Changed

  • Use autotuner's BoundKernel in caching by @oulgen in #388
  • Temporarily move triton_key import to inner function to unblock older torch versions by @oulgen in #395
  • [Examples] Add matmul variants with bias support and tests by @yf225 in #379
  • [Benchmark] Support kernel variants; setup matmul tritonbench integration by @yf225 in #380
  • Relax tolerance for test_input_float16_acc_float16_dynamic_shape by @yf225 in #399
  • [Benchmark] Enable CSV output; clean up benchmark hot path by @yf225 in #398
  • [Benchmark] Move per-operator settings from example file to benchmarks/run.py by @yf225 in #403
  • [Ref Mode] PyTorch reference mode (eager only) by @yf225 in #339
  • Clean up CI & fix caching by @oulgen in #408
  • Add autotuner_fn argument to @helion.kernel for custom autotuners by @oulgen in #394
  • Fix non-tuple indexing warning by @jansel in #411
  • Add support for listcomp by @jansel in #412
  • Fix allow_tf32 warning by @jansel in #413
  • Update URL to helionlang.com by @jansel in #414
  • Add metaclass [] syntax for cache classes by @jansel in #415
  • [Ref Mode] Expand ref eager mode support to more hl.* APIs (e.g. load / store / scan / reduce) by @yf225 in #410
  • [Benchmark] Fix arg parsing issue in tritonbench integration by @yf225 in #417

Full Changelog: v0.0.11...v0.0.12

v0.0.11

28 Jul 21:45
3c3c64a

What's Changed

  • Add tl._experimental_make_tensor_descriptor restrictions by @oulgen in #331
  • Skip accuracy check for test_moe_matmul_ogs by @yf225 in #333
  • Do not create a new variable for tile assignments since tiles are immutable by @oulgen in #334
  • Clean pyright warning by @oulgen in #335
  • Run lint with nightly (match test CI) by @oulgen in #349
  • Refactor BoundKernel in memory caching by @oulgen in #351
  • Allow string literal args by @jansel in #353
  • Fix issue with integer in rolled reduction by @jansel in #354
  • Fix test_fp8_attention expected by @jansel in #355
  • Write test workflow without pt deps by @oulgen in #352
  • Refresh the action cache once a month by @oulgen in #362
  • Use bare nvidia cuda docker image by @oulgen in #363
  • Inline install triton by @oulgen in #364
  • Remove triton's conda deps by @oulgen in #365
  • Make lint workflow leaner by @oulgen in #366
  • Swap from conda to uv for lint workflow by @oulgen in #367
  • Swap from conda to uv on test workflow by @oulgen in #368
  • Fix fp16 var_mean multi-output issue by @jansel in #357
  • Add fallbacks for unary ops that don't support fp16 by @jansel in #361
  • Name the cache step so we can check its outputs by @oulgen in #369
  • Fix pyright errors by @oulgen in #370
  • Only use Tensor Descriptor indexing with appropriate shapes by @PaulZhang12 in #360
  • Remerge LayerNorm (#348) by @PaulZhang12 in #373
  • Do not crash autotuner on more triton/llvm/cuda errors seen on B200 by @oulgen in #374
  • Set MAX_JOBS=4 for tritonbench build to avoid OOM by @yf225 in #376
  • [Benchmark] Allow running a specific shard of input via --input-shard M/N cli arg by @yf225 in #377
  • Use venv for pip install on lint by @oulgen in #381
  • [RFC] Implement basic on disk caching by @oulgen in #336
  • Add hl.dot() API; Use hl.dot instead of torch.matmul for FP8 GEMM ops in Helion kernel by @yf225 in #356
  • Fix test_inline_asm_packed expected output due to upstream PyTorch change by @yf225 in #385

Full Changelog: v0.0.10...v0.0.11

v0.0.10

17 Jul 18:14
7d01817

What's Changed

  • [Benchmark] Add initial TritonBench integration and vector_add benchmark example by @yf225 in #247
  • Add static_range by @joydddd in #235
  • Cleanup/improve docstrings by @jansel in #250
  • [Benchmark] Add embedding benchmark by @yf225 in #248
  • [Benchmark] Add vector_exp benchmark by @yf225 in #249
  • Add rms_norm example and test by @yf225 in #252
  • [Benchmark] Add rms_norm benchmark by @yf225 in #253
  • Strip extra newlines from *.expected files by @jansel in #255
  • Fix issue with BLOCK_SIZE0.to(torch.int32) by @jansel in #254
  • Add hl.wait & AllGather Matmul example (via hl_ext helper). by @joydddd in #189
  • Add sum example and test by @yf225 in #256
  • [Benchmark] Add sum to TritonBench integration by @yf225 in #257
  • Rename benchmark folder by @yf225 in #258
  • Add hl.signal by @joydddd in #233
  • Add hl.wait for simultaneous waiting for multiple gmem barriers by @joydddd in #243
  • Swap to using pyright by @oulgen in #259
  • Fix pyright errors in type_propagation.py by @yf225 in #266
  • [BE] Add spellchecker by @oulgen in #265
  • Remove pyre-ignore/pyre-fixme calls by @jansel in #274
  • Improve typing for helion.kernel by @jansel in #270
  • Add jagged_mean example by @yf225 in #263
  • [Benchmark] Add jagged_mean tritonbench integration by @yf225 in #264
  • Add fp8_gemm example and test by @yf225 in #267
  • [Benchmark] Add fp8_gemm to TritonBench integration by @yf225 in #268
  • Fix some pyright errors by @jansel in #276
  • Remove unused exception types by @jansel in #271
  • Fix docstring see also lists by @jansel in #272
  • [benchmarks] Change tritonbench api by @xuzhao9 in #260
  • Initial version of documentation by @jansel in #273
  • Deploy docs to github pages by @jansel in #277
  • Fix lint error on main by @jansel in #281
  • Add a link to the documentation by @jansel in #282
  • [Benchmark] Fix tritonbench integration due to upstream changes by @yf225 in #278
  • [Benchmark] Allow using 'python benchmarks/run.py' to run all kernels by @yf225 in #280
  • Add implicit broadcasting tests by @jansel in #285
  • Add additional tl.range choices to persistent loop by @jansel in #287
  • Update autotuning example in docs by @jansel in #288
  • Add host side dead code elimination by @oulgen in #289
  • [Benchmark] Add attention tritonbench integration by @yf225 in #284
  • Add helion.exc.CannotModifyHostVariableOnDevice and helion.exc.CannotReadDeviceVariableOnHost by @jansel in #290
  • Fix unstable CI by @jansel in #299
  • Make to_triton_code config arg optional by @jansel in #291
  • Add helion.exc.DeviceTensorSubscriptAssignmentNotAllowed by @jansel in #292
  • Remove default configs from examples by @jansel in #295
  • Fix bug with tensor descriptor and small block size by @jansel in #296
  • Relax typing for CombineFunction by @jansel in #297
  • Add examples/segment_reduction.py by @jansel in #300
  • Add error for using a host tensor directly by @jansel in #306
  • Improve Tensor.item() handling by @jansel in #307
  • Fix type_info null errors by @oulgen in #294
  • Improve DCE by marking math functions as pure by @oulgen in #312
  • [Benchmark] Add softmax tritonbench integration by @yf225 in #286
  • Make imports relative by @jansel in #310
  • Generalize l2_grouping to support 3+ dimensions by @jansel in #313
  • Remove make_precompiler generated wrapper by @jansel in #314
  • Enforce ANN/PGH lints by @jansel in #315
  • Support dynamic fill value to hl.full by @jansel in #316
  • Use tensor device reference in persistent kernels by @jansel in #317
  • Add tl._experimental_make_tensor_descriptor support by @oulgen in #322
  • Fix variable scoping in nested loops for multi-pass kernels by @yf225 in #324
  • Add HELION_DEV_LOW_VRAM env var for low GPU memory machines by @yf225 in #325
  • Add cross_entropy example and unit test by @yf225 in #320
  • [Benchmark] Add cross_entropy to tritonbench integration by @yf225 in #321
  • Add literal index into tuple by @joydddd in #327
  • Improve naming for generated helper functions by @jansel in #323
  • Add hl.inline_asm_elementwise by @jansel in #328
  • Implement static tuple unrolling and hl.static_range by @jansel in #329
  • Add fp8_attention example and unit test by @yf225 in #318
  • [Benchmark] Add fp8_attention to tritonbench integration by @yf225 in #319

Full Changelog: v0.0.9...v0.0.10

v0.0.9

08 Jul 19:27
902741b

What's Changed

Full Changelog: v0.0.8...v0.0.9

v0.0.8

01 Jul 15:16
43faf72

What's Changed

  • Improve loop end bound optimization for nested tiling by @jansel in #192
  • Set default dot_precision to TRITON_F32_DEFAULT by @jansel in #197
  • Use _disable_flatten_get_tile helper in tile_id by @jansel in #200
  • Throw type errors immediately by @jansel in #202
  • Fix typo in LiteralType.merge by @jansel in #201
  • Add support for global statements in type propagation by @jansel in #203
  • Remove ErrorReporting class and simplify warning handling by @jansel in #204
  • Add InvalidDeviceForLoop exception type by @jansel in #205
  • Fix bug with renamed variable flowing into phi() node by @jansel in #206
  • Move hl.grid tests to their own file by @jansel in #208
  • Remove NDGridTileStrategy by @jansel in #209
  • Simplify codegen for hl.grid by @jansel in #210
  • Add support for hl.grid(begin, end, step) by @jansel in #211
  • Support range() loops (alias for hl.grid) by @jansel in #212
  • Move yz_grid disabling logic to ConfigSpec by @jansel in #213
  • Relax chebyshev kernel test tolerance by @jansel in #214
  • [RFC] Add static loop unrolling by @oulgen in #216
  • Add support for torch.arange by @jansel in #215
  • Fix a performance issue with Helion-emitted Flash Attention by @manman-ren in #181
  • Fix issue with phi nodes and aliasing by @jansel in #220
  • Fix duplicate argument handling in inductor lowering by @jansel in #222
  • x[i] returns scalar when i=scalar by @joydddd in #223
  • Fix config flatten spec for tile.id by @joydddd in #224
  • Fix failing tests on main by @jansel in #231
  • Refactor examples to use run_example helper by @jansel in #225
  • Add tl.range loop_unroll_factor to autotuner by @jansel in #226
  • Add tl.range num_stages to autotuner by @jansel in #227
  • Add tl.range disallow_acc_multi_buffer to autotuner by @jansel in #228
  • Add tl.range flatten to autotuner by @jansel in #229

Full Changelog: v0.0.7...v0.0.8

v0.0.7

18 Jun 18:06
248ece6

What's Changed

  • Fix bug with computations based on hl.register_block_size by @jansel in #157
  • Generalize workaround for unbacked size hints by @jansel in #159
  • Don't hardcode cuda in test files by @jansel in #160
  • Move register_block_size/register_reduction_dim to tunable_ops.py by @jansel in #161
  • Unskip some previously failing tests by @jansel in #162
  • Use workflow matrix to deduplicate code by @oulgen in #168
  • Rename TileIndexProxy to hl.Tile by @jansel in #171
  • Fix block size variable handling and atomic operations with symints by @jansel in #177
  • Codegen if tl.sum(one_elem_tensor): instead of if one_elem_tensor by @yf225 in #158
  • Fix visitCall in deviceIR. Always visit argument nodes by @joydddd in #180
  • Relax bounds on test_mask_dot by @oulgen in #182
  • Add lowering for Constant assignment by @joydddd in #187
  • Expose tile.id by @joydddd in #188
  • Do not precompile set configs by @oulgen in #183
  • Add option to ban/disallow autotuning by @oulgen in #184
  • Recommend PyTorch nightly build in readme by @jansel in #193
  • Fix issue with ConfigSpec mutation in codegen by @jansel in #195
  • enable_python_dispatcher() in propagate_types by @laithsakka in #191

Full Changelog: v0.0.6...v0.0.7

v0.0.6

12 Jun 20:42
b9e93c0

What's Changed

  • Fix ast read writes by @oulgen in #148
  • Update pre-commit by @oulgen in #149
  • Try enable test_moe_matmul_ogs on CI by @yf225 in #147
  • [Ready for review] Add support for print(prefix_str, *tensors) by @yf225 in #140
  • Support hl.tile_{begin,end,block_size} by @jansel in #150
  • Rename TileStrategy.get_block_index to CompileEnvironment.get_block_id by @jansel in #151
  • Fix bug in merging sequence types by @jansel in #152
  • Increase atol for test_matmul_split_k by @jansel in #155
  • Fix bug in test_matmul_split_k by @jansel in #156
  • Add hl.register_tunable by @jansel in #154

Full Changelog: v0.0.5...v0.0.6

v0.0.5

09 Jun 15:53
9a9f3e7

What's Changed

  • Rename linter/check_main.py -> scripts/lint_examples_main.py by @jansel in #124
  • Improve error message for unpacking a tile by @jansel in #125
  • Improve error message for overpacked tiles by @jansel in #126
  • [BC breaking] Simplify block size configs by @jansel in #127
  • Refactor reduction loop config spec by @jansel in #128
  • Move BlockIdSequence to its own file by @jansel in #129
  • Do not print output code during autotuning by @jansel in #130
  • Make helion.exc.TensorOperationInWrapper not fire on non-torch ops by @jansel in #131
  • Add HELION_FORCE_AUTOTUNE=1 and update readme by @jansel in #132
  • Correct units for time printouts by @jansel in #133
  • Rename block_size_idx to block_id by @jansel in #134
  • Rename block_indices to block_ids by @jansel in #135
  • Add Pyre Pre-Commit Hook by @lolpack in #136
  • Update .pre-commit-config.yaml by @oulgen in #137
  • [Ready for review] Add hl.register_reduction_dim(); add support for matmul+layernorm example by @yf225 in #80
  • Fix bug with errors on unreachable if branch by @jansel in #138
  • [Error Message] Update block config size length mismatch by @drisspg in #139
  • Increase atol/rtol for test_error_in_non_taken_branch by @jansel in #142
  • Fix some typos by @jansel in #141
  • More fair comparison by @drisspg in #146

Full Changelog: v0.0.4...v0.0.5