feat: kernel hub introduction draft #2777
Conversation
Nice, looking great! I did a quick early pass, feel free to ping again when you want!
Nice! But it's too wide I think; it will be cropped at the sides, possibly hiding part of the title. The recommended aspect ratio is 2:1.
thanks! updated to be 2:1 in the latest commits
Reminder that we also have to add an entry to `_blog.yml` when you are ready to submit.
oh thanks for the tip, added an entry in the latest commit (and will make sure to bump when the article is ready)
thumbnail: /blog/assets/hello-hf-kernels/kernel-hub-five-mins-short.png
authors:
- user: drbh
date: 2025-03-28
Date goes in `_blog.yml` using a format like "March 28, 2025".
thanks! updated in the latest commits
hello-hf-kernels.md (Outdated)

# 🏎️ Learn the Hugging Face Kernel Hub in 5 Minutes

**Unlock performance boosts for your models with pre-optimized compute kernels, easily loaded from the Hub.**
Suggested change:
- **Unlock performance boosts for your models with pre-optimized compute kernels, easily loaded from the Hub.**
+ **Boost your model performance with pre-optimized kernels, easily loaded from the Hub.**

Maybe, for simplification?
thanks! updated in the latest commits
hello-hf-kernels.md (Outdated)

**Unlock performance boosts for your models with pre-optimized compute kernels, easily loaded from the Hub.**

Today, we'll explore an exciting development from Hugging Face: the **Kernel Hub**! As ML practitioners, we know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub aims to simplify this dramatically.
Suggested change:
- Today, we'll explore an exciting development from Hugging Face: the **Kernel Hub**! As ML practitioners, we know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub aims to simplify this dramatically.
+ Today, we'll explore an exciting development from Hugging Face: the **Kernel Hub**! As ML practitioners, we know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub simplifies this process dramatically!
oh this is better, updated in latest commit
hello-hf-kernels.md (Outdated)

expected = torch.tensor(
    [
        [0.1100, 2.1309, -0.0700, 0.6802],
        [-0.0500, 0.4800, -0.1700, -0.1700],
        [0.3701, -0.1300, -0.0800, -0.1200],
        [-0.0400, 0.1200, -0.1500, 1.7998],
    ],
    dtype=torch.float16,
    device=DEVICE,
)
Perhaps an alternative could be to retrieve the reference results from PyTorch's gelu?
yea agreed that is a better example, updated in latest commit
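For reference, the suggested check can boil down to a sketch like this (the kernel entry point name `gelu_fast` is an assumption here; the actual name and signature depend on how the repo on the Hub is structured):

~~~python
import torch
import torch.nn.functional as F
from kernels import get_kernel

# Download the optimized activation kernels from the Hub (cached after the first call)
activation = get_kernel("kernels-community/activation")

x = torch.randn(16, 1024, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

# Assumed entry point: an in-place op taking (output, input); check the repo for the real API
activation.gelu_fast(out, x)

# Use PyTorch's own GELU (tanh approximation) as the reference instead of hard-coded values
expected = F.gelu(x, approximate="tanh")
torch.testing.assert_close(out, expected, rtol=1e-2, atol=1e-2)
print("Kernel output matches PyTorch's GELU reference")
~~~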
hello-hf-kernels.md (Outdated)

## 2. How to Use the Kernel Hub (Basic Example)

Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example loading an optimized GELU activation function kernel (we'll use a different kernel for the main example later).
Suggested change:
- Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example loading an optimized GELU activation function kernel (we'll use a different kernel for the main example later).
+ Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example that loads an optimized GELU activation function kernel. (Later on, we'll see another example about how to integrate a kernel in our model).
thanks this reads better, updated in latest
**Important Notes on the `KernelModel`:**
* **Kernel Inheritance:** The `KernelRMSNorm` class inherits from `layer_norm_kernel_module.layers.LlamaRMSNorm`, which is the RMSNorm implementation in the kernel. This allows us to use the optimized kernel directly.
* **Accessing the Function:** The exact way to access the RMSNorm function (`layer_norm_kernel_module.layers.LlamaRMSNorm.forward`, `layer_norm_kernel_module.rms_norm_forward`, or something else) **depends entirely on how the kernel creator structured the repository on the Hub.** You may need to inspect the loaded `layer_norm_kernel_module` object (e.g., using `dir()`) or check the kernel's documentation on the Hub to find the correct function/method and its signature. I've used `rms_norm_forward` as a plausible placeholder and added error handling.
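A quick sketch of that inspection step, for example (the output varies per kernel repository; the `layers` attribute is how this particular kernel is laid out):

~~~python
from kernels import get_kernel

layer_norm_kernel_module = get_kernel("kernels-community/triton-layer-norm")

# Print the public attributes of the loaded module to discover its functions/classes
print([name for name in dir(layer_norm_kernel_module) if not name.startswith("_")])

# In this kernel, layer classes live under `.layers` (e.g. LlamaRMSNorm),
# but the layout is entirely up to the kernel author.
if hasattr(layer_norm_kernel_module, "layers"):
    print([name for name in dir(layer_norm_kernel_module.layers) if not name.startswith("_")])
~~~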
Would be nice if we can point to some kernel documentation (in the kernel's model card in the Hub) by the time this is published :) This could encourage others to adopt some common structure for kernel description / docs.
agreed! currently there is an effort to generate some useful docs, started here: huggingface/kernel-builder#89. However, this is still a work in progress and should be updated before publishing
TODO
- improve docs across all existing examples (probably autogen)
hello-hf-kernels.md (Outdated)

from snippet2 import BaselineModel
from snippet3 import KernelModel
We should introduce the script name before each snippet, I think.
good point, updated to have meaningful names and use them in the scripts in latest
# Download optimized activation kernels from the Hub
# This fetches the kernel code if not already cached
activation_kernels = get_kernel("kernels-community/activation")
Super cool! Would something like this (different kernel) be automatically resolved? Do we want to talk (in a later section) about what happens if there's no match?
hello-hf-kernels.md (Outdated)

### Benefits of the Kernel Hub:

* **Instant Access to Optimized Kernels**: Load and run kernels optimized for various hardware (like NVIDIA GPUs) without local compilation hassles.
Suggested change:
- * **Instant Access to Optimized Kernels**: Load and run kernels optimized for various hardware (like NVIDIA GPUs) without local compilation hassles.
+ * **Instant Access to Optimized Kernels**: Load and run kernels optimized for various hardware starting with NVIDIA and AMD GPUs, without local compilation hassles.
thanks! updated in the latest commits
hello-hf-kernels.md (Outdated)

~~~bash
pip install kernels torch numpy
~~~

Ensure you have a compatible PyTorch version and CUDA installed if using GPU kernels.
Can we make this hardware agnostic for AMD?
good catch, I've updated the phrasing to avoid "CUDA" in the latest commit
hello-hf-kernels.md (Outdated)

## 1. What is the Kernel Hub?

The [Kernel Hub](https://huggingface.co/kernels) (👈 Check it out!) allows Python libraries and applications to **load optimized compute kernels directly from the Hugging Face Hub**. Think of it like the Model Hub, but for low-level, high-performance code snippets (kernels) that accelerate specific operations, often on GPUs. Examples include optimized attention mechanisms (like FlashAttention), activation functions, and normalization layers (like LayerNorm or RMSNorm).
I think it would be better to mention some challenging kernels here. Activation and normalization kernels are usually pretty good in frameworks. Maybe attention mechanisms, quantizers, and Mixture of Experts layers?
good point, updated to include some more impactful/useful examples. thanks!
hello-hf-kernels.md (Outdated)

# Ensure you have a CUDA-enabled device
if not torch.cuda.is_available():
    raise RuntimeError("This example requires a CUDA-enabled GPU")
Let me upload the activation kernel for ROCm as well. I think the example is stronger if we can show something that works with both CUDA and ROCm.
Built, running validation tests now...
All tests pass.
wooo amazing, thank you!
hello-hf-kernels.md (Outdated)

if not torch.cuda.is_available():
    raise RuntimeError("This example requires a CUDA-enabled GPU")
I think the Triton kernel should also work with ROCm? Worth trying.
awesome, thanks for building/testing! removed `torch.cuda..` in the latest commit
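For anyone who still wants an explicit check, a hardware-agnostic sketch could look like this (PyTorch's ROCm builds expose the same `torch.cuda` API, so one check covers both NVIDIA and AMD GPUs; the post itself simply drops the check):

~~~python
import torch

# torch.cuda.is_available() is True on both CUDA (NVIDIA) and ROCm (AMD) builds of PyTorch
if not torch.cuda.is_available():
    raise RuntimeError("This example requires a GPU (NVIDIA or AMD)")

# torch.version.hip is set on ROCm builds, torch.version.cuda on CUDA builds
backend = "ROCm" if torch.version.hip else "CUDA"
print(f"Running on {backend}: {torch.cuda.get_device_name()}")
~~~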
hello-hf-kernels.md (Outdated)

layer_norm_kernel_module = get_kernel("kernels-community/triton-layer-norm")


class KernelRMSNorm(layer_norm_kernel_module.layers.LlamaRMSNorm):
    def __init__(self, hidden_size, variance_epsilon=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = variance_epsilon
We want people to use `@use_kernel_forward_from_hub` to annotate the Torch class and then register `LlamaRMSNorm` using a mapping. See: https://github.com/huggingface/kernels/blob/main/docs/layers.md

Using `@use_kernel_forward_from_hub` enables people to make layers that are (dynamically) extensible with kernels, people can replace kernels, etc.
ah yea great point! I've updated the code to prefer adding `@use_kernel_forward_from_hub("LlamaRMSNorm")` to the `RMSNorm` defined in the reference example (and added some descriptive comments).
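For context, the pattern described in docs/layers.md looks roughly like this (a sketch; the registration helpers are taken from those docs and may evolve, so treat the exact names as assumptions):

~~~python
import torch
import torch.nn as nn
from kernels import use_kernel_forward_from_hub, register_kernel_mapping, LayerRepository

# Annotate the plain PyTorch reference implementation; its forward can then be
# swapped for a Hub kernel without changing the model code itself.
@use_kernel_forward_from_hub("LlamaRMSNorm")
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, variance_epsilon=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = variance_epsilon

    def forward(self, hidden_states):
        # Reference RMSNorm, used when no kernel mapping is registered
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states

# Map the annotated layer to a kernel from the Hub (this can also be done by the end user)
register_kernel_mapping(
    {
        "LlamaRMSNorm": {
            "cuda": LayerRepository(
                repo_id="kernels-community/triton-layer-norm",
                layer_name="LlamaRMSNorm",
            )
        }
    }
)
~~~

The model definition itself stays plain PyTorch; enabling, swapping, or disabling Hub kernels becomes purely a mapping concern.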
hello-hf-kernels.md (Outdated)

):
    super().__init__()
    self.linear1 = nn.Linear(input_size, hidden_size)
    self.norm = KernelRMSNorm(hidden_size, variance_epsilon=eps)
With `@use_kernel_forward_from_hub`, you don't need this. The model doesn't need any change to use kernels; the model writer or the user can map kernels externally.
this has been updated in the latest commit along with the larger change to prefer using the `use_kernel_forward_from_hub` decorator in the example. thanks!
I was so curious about kernels, great work 🤗
* **Simplify Deployment**: Reduce the complexity of your deployment environment by fetching kernels on demand.
* **Develop and Share Your Own Kernels**: If you create optimized kernels, you can easily share them on the Hub for others to use. This encourages collaboration and knowledge sharing within the community.

> As many machine learning developers know, managing dependencies and building low-level code from source can be a time-consuming and error-prone process. The Kernel Hub aims to simplify this by providing a centralized repository of optimized compute kernels that can be easily loaded and run.
this is a quote format btw!
ahh thanks, updated in latest changes
hello-hf-kernels.md (Outdated)

Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example that loads an optimized GELU activation function kernel. (Later on, we'll see another example about how to integrate a kernel in our model).

File: `activation_validation_example.py`
I'd add a link to the file so it's easy for people to directly check
great point, I've made all of the files gists and added links to them! thanks
hello-hf-kernels.md (Outdated)

## 4. Review Performance Impact

Does using the optimized Triton RMSNorm kernel provide a speedup compared to the basic PyTorch version? Let's benchmark the forward pass again.
Suggested change:
- Does using the optimized Triton RMSNorm kernel provide a speedup compared to the basic PyTorch version? Let's benchmark the forward pass again.
+ Does optimized Triton RMSNorm kernel speeds up compared to the kernel in basic PyTorch? Let's benchmark the forward pass again.

sentence felt a bit hard to read, rephrased (if you feel like it)
good catch, thanks for the suggestion, I ended up rewriting that part to:

> 4. Benchmarking the Performance Impact
>
> How much faster is the optimized Triton RMSNorm kernel compared to the standard PyTorch version? Let's benchmark the forward pass to find out.
>
> File: `rmsnorm_benchmark.py`
>
> ...
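For context, a benchmark like this usually reduces to a timing loop of the following shape (an illustrative sketch; `baseline_model` / `kernel_model` stand in for the models from the earlier snippets):

~~~python
import time
import torch

def benchmark(module, x, warmup=10, iters=100):
    """Average forward-pass time in seconds, with GPU synchronization."""
    with torch.no_grad():
        for _ in range(warmup):  # warm up: kernel compilation, autotuning, caches
            module(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Hypothetical usage, reusing the models defined in the earlier snippets:
# x = torch.randn(4096, 2048, device="cuda", dtype=torch.float16)
# print(f"baseline: {benchmark(baseline_model, x) * 1e3:.3f} ms")
# print(f"kernel:   {benchmark(kernel_model, x) * 1e3:.3f} ms")
~~~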
* Potential overhead for small inputs.

Actual results will depend on your hardware and the specific kernel implementation. Here's an example of what you might see (on an L4 GPU):
thank you! fixed to be a table in the latest changes 🙏
292fba4 to 3d58ec1
Co-authored-by: Merve Noyan <[email protected]>
…s/core contributors and syntax edits
Nicely done! left some suggestions for readability and to hook the devs in a bit more overall!
date: 2025-03-28
---

# 🏎️ Learn the Hugging Face Kernel Hub in 5 Minutes
Smol suggestion:

Suggested change:
- # 🏎️ Learn the Hugging Face Kernel Hub in 5 Minutes
+ # 🏎️ Boost your model performance with high performance via Hugging Face Kernels hub

Just a bit more descriptive title (for an average person not familiar with kernels as much, it might not be as descriptive) - basically repurposed the line below the title.
feel free to ignore the suggestion and make something better, it's just for reference.
**Boost your model performance with pre-optimized kernels, easily loaded from the Hub.**

Today, we'll explore an exciting development from Hugging Face: the **Kernel Hub**! As ML practitioners, we know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub simplifies this process dramatically!
it might be helpful here to put a small code snippet or a benchmark about a kernel from the hub, or some notion of how it'd look from a dev PoV - this usually acts as a good hook and helps the reader visualise what the rest of the blog will allude to.
This could also serve as a nice TL;DR for the blog post.
The [Kernel Hub](https://huggingface.co/kernels-community) (👈 Check it out!) allows Python libraries and applications to **load optimized compute kernels directly from the Hugging Face Hub**. Think of it like the Model Hub, but for low-level, high-performance code snippets (kernels) that accelerate specific operations, often on GPUs.

Examples include advanced attention mechanisms (like [FlashAttention](https://huggingface.co/kernels-community/flash-attn) for dramatic speedups and memory savings). Custom [quantization kernels](https://huggingface.co/kernels-community/quantization) (enabling efficient computation with lower-precision data types like INT8 or INT4). Specialized kernels required for complex architectures like [Mixture of Experts (MoE) layers](https://huggingface.co/kernels-community/moe), which involve intricate routing and computation patterns. As well as [activation functions](https://huggingface.co/kernels-community/activation), and [normalization layers (like LayerNorm or RMSNorm)](https://huggingface.co/kernels-community/triton-layer-norm).
Nice, not sure if this is intended, but some of the kernels have build errors on their README, for ex: https://huggingface.co/kernels-community/flash-attn
3. **Adding a Kernel to a Simple Model** - A practical integration using RMSNorm.
4. **Reviewing Performance Impact** - Benchmarking the RMSNorm difference.

We'll introduce these concepts quickly – the core idea can be grasped in about 5 minutes (though experimenting and benchmarking might take a bit longer!).
Somewhere around here it might be good to mention that we are actually using this in Transformers and TGI with maybe a code pointer.
This would convey the fact that this work is already used in production and integrated in downstream libraries. (or at least allude to it)
Instead of manually managing complex dependencies, wrestling with compilation flags, or building libraries like Triton or CUTLASS from source, you can use the `kernels` library to instantly fetch and run pre-compiled, optimized kernels.
I think you can also give an example here of FA2/FA3 building from source vs using pre-compiled, in terms of time taken / compute required etc.

This would make for a good cementing factor.
4. **Benchmark:** Measure the performance impact on your specific hardware and workload. Don't forget to check for numerical correctness (`torch.testing.assert_close`).

5. **(Advanced) Contribute:** If you develop optimized kernels, consider sharing them on the Hub!
Maybe here open an issue in the kernels repo / kernels org on the Hub so that people can request some kernels as well?
On second thought, an issue / discussion on the kernels org on the Hub would be even better.
This PR is an early draft for an introduction to the kernel hub

TODO
- `kernel-builder` to showcase kernel creation/publishing to the hub