# 🏎️ Learn the Hugging Face Kernel Hub in 5 Minutes
**Boost your model performance with pre-optimized kernels, easily loaded from the Hub.**
Today, we'll explore an exciting development from Hugging Face: the **Kernel Hub**! As ML practitioners, we know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub simplifies this process dramatically!
We'll cover the following topics:
1. **What is the Kernel Hub?** - A quick overview of the idea.
2. **How to Use the Kernel Hub (Basic Example)** - Loading and running a kernel in a few lines.
3. **Adding a Kernel to a Simple Model** - A practical integration using RMSNorm.
4. **Reviewing Performance Impact** - Benchmarking the RMSNorm difference.
We'll introduce these concepts quickly – the core idea can be grasped in about 5 minutes (though experimenting and benchmarking might take a bit longer!).
## 1. What is the Kernel Hub?
The [Kernel Hub](https://huggingface.co/kernels-community) (👈 Check it out!) allows Python libraries and applications to **load optimized compute kernels directly from the Hugging Face Hub**. Think of it like the Model Hub, but for low-level, high-performance code snippets (kernels) that accelerate specific operations, often on GPUs.
Examples include advanced attention mechanisms (like [FlashAttention](https://huggingface.co/kernels-community/flash-attn), which delivers dramatic speedups and memory savings), custom [quantization kernels](https://huggingface.co/kernels-community/quantization) that enable efficient computation with lower-precision data types like INT8 or INT4, specialized kernels for complex architectures like [Mixture of Experts (MoE) layers](https://huggingface.co/kernels-community/moe), which involve intricate routing and computation patterns, as well as [activation functions](https://huggingface.co/kernels-community/activation) and [normalization layers (like LayerNorm or RMSNorm)](https://huggingface.co/kernels-community/triton-layer-norm).
Instead of manually managing complex dependencies, wrestling with compilation flags, or building libraries like Triton or CUTLASS from source, you can use the `kernels` library to instantly fetch and run pre-compiled, optimized kernels.
### Benefits of the Kernel Hub:
* **Instant Access to Optimized Kernels**: Load and run kernels optimized for various hardware, starting with NVIDIA and AMD GPUs, without local compilation hassles.
* **Share and Reuse**: Discover, share, and reuse kernels across different projects and the community.
* **Easy Updates**: Stay up-to-date with the latest kernel improvements simply by pulling the latest version from the Hub.
* **Accelerate Development**: Focus on your model architecture and logic, not on the intricacies of kernel compilation and deployment.
* **Improve Performance**: Leverage kernels optimized by experts to potentially speed up training and inference.
* **Simplify Deployment**: Reduce the complexity of your deployment environment by fetching kernels on demand.
* **Develop and Share Your Own Kernels**: If you create optimized kernels, you can easily share them on the Hub for others to use. This encourages collaboration and knowledge sharing within the community.
> As many machine learning developers know, managing dependencies and building low-level code from source can be a time-consuming and error-prone process. The Kernel Hub aims to simplify this by providing a centralized repository of optimized compute kernels that can be easily loaded and run.
Spend more time building great models and less time fighting build systems!
## 2. How to Use the Kernel Hub (Basic Example)
Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example that loads an optimized GELU activation function kernel. (Later on, we'll see how to integrate a kernel into a model.)
File: `activation_validation_example.py`
~~~python
# /// script
# dependencies = [
#     "numpy",
#     "torch",
#     "kernels",
# ]
# ///

import torch
import torch.nn.functional as F

from kernels import get_kernel

DEVICE = "cuda"

# Make reproducible
torch.manual_seed(42)

# Download optimized activation kernels from the Hub
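activation = get_kernel("kernels-community/activation")

# --- NOTE: the rest of this snippet is a minimal sketch rather than the ---
# --- original code. The kernel function name (`gelu_fast`, writing into ---
# --- a preallocated output tensor) is an assumption; inspect the loaded ---
# --- module with dir(activation) or check the repository on the Hub if  ---
# --- yours is structured differently.                                    ---

# Random input tensor on the GPU
x = torch.randn((16, 16), dtype=torch.float16, device=DEVICE)

# Run the optimized GELU kernel (writes the result into `y`)
y = torch.empty_like(x)
activation.gelu_fast(y, x)

# Validate against PyTorch's reference (tanh-approximated) GELU
expected = F.gelu(x, approximate="tanh")
torch.testing.assert_close(y, expected, rtol=1e-2, atol=1e-2)

print("✅ Optimized kernel output matches F.gelu")
~~~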
* **Kernel Inheritance:** The `KernelRMSNorm` class inherits from `layer_norm_kernel_module.layers.LlamaRMSNorm`, which is the RMSNorm implementation in the kernel. This allows us to use the optimized kernel directly.
* **Accessing the Function:** The exact way to access the RMSNorm function (`layer_norm_kernel_module.layers.LlamaRMSNorm.forward`, `layer_norm_kernel_module.rms_norm_forward`, or something else) **depends entirely on how the kernel creator structured the repository on the Hub.** You may need to inspect the loaded `layer_norm_kernel_module` object (e.g., using `dir()`) or check the kernel's documentation on the Hub to find the correct function/method and its signature (see the short inspection sketch after this list). I've used `rms_norm_forward` as a plausible placeholder and added error handling.
* **Parameters:** We now only define `rms_norm_weight` (no bias), consistent with RMSNorm.
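
For reference, here is a small sketch (not part of the original walkthrough) showing how you might inspect a kernel loaded from the Hub to discover what it exposes. The repository name and the `.layers` attribute are taken from the discussion above; treat them as assumptions and check the kernel's page on the Hub for the authoritative layout.

~~~python
from kernels import get_kernel

# Load the Triton layer-norm kernel referenced above (repository name assumed)
layer_norm_kernel_module = get_kernel("kernels-community/triton-layer-norm")

# List the top-level functions and attributes the kernel module exposes
print(dir(layer_norm_kernel_module))

# Ready-to-use layers, when provided, are typically grouped under `.layers`
if hasattr(layer_norm_kernel_module, "layers"):
    print(dir(layer_norm_kernel_module.layers))
~~~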
Does using the optimized Triton RMSNorm kernel provide a speedup compared to the basic PyTorch version? Let's benchmark the forward pass again.
File: `rmsnorm_benchmark.py`
~~~python
# /// script
# dependencies = [
#     "numpy",
#     "torch",
#     "kernels",
# ]
# ///

import torch

# reuse the models from the previous snippets or copy the class
# definitions here to run this script independently
from rmsnorm_baseline import BaselineModel
from rmsnorm_kernel import KernelModel

DEVICE = "cuda"
DTYPE = torch.float16  # Use float16 for better kernel performance potential
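
# --- NOTE: the original benchmark body is not shown here; the code below ---
# --- is a minimal sketch. The model constructor arguments and tensor     ---
# --- shapes are assumptions; adjust them to match the BaselineModel and  ---
# --- KernelModel definitions from the earlier snippets.                  ---

HIDDEN_SIZE = 4096
BATCH_SIZE = 8
SEQ_LEN = 2048


def benchmark_forward(model: torch.nn.Module, x: torch.Tensor,
                      n_warmup: int = 10, n_iters: int = 100) -> float:
    """Return the average forward-pass time in milliseconds using CUDA events."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters


x = torch.randn(BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE, device=DEVICE, dtype=DTYPE)

baseline_model = BaselineModel(HIDDEN_SIZE).to(DEVICE, dtype=DTYPE).eval()
kernel_model = KernelModel(HIDDEN_SIZE).to(DEVICE, dtype=DTYPE).eval()

print(f"Baseline RMSNorm model: {benchmark_forward(baseline_model, x):.3f} ms / forward")
print(f"Kernel RMSNorm model:   {benchmark_forward(kernel_model, x):.3f} ms / forward")
~~~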
1. **Install the library:**
~~~bash
pip install kernels torch numpy
~~~
Ensure you have a compatible PyTorch version and GPU driver installed.
2. **Browse the Hub:** Explore available kernels on the Hugging Face Hub under the [`kernels` tag](https://huggingface.co/kernels) or within organizations like [`kernels-community`](https://huggingface.co/kernels-community). Look for kernels relevant to your operations (activations, attention, normalization like LayerNorm/RMSNorm, etc.).