
Commit 9045b9f

Authored by mmcky, HumphreyYang, jstac, and claude
ENH: Enable RunsOn GPU support for lecture builds (#437)
* Enable RunsOn GPU support for lecture builds
  - Add scripts/test-jax-install.py to verify JAX/GPU installation
  - Add .github/runs-on.yml with QuantEcon Ubuntu 24.04 AMI configuration
  - Update cache.yml to use RunsOn g4dn.2xlarge GPU runner
  - Update ci.yml to use RunsOn g4dn.2xlarge GPU runner
  - Update publish.yml to use RunsOn g4dn.2xlarge GPU runner
  - Install JAX with CUDA 13 support and Numpyro on all workflows
  - Add nvidia-smi check to verify GPU availability
  This mirrors the setup used in lecture-python.myst repository.
* DOC: Update JAX lectures with GPU admonition and narrative
  - Add standard GPU admonition to jax_intro.md and numpy_vs_numba_vs_jax.md
  - Update introduction in jax_intro.md to reflect GPU access
  - Update conditional GPU language to reflect lectures now run on GPU
  - Following QuantEcon style guide for JAX lectures
* DEBUG: Add hardware benchmark script to diagnose performance
  - Add benchmark-hardware.py with CPU, NumPy, Numba, and JAX benchmarks
  - Works on both GPU (RunsOn) and CPU-only (GitHub Actions) runners
  - Include warm-up vs compiled timing to isolate JIT overhead
  - Add system info collection (CPU model, frequency, GPU detection)
* Add multi-pathway benchmark tests (bare metal, Jupyter, jupyter-book)
* Fix: Add content to benchmark-jupyter.ipynb (was empty)
* Fix: Add benchmark content to benchmark-jupyter.ipynb
* Add JSON output to benchmarks and upload as artifacts
  - Update benchmark-hardware.py to save results to JSON
  - Update benchmark-jupyter.ipynb to save results to JSON
  - Update benchmark-jupyterbook.md to save results to JSON
  - Add CI step to collect and display benchmark results
  - Add CI step to upload benchmark results as artifact
* Fix syntax errors in benchmark-hardware.py
  - Remove extra triple quote at start of file
  - Remove stray parentheses in result assignments
* Sync benchmark scripts with CPU branch for comparable results
  - Copy benchmark-hardware.py from debug/benchmark-github-actions
  - Copy benchmark-jupyter.ipynb from debug/benchmark-github-actions
  - Copy benchmark-jupyterbook.md from debug/benchmark-github-actions
  - Update ci.yml to use matching file names
  The test scripts are now identical between both branches, only the CI workflow differs (runner type and JAX installation).
* ENH: Force lax.scan sequential operation to run on CPU
  Add device=cpu to the qm_jax function decorator to avoid the known XLA limitation where lax.scan with millions of lightweight iterations performs poorly on GPU due to CPU-GPU synchronization overhead. Added explanatory note about this pattern.
  Co-authored-by: HumphreyYang <[email protected]>
* update note
* Add lax.scan profiler to CI for GPU debugging
  - Add scripts/profile_lax_scan.py: Profiles lax.scan performance on GPU vs CPU to investigate the synchronization overhead issue (JAX Issue #2491)
  - Add CI step to run profiler with 100K iterations on RunsOn GPU environment
  - Script supports multiple profiling modes: basic timing, Nsight, JAX profiler, XLA dumps
* Add diagnostic mode to lax.scan profiler
  - Add --diagnose flag that tests time scaling across iteration counts
  - If time scales linearly with iterations (not compute), it proves constant per-iteration overhead (CPU-GPU synchronization)
  - Also add --verbose flag for CUDA/XLA logging
  - Update CI to run with --diagnose flag
* Add Nsight Systems profiling to CI
  - Run nsys profile with 1000 iterations if nsys is available
  - Captures CUDA, NVTX, and OS runtime traces
  - Uploads .nsys-rep file as artifact for visual analysis
  - continue-on-error: true so CI doesn't fail if nsys unavailable
* address @jstac comment
* Improve JAX lecture content and pedagogy
  - Reorganize jax_intro.md to introduce JAX features upfront with clearer structure
  - Expand JAX introduction with bulleted list of key capabilities (parallelization, JIT, autodiff)
  - Add explicit GPU performance notes in vmap sections
  - Enhance vmap explanation with detailed function composition breakdown
  - Clarify memory efficiency tradeoffs between different vmap approaches
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Remove benchmark scripts (moved to QuantEcon/benchmarks)
  - Remove profile_lax_scan.py, benchmark-hardware.py, benchmark-jupyter.ipynb, benchmark-jupyterbook.md
  - Remove profiling/benchmarking steps from ci.yml
  - Keep test-jax-install.py for JAX installation verification
  Benchmark scripts are now maintained in: https://github.com/QuantEcon/benchmarks
* Update lectures/numpy_vs_numba_vs_jax.md
* Add GPU and JAX hardware details to status page
  - Add nvidia-smi output to show GPU availability
  - Add JAX backend check to confirm GPU usage
  - Matches format used in lecture-python.myst

---------

Co-authored-by: HumphreyYang <[email protected]>
Co-authored-by: Humphrey Yang <[email protected]>
Co-authored-by: John Stachurski <[email protected]>
Co-authored-by: Claude <[email protected]>
1 parent a4b89d4 commit 9045b9f

File tree: 7 files changed (+141, -42 lines)


.github/workflows/cache.yml

Lines changed: 11 additions & 1 deletion
@@ -6,7 +6,7 @@ on:
   workflow_dispatch:
 jobs:
   cache:
-    runs-on: ubuntu-latest
+    runs-on: "runs-on=${{ github.run_id }}/family=g4dn.2xlarge/image=quantecon_ubuntu2404/disk=large"
     steps:
       - uses: actions/checkout@v6
       - name: Setup Anaconda
@@ -18,6 +18,16 @@ jobs:
           python-version: "3.13"
           environment-file: environment.yml
           activate-environment: quantecon
+      - name: Install JAX and Numpyro
+        shell: bash -l {0}
+        run: |
+          pip install -U "jax[cuda13]"
+          pip install numpyro
+          python scripts/test-jax-install.py
+      - name: Check nvidia drivers
+        shell: bash -l {0}
+        run: |
+          nvidia-smi
       - name: Build HTML
         shell: bash -l {0}
         run: |

.github/workflows/ci.yml

Lines changed: 10 additions & 1 deletion
@@ -2,7 +2,7 @@ name: Build Project [using jupyter-book]
 on: [pull_request]
 jobs:
   preview:
-    runs-on: ubuntu-latest
+    runs-on: "runs-on=${{ github.run_id }}/family=g4dn.2xlarge/image=quantecon_ubuntu2404/disk=large"
     steps:
       - uses: actions/checkout@v6
         with:
@@ -16,6 +16,15 @@ jobs:
           python-version: "3.13"
          environment-file: environment.yml
           activate-environment: quantecon
+      - name: Check nvidia Drivers
+        shell: bash -l {0}
+        run: nvidia-smi
+      - name: Install JAX and Numpyro
+        shell: bash -l {0}
+        run: |
+          pip install -U "jax[cuda13]"
+          pip install numpyro
+          python scripts/test-jax-install.py
       - name: Install latex dependencies
         run: |
           sudo apt-get -qq update

.github/workflows/publish.yml

Lines changed: 11 additions & 1 deletion
@@ -6,7 +6,7 @@ on:
 jobs:
   publish:
     if: github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags')
-    runs-on: ubuntu-latest
+    runs-on: "runs-on=${{ github.run_id }}/family=g4dn.2xlarge/image=quantecon_ubuntu2404/disk=large"
     steps:
       - name: Checkout
         uses: actions/checkout@v6
@@ -21,6 +21,16 @@ jobs:
           python-version: "3.13"
           environment-file: environment.yml
           activate-environment: quantecon
+      - name: Install JAX and Numpyro
+        shell: bash -l {0}
+        run: |
+          pip install -U "jax[cuda13]"
+          pip install numpyro
+          python scripts/test-jax-install.py
+      - name: Check nvidia drivers
+        shell: bash -l {0}
+        run: |
+          nvidia-smi
       - name: Install latex dependencies
         run: |
           sudo apt-get -qq update
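
All three workflows install the CUDA 13 build of JAX and then run `scripts/test-jax-install.py` (shown in full at the bottom of this diff), which only prints the device list. A stricter variant that fails the job when JAX cannot see the GPU might look like the following sketch; this is a hypothetical hardening, not part of the commit.

```python
# Hypothetical stricter install check (not part of this commit):
# exit non-zero when JAX cannot see a GPU, so the workflow fails fast.
import sys
import jax

devices = jax.devices()
print(f"JAX devices: {devices}")

if not any(d.platform == "gpu" for d in devices):
    sys.exit("No GPU visible to JAX -- check the CUDA 13 wheel and the runner image.")
```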

lectures/jax_intro.md

Lines changed: 26 additions & 26 deletions
@@ -13,6 +13,18 @@ kernelspec:
 
 # JAX
 
+This lecture provides a short introduction to [Google JAX](https://github.com/jax-ml/jax).
+
+JAX is a high-performance scientific computing library that provides
+
+* a NumPy-like interface that can automatically parallelize across CPUs and GPUs,
+* a just-in-time compiler for accelerating a large range of numerical
+  operations, and
+* automatic differentiation.
+
+Increasingly, JAX also maintains and provides more specialized scientific
+computing routines, such as those originally found in SciPy.
+
 In addition to what's in Anaconda, this lecture will need the following libraries:
 
 ```{code-cell} ipython3
@@ -21,28 +33,24 @@ In addition to what's in Anaconda, this lecture will need the following libraries:
 !pip install jax quantecon
 ```
 
-This lecture provides a short introduction to [Google JAX](https://github.com/jax-ml/jax).
-
-Here we are focused on using JAX on the CPU, rather than on accelerators such as
-GPUs or TPUs.
-
-This means we will only see a small amount of the possible benefits from using JAX.
-
-However, JAX seamlessly handles transitions across different hardware platforms.
+```{admonition} GPU
+:class: warning
 
-As a result, if you run this code on a machine with a GPU and a GPU-aware
-version of JAX installed, your code will be automatically accelerated and you
-will receive the full benefits.
+This lecture is accelerated via [hardware](status:machine-details) that has access to a GPU and targets JAX for GPU programming.
 
-For a discussion of JAX on GPUs, see [our JAX lecture series](https://jax.quantecon.org/intro.html).
+Free GPUs are available on Google Colab.
+To use this option, please click on the play icon top right, select Colab, and set the runtime environment to include a GPU.
 
+Alternatively, if you have your own GPU, you can follow the [instructions](https://github.com/google/jax) for installing JAX with GPU support.
+If you would like to install JAX running on the `cpu` only, you can use `pip install jax[cpu]`.
+```
 
 ## JAX as a NumPy Replacement
 
-One of the attractive features of JAX is that, whenever possible, it conforms to
-the NumPy API for array operations.
+One of the attractive features of JAX is that, whenever possible, its array
+processing operations conform to the NumPy API.
 
-This means that, to a large extent, we can use JAX is as a drop-in NumPy replacement.
+This means that, in many cases, we can use JAX as a drop-in NumPy replacement.
 
 Let's look at the similarities and differences between JAX and NumPy.
 
@@ -523,16 +531,9 @@ with qe.Timer():
     jax.block_until_ready(y);
 ```
 
-If you are running this on a GPU the code will run much faster than its NumPy
-equivalent, which ran on the CPU.
-
-Even if you are running on a machine with many CPUs, the second JAX run should
-be substantially faster with JAX.
-
-Also, typically, the second run is faster than the first.
+On a GPU, this code runs much faster than its NumPy equivalent.
 
-(This might not be noticable on the CPU but it should definitely be noticable on
-the GPU.)
+Also, typically, the second run is faster than the first due to JIT compilation.
 
 This is because even built in functions like `jnp.cos` are JIT-compiled --- and the
 first run includes compile time.
@@ -634,8 +635,7 @@ with qe.Timer():
     jax.block_until_ready(y);
 ```
 
-The outcome is similar to the `cos` example --- JAX is faster, especially if you
-use a GPU and especially on the second run.
+The outcome is similar to the `cos` example --- JAX is faster, especially on the second run after JIT compilation.
 
 Moreover, with JAX, we have another trick up our sleeve:
 
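
The rewritten passage above hinges on the point that the first run includes compile time while the second uses the cached kernel. A minimal timing sketch in the lecture's style is below; the array `x` and its size are illustrative assumptions, since the lecture defines its own `x` earlier in the file.

```python
import jax
import jax.numpy as jnp
import quantecon as qe

# Assumed test array -- the lecture defines its own x earlier in the file.
x = jnp.linspace(0, 10, 10_000_000)

with qe.Timer():          # first run: includes JIT compilation of jnp.cos
    y = jnp.cos(x)
    jax.block_until_ready(y)

with qe.Timer():          # second run: the compiled kernel only
    y = jnp.cos(x)
    jax.block_until_ready(y)
```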

lectures/numpy_vs_numba_vs_jax.md

Lines changed: 48 additions & 13 deletions
@@ -48,6 +48,18 @@ tags: [hide-output]
 !pip install quantecon jax
 ```
 
+```{admonition} GPU
+:class: warning
+
+This lecture is accelerated via [hardware](status:machine-details) that has access to a GPU and targets JAX for GPU programming.
+
+Free GPUs are available on Google Colab.
+To use this option, please click on the play icon top right, select Colab, and set the runtime environment to include a GPU.
+
+Alternatively, if you have your own GPU, you can follow the [instructions](https://github.com/google/jax) for installing JAX with GPU support.
+If you would like to install JAX running on the `cpu` only, you can use `pip install jax[cpu]`.
+```
+
 We will use the following imports.
 
 ```{code-cell} ipython3
@@ -317,7 +329,7 @@ with qe.Timer(precision=8):
     z_max = jnp.max(f(x_mesh, y_mesh)).block_until_ready()
 ```
 
-Once compiled, JAX will be significantly faster than NumPy, especially if you are using a GPU.
+Once compiled, JAX is significantly faster than NumPy due to GPU acceleration.
 
 The compilation overhead is a one-time cost that pays off when the function is called repeatedly.
 
@@ -370,23 +382,29 @@ with qe.Timer(precision=8):
     z_max.block_until_ready()
 ```
 
-The execution time is similar to the mesh operation but, by avoiding the large input arrays `x_mesh` and `y_mesh`,
-we are using far less memory.
+By avoiding the large input arrays `x_mesh` and `y_mesh`, this `vmap` version uses far less memory.
+
+When run on a CPU, its runtime is similar to that of the meshgrid version.
 
-In addition, `vmap` allows us to break vectorization up into stages, which is
-often easier to comprehend than the traditional approach.
+When run on a GPU, it is usually significantly faster.
 
-This will become more obvious when we tackle larger problems.
+In fact, using `vmap` has another advantage: It allows us to break vectorization up into stages.
+
+This leads to code that is often easier to comprehend than traditional vectorized code.
+
+We will investigate these ideas more when we tackle larger problems.
 
 
 ### vmap version 2
 
 We can be still more memory efficient using vmap.
 
-While we avoided large input arrays in the preceding version,
+While we avoid large input arrays in the preceding version,
 we still create the large output array `f(x,y)` before we compute the max.
 
-Let's use a slightly different approach that takes the max to the inside.
+Let's try a slightly different approach that takes the max to the inside.
+
+Because of this change, we never compute the two-dimensional array `f(x,y)`.
 
 ```{code-cell} ipython3
 @jax.jit
@@ -399,23 +417,28 @@ def compute_max_vmap_v2(grid):
     return jnp.max(f_vec_max(grid))
 ```
 
-Let's try it
+Here
+
+* `f_vec_x_max` computes the max along any given row
+* `f_vec_max` is a vectorized version that can compute the max of all rows in parallel.
+
+We apply this function to all rows and then take the max of the row maxes.
+
+Let's try it.
 
 ```{code-cell} ipython3
 with qe.Timer(precision=8):
     z_max = compute_max_vmap_v2(grid).block_until_ready()
 ```
 
-
 Let's run it again to eliminate compilation time:
 
 ```{code-cell} ipython3
 with qe.Timer(precision=8):
     z_max = compute_max_vmap_v2(grid).block_until_ready()
 ```
 
-We don't get much speed gain but we do save some memory.
-
+If you are running this on a GPU, as we are, you should see another nontrivial speed gain.
 
 
 ### Summary
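
For readers reconstructing `compute_max_vmap_v2` from the fragments above, here is a self-contained sketch consistent with the names used in the new prose (`f_vec_x_max`, `f_vec_max`). The test function `f` and the grid are placeholders; the lecture's own definitions appear earlier in the file, and its exact composition of the two `vmap` stages may differ.

```python
import jax
import jax.numpy as jnp

# Placeholder test function and grid -- the lecture defines its own earlier in the file.
def f(x, y):
    return jnp.cos(x**2 + y**2) / (1 + x**2 + y**2)

grid = jnp.linspace(-3, 3, 5_000)

@jax.jit
def compute_max_vmap_v2(grid):
    # Max of f(x, y) over all y for one fixed x (one "row").
    f_vec_x_max = lambda x: jnp.max(jax.vmap(lambda y: f(x, y))(grid))
    # Vectorized over x: computes every row max in parallel.
    f_vec_max = jax.vmap(f_vec_x_max)
    # Max of the row maxes -- the full 2D array f(x, y) is never materialized explicitly.
    return jnp.max(f_vec_max(grid))

z_max = compute_max_vmap_v2(grid).block_until_ready()
print(z_max)
```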
@@ -497,7 +520,9 @@ Now let's create a JAX version using `lax.scan`:
 from jax import lax
 from functools import partial
 
-@partial(jax.jit, static_argnums=(1,))
+cpu = jax.devices("cpu")[0]
+
+@partial(jax.jit, static_argnums=(1,), device=cpu)
 def qm_jax(x0, n, α=4.0):
     def update(x, t):
         x_new = α * x * (1 - x)
@@ -509,6 +534,16 @@ def qm_jax(x0, n, α=4.0):
 
 This code is not easy to read but, in essence, `lax.scan` repeatedly calls `update` and accumulates the returns `x_new` into an array.
 
+```{note}
+Sharp readers will notice that we specify `device=cpu` in the `jax.jit` decorator.
+
+The computation consists of many very small `lax.scan` iterations that must run sequentially, leaving little opportunity for the GPU to exploit parallelism.
+
+As a result, kernel-launch overhead tends to dominate on the GPU, making the CPU a better fit for this workload.
+
+Curious readers can try removing this option to see how performance changes.
+```
+
 Let's time it with the same parameters:
 
 ```{code-cell} ipython3
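
The hunk above shows only the opening lines of `qm_jax`. A complete sketch of the CPU-pinned `lax.scan` pattern follows; the scan body is reconstructed from the surrounding prose ("`lax.scan` repeatedly calls `update` and accumulates the returns `x_new` into an array"), so details such as what is carried between iterations may differ from the lecture's version.

```python
import jax
import jax.numpy as jnp
from jax import lax
from functools import partial

cpu = jax.devices("cpu")[0]

# Reconstruction of the full function -- the diff shows only its first lines.
@partial(jax.jit, static_argnums=(1,), device=cpu)
def qm_jax(x0, n, α=4.0):
    def update(x, t):
        x_new = α * x * (1 - x)
        # Return (new carry, value to accumulate into the output array).
        return x_new, x_new
    _, x_seq = lax.scan(update, x0, jnp.arange(n))
    return x_seq

x = qm_jax(0.2, 10_000_000)
print(x[-1])
```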

lectures/status.md

Lines changed: 14 additions & 0 deletions
@@ -31,4 +31,18 @@ and the following package versions
 ```{code-cell} ipython
 :tags: [hide-output]
 !conda list
+```
+
+This lecture series has access to the following GPU:
+
+```{code-cell} ipython
+!nvidia-smi
+```
+
+You can check the backend used by JAX using:
+
+```{code-cell} ipython3
+import jax
+# Check if JAX is using GPU
+print(f"JAX backend: {jax.devices()[0].platform}")
 ```

scripts/test-jax-install.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+import jax
+import jax.numpy as jnp
+
+devices = jax.devices()
+print(f"The available devices are: {devices}")
+
+@jax.jit
+def matrix_multiply(a, b):
+    return jnp.dot(a, b)
+
+# Example usage:
+key = jax.random.PRNGKey(0)
+x = jax.random.normal(key, (1000, 1000))
+y = jax.random.normal(key, (1000, 1000))
+z = matrix_multiply(x, y)
+
+# Now the function is JIT compiled and will likely run on GPU (if available)
+print(z)
+
+devices = jax.devices()
+print(f"The available devices are: {devices}")