Add quizzes for examples (#690)
Originally motivated by creating examples for Mario's scheduling editor
user study, this commit adds three quizzes in the example directory.

---------

Co-authored-by: Yuka Ikarashi <[email protected]>
skeqiqevian and yamaguchi1024 authored Nov 10, 2024
1 parent 2372d0c commit 3d02a16
Showing 16 changed files with 1,087 additions and 0 deletions.
3 changes: 3 additions & 0 deletions examples/README.md
@@ -8,3 +8,6 @@ If you are new to Exo, we recommend going through the examples in the following
2. [Cursor](./cursors/README.md): This example shows how to use Cursors to efficiently write schedules and define a new scheduling operator.

3. [RVM](./rvm_conv1d/README.md): This example illustrates how to use Exo to define and target a new hardware accelerator entirely in the user code.

4. Quizzes ([quiz1](./quiz1/README.md), [quiz2](./quiz2/README.md), [quiz3](./quiz3/README.md)): each quiz presents a common Exo scheduling mistake together with a solution that fixes it. Debugging broken schedules is one of the best ways to learn the language.

1 change: 1 addition & 0 deletions examples/quiz1/.gitignore
@@ -0,0 +1 @@
quiz1/
59 changes: 59 additions & 0 deletions examples/quiz1/README.md
@@ -0,0 +1,59 @@
# Quiz 1

Throughout the quiz, we provide incorrect code and the correct output as a reference. Your goal is to understand the code and fix the bug to match the correct output!

You can execute `quiz1.py` by running `exocc quiz1.py`. Without modification, it will show the incorrect output.
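For example (a sketch; this assumes `exocc` is on your `PATH` and that you run it from `examples/quiz1/` — that `exocc` writes its generated output into a `quiz1/` directory is inferred from the `.gitignore` entry below):

```bash
# Run the scheduling script; the (incorrect) rewritten proc is printed to stdout.
exocc quiz1.py
```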

## Incorrect Output

The following output is incorrect because it never calls the vector intrinsics. Although it matches the structure of SIMD vector code, it still executes one element at a time:

```python
def double(N: size, inp: f32[N] @ DRAM, out: f32[N] @ DRAM):
    assert N % 8 == 0
    two_vec: R[8] @ DRAM
    for ii in seq(0, 8):
        two_vec[ii] = 2.0
    for io in seq(0, N / 8):
        out_vec: f32[8] @ DRAM
        inp_vec: f32[8] @ DRAM
        for i0 in seq(0, 8):
            inp_vec[i0] = inp[i0 + 8 * io]
        for ii in seq(0, 8):
            out_vec[ii] = two_vec[ii] * inp_vec[ii]
        for i0 in seq(0, 8):
            out[i0 + 8 * io] = out_vec[i0]
```

## Correct Output

The correct output uses the vector intrinsics to perform the arithmetic eight elements at a time across the array:

```python
def double(N: size, inp: f32[N] @ DRAM, out: f32[N] @ DRAM):
    assert N % 8 == 0
    two_vec: R[8] @ AVX2
    vector_assign_two(two_vec[0:8])
    for io in seq(0, N / 8):
        out_vec: f32[8] @ AVX2
        inp_vec: f32[8] @ AVX2
        vector_load(inp_vec[0:8], inp[8 * io + 0:8 * io + 8])
        vector_multiply(out_vec[0:8], two_vec[0:8], inp_vec[0:8])
        vector_store(out[8 * io + 0:8 * io + 8], out_vec[0:8])
```

---

## Solution

Before calling `replace_all(p, avx_instrs)`, you need to set buffer memory annotations to AVX2, because `replace_all` is memory-aware and will only replace code chunks with instructions that have matching memory annotations.

Add the following code before the call to `replace_all`:

```python
# Set the memory types to be AVX2 vectors
for name in ["two", "out", "inp"]:
    p = set_memory(p, f"{name}_vec", AVX2)
```

This will ensure that the memory annotations are correctly set to AVX2 before replacing the code with vector intrinsics.
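For reference, a sketch of how the fix slots into the tail of `wrong_schedule` in `quiz1.py` (only the `set_memory` loop is new; the surrounding lines are unchanged from the file below):

```python
    # Set the memory types to be AVX2 vectors so the staged buffers match
    # the AVX2 annotations on the instruction procs
    for name in ["two", "out", "inp"]:
        p = set_memory(p, f"{name}_vec", AVX2)

    # replace_all can now rewrite the loops with the vector intrinsics
    avx_instrs = [vector_assign_two, vector_multiply, vector_load, vector_store]
    p = replace_all(p, avx_instrs)
```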
84 changes: 84 additions & 0 deletions examples/quiz1/quiz1.py
@@ -0,0 +1,84 @@
from __future__ import annotations

from exo import *
from exo.libs.memories import AVX2
from exo.stdlib.scheduling import *


@instr("{dst_data} = _mm256_loadu_ps(&{src_data});")
def vector_load(dst: [f32][8] @ AVX2, src: [f32][8] @ DRAM):
    assert stride(src, 0) == 1
    assert stride(dst, 0) == 1

    for i in seq(0, 8):
        dst[i] = src[i]


@instr("_mm256_storeu_ps(&{dst_data}, {src_data});")
def vector_store(dst: [f32][8] @ DRAM, src: [f32][8] @ AVX2):
    assert stride(src, 0) == 1
    assert stride(dst, 0) == 1

    for i in seq(0, 8):
        dst[i] = src[i]


@instr("{out_data} = _mm256_mul_ps({x_data}, {y_data});")
def vector_multiply(out: [f32][8] @ AVX2, x: [f32][8] @ AVX2, y: [f32][8] @ AVX2):
    assert stride(out, 0) == 1
    assert stride(x, 0) == 1
    assert stride(y, 0) == 1

    for i in seq(0, 8):
        out[i] = x[i] * y[i]


@instr("{out_data} = _mm256_broadcast_ss(2.0);")
def vector_assign_two(out: [f32][8] @ AVX2):
    assert stride(out, 0) == 1

    for i in seq(0, 8):
        out[i] = 2.0


@proc
def vec_double(N: size, inp: f32[N], out: f32[N]):
    assert N % 8 == 0
    for i in seq(0, N):
        out[i] = 2.0 * inp[i]


def wrong_schedule(p):
    """
    Forgot to set the memory types to be AVX2 vectors, so the replace
    instruction does not work as intended.
    """
    p = rename(p, "vec_double_optimized")
    p = divide_loop(p, "i", 8, ["io", "ii"], perfect=True)

    # Create a vector of twos
    p = bind_expr(p, "2.0", "two_vec")
    two_alloc = p.find("two_vec: _")
    two_assign = p.find("two_vec = _")
    p = expand_dim(p, two_alloc, 8, "ii")

    # Hoist the allocation and assignment of the two vector
    p = lift_alloc(p, two_alloc, 2)
    p = fission(p, two_assign.after(), 2)
    p = remove_loop(p, two_assign.parent().parent())

    # Create vectors for the input and output values
    innermost_loop = p.find_loop("ii #1")
    p = stage_mem(p, innermost_loop, "out[8*io:8*io+8]", "out_vec")
    p = stage_mem(p, innermost_loop, "inp[8*io:8*io+8]", "inp_vec")
    p = simplify(p)

    # Replace with AVX instructions
    avx_instrs = [vector_assign_two, vector_multiply, vector_load, vector_store]
    p = replace_all(p, avx_instrs)

    return p


w = wrong_schedule(vec_double)
print(w)
1 change: 1 addition & 0 deletions examples/quiz2/.gitignore
@@ -0,0 +1 @@
quiz2/
101 changes: 101 additions & 0 deletions examples/quiz2/README.md
@@ -0,0 +1,101 @@
# Quiz 2

This quiz is about loop fission bugs and debugging via printing cursors.

## Incorrect output (compiler error)
As written, the schedule has a bug: it attempts to fission a loop at a point where fission is unsafe.
```bash
Traceback (most recent call last):
  File "/home/yuka/.local/bin/exocc", line 8, in <module>
    sys.exit(main())
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/main.py", line 55, in main
    library = [
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/main.py", line 58, in <listcomp>
    for proc in get_procs_from_module(load_user_code(mod))
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/main.py", line 107, in load_user_code
    loader.exec_module(user_module)
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/yuka/exo/examples/quiz2/quiz2.py", line 42, in <module>
    w = wrong_schedule(scaled_add)
  File "/home/yuka/exo/examples/quiz2/quiz2.py", line 38, in wrong_schedule
    p = fission(p, vector_assign.after())
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/API_scheduling.py", line 100, in __call__
    return self.func(*bound_args.args, **bound_args.kwargs)
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/API_scheduling.py", line 2066, in fission
    ir, fwd = scheduling.DoFissionAfterSimple(
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/rewrite/LoopIR_scheduling.py", line 2385, in DoFissionAfterSimple
    alloc_check(pre, post)
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/rewrite/LoopIR_scheduling.py", line 2352, in alloc_check
    raise SchedulingError(
exo.rewrite.new_eff.SchedulingError: <<<unknown directive>>>: Will not fission here, because doing so will hide the allocation of vec from a later use site.
```
## Correct Output
The correct output will divide the computation into individual, vectorizable loops.
```python
def scaled_add_scheduled(N: size, a: f32[N] @ DRAM, b: f32[N] @ DRAM,
                         c: f32[N] @ DRAM):
    assert N % 8 == 0
    for io in seq(0, N / 8):
        vec: R[8] @ DRAM
        vec_1: R[8] @ DRAM
        vec_2: f32[8] @ DRAM
        vec_3: R[8] @ DRAM
        vec_4: R[8] @ DRAM
        vec_5: f32[8] @ DRAM
        for ii in seq(0, 8):
            vec_1[ii] = 2
        for ii in seq(0, 8):
            vec_2[ii] = a[8 * io + ii]
        for ii in seq(0, 8):
            vec[ii] = vec_1[ii] * vec_2[ii]
        for ii in seq(0, 8):
            vec_4[ii] = 3
        for ii in seq(0, 8):
            vec_5[ii] = b[8 * io + ii]
        for ii in seq(0, 8):
            vec_3[ii] = vec_4[ii] * vec_5[ii]
        for ii in seq(0, 8):
            c[8 * io + ii] = vec[ii] + vec_3[ii]
```
---
## Solution
To understand the bug, let's first try printing right before the error. Add the following line after line 37:
```python
print(vector_assign.after())
```
This will output:
```python
for io in seq(0, N / 8):
    vec: R[8] @ DRAM
    for ii in seq(0, 8):
        vec_1: R @ DRAM
        vec_1 = 2
        [GAP - After]
    ...
```
The code attempts to fission at the `[GAP - After]` location.
This is unsafe because the `vec_1: R` allocation sits inside the `ii` loop, before the fission point:
if `vec_1` were used after the fission point, the resulting code would no longer be valid Exo.
To fix this issue, modify the code as follows:
```python
for i in range(num_vectors):
    vector_reg = p.find(f"vec: _ #{i}")
    p = expand_dim(p, vector_reg, 8, "ii")
    p = lift_alloc(p, vector_reg)

for i in range(num_vectors):
    vector_assign = p.find(f"vec = _ #{i}")
    p = fission(p, vector_assign.after())
```
By separating the allocation lifting and the fission into two separate loops, you ensure that all of the allocations are lifted out of the loop before any fission is performed. This resolves the unsafe fission caused by an allocation remaining inside the loop.
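Applied to `wrong_schedule` in `quiz2.py`, the corrected tail of the function reads as follows (a sketch; only the loop structure changes relative to the file below):

```python
    # First pass: widen each staged vector and lift every allocation
    # out of the ii loop before any fission happens.
    for i in range(num_vectors):
        vector_reg = p.find(f"vec: _ #{i}")
        p = expand_dim(p, vector_reg, 8, "ii")
        p = lift_alloc(p, vector_reg)

    # Second pass: fission is now safe, since no allocation remains
    # between a fission point and a later use.
    for i in range(num_vectors):
        vector_assign = p.find(f"vec = _ #{i}")
        p = fission(p, vector_assign.after())
```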
46 changes: 46 additions & 0 deletions examples/quiz2/quiz2.py
@@ -0,0 +1,46 @@
from __future__ import annotations

from exo import *
from exo.stdlib.scheduling import *


@proc
def scaled_add(N: size, a: f32[N], b: f32[N], c: f32[N]):
    assert N % 8 == 0
    for i in seq(0, N):
        c[i] = 2 * a[i] + 3 * b[i]


def stage_exprs(p, num_vectors, assign):
    if isinstance(assign.rhs(), BinaryOpCursor):
        p = bind_expr(p, assign.rhs().lhs(), "vec")
        num_vectors += 1
        p, num_vectors = stage_exprs(p, num_vectors, p.forward(assign).prev())

        p = bind_expr(p, assign.rhs().rhs(), "vec")
        num_vectors += 1
        p, num_vectors = stage_exprs(p, num_vectors, p.forward(assign).prev())
    return p, num_vectors


def wrong_schedule(p):
    p = rename(p, "scaled_add_scheduled")
    num_vectors = 0

    p = divide_loop(p, "i", 8, ["io", "ii"], perfect=True)

    p, num_vectors = stage_exprs(p, num_vectors, p.find("c[_] = _"))

    for i in range(num_vectors):
        vector_reg = p.find(f"vec: _ #{i}")
        p = expand_dim(p, vector_reg, 8, "ii")
        p = lift_alloc(p, vector_reg)

        vector_assign = p.find(f"vec = _ #{i}")
        p = fission(p, vector_assign.after())

    return p


w = wrong_schedule(scaled_add)
print(w)
1 change: 1 addition & 0 deletions examples/quiz3/.gitignore
@@ -0,0 +1 @@
quiz3/
89 changes: 89 additions & 0 deletions examples/quiz3/README.md
@@ -0,0 +1,89 @@
# Quiz 3

This quiz explores fixing subtle cursor navigation bugs.

## Correct Output
This code shrinks the `blur_x` memory allocation from `(H+2, W)` to `(34, 256)`. Because the code has been tiled, we do not need to keep the entire intermediate `blur_x` buffer in memory; the same intermediate buffer can be reused for each tile.

To do so, the schedule sinks the allocation into the tile loops, shrinks the buffer to the minimum size needed to compute one tile, and then lifts the allocation back up to the top-level scope.
```python
def tile_and_fused_blur(W: size, H: size, blur_y: ui16[H, W] @ DRAM,
                        inp: ui16[H + 2, W + 2] @ DRAM):
    assert H % 32 == 0
    assert W % 256 == 0
    blur_x: ui16[34, 256] @ DRAM
    for yo in seq(0, H / 32):
        for xo in seq(0, W / 256):
            for yi in seq(0, 34):
                for xi in seq(0, 256):
                    blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo] = (
                        inp[yi + 32 * yo, xi + 256 * xo] +
                        inp[yi + 32 * yo, 1 + xi + 256 * xo] +
                        inp[yi + 32 * yo, 2 + xi + 256 * xo]) / 3.0
            for yi in seq(0, 32):
                for xi in seq(0, 256):
                    blur_y[yi + 32 * yo, xi + 256 * xo] = (
                        blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo] +
                        blur_x[1 + yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo] +
                        blur_x[2 + yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo]) / 3.0
```

## Incorrect Output
This output is partially correct: it manages to reduce the height dimension from `H+2` to `34`. However, it fails to reduce the memory usage in the width direction.
```python
def tile_and_fused_blur(W: size, H: size, blur_y: ui16[H, W] @ DRAM,
                        inp: ui16[H + 2, W + 2] @ DRAM):
    assert H % 32 == 0
    assert W % 256 == 0
    blur_x: ui16[34, W] @ DRAM
    for yo in seq(0, H / 32):
        for xo in seq(0, W / 256):
            for yi in seq(0, 34):
                for xi in seq(0, 256):
                    blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo] = (
                        inp[yi + 32 * yo, xi + 256 * xo] +
                        inp[yi + 32 * yo, 1 + xi + 256 * xo] +
                        inp[yi + 32 * yo, 2 + xi + 256 * xo]) / 3.0
            for yi in seq(0, 32):
                for xi in seq(0, 256):
                    blur_y[yi + 32 * yo, xi + 256 * xo] = (
                        blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo] +
                        blur_x[1 + yi + 32 * yo - 32 * yo, xi + 256 * xo] +
                        blur_x[2 + yi + 32 * yo - 32 * yo, xi + 256 * xo]) / 3.0
```

---

## Solution

To understand the bug, let's insert print statements in these places:

```python
print(xo_loop)
loops_to_lower_allocation_into = get_loops_at_or_above(xo_loop)
for i, loop in enumerate(loops_to_lower_allocation_into):
    print(i, loop)
    ...
```

The `xo_loop` points to:
```python
for yo in seq(0, H / 32):
    for xo in seq(0, W / 256): # <-- NODE
        ...
```

And on the first (and only) iteration, `loop` points to:
```python
for yo in seq(0, H / 32): # <-- NODE
    for xo in seq(0, W / 256):
        ...
```

This reveals the bug in `get_loops_at_or_above`: the list it returns contains only the loops *above* `xo_loop` (namely the `yo` loop) and omits `xo_loop` itself.

To fix this bug, change `loops = []` to `loops = [cursor]` in the implementation of `get_loops_at_or_above`.
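`quiz3.py` itself is not part of the diff shown here, so the following is only a sketch of what the fixed helper could look like; the cursor-walking details and the use of `InvalidCursor`/`ForCursor` from `exo.API_cursors` are assumptions, not the actual implementation:

```python
def get_loops_at_or_above(cursor):
    # The fix: seed the list with the starting loop itself
    # (the buggy version started from `loops = []`).
    loops = [cursor]
    node = cursor.parent()
    # Walk upward, collecting every enclosing loop.
    while not isinstance(node, InvalidCursor):
        if isinstance(node, ForCursor):
            loops.append(node)
        node = node.parent()
    # Outermost loop first, matching the iteration order printed above.
    return list(reversed(loops))
```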
