This artifact accompanies *SIMT-Step Execution: A Flexible Operational Semantics for GPU Subgroup Behavior*. It packages:
- an executable TLA+ model under `forward-progress/validation/`
- a SPIR-V-to-TLA+ compiler under `Homunculus/`
- basic pipeline-check shaders under `example_shader_program/`
- litmus shaders under `litmus_tests/`
- empirical Amber suites under `empirical_tests/`
This README is organized in the two parts expected by artifact evaluation:
- Part I. Getting Started Guide
- Part II. Step-by-Step Instructions
Generated outputs are written under build/. The build/ directory is not checked into git, so the files described below are generated on demand rather than shipped as precomputed logs.
A reviewer who follows the Getting Started Guide should be able to confirm that the artifact can:
- translate a shader into a program-specific TLA+ model
- run TLC on the generated model
- expose the main files that correspond to the paper's operational semantics
Required for the basic semantics pipeline:
- Docker Engine
- Git
- Bash shell
Additional requirements for the empirical GPU experiments in Part II:
- a Linux machine where a Docker container can access a Vulkan-capable GPU
- the documented host setups in this artifact are:
  - Intel/AMD via `/dev/dri/renderD*`
  - NVIDIA via `nvidia-smi` and the NVIDIA Container Toolkit configured for Docker
Support boundary:
- The main TLA+ pipeline in `scripts/docker-run-tlaplus.sh` only requires Docker.
- The empirical Amber pipeline in `scripts/docker-run-empirical-tests.sh` is documented only for Linux GPU hosts.
- macOS, Windows, and WSL are not documented for the empirical GPU runs.
- Podman is not the documented or validated path for this artifact.
Before continuing, unpack the artifact or clone the repository, then enter the repository root.
Run:
```sh
docker version
docker run hello-world
```

If Docker reports "permission denied" while trying to connect to the Docker daemon socket, either:
- rerun the helper scripts with `sudo`, or
- configure Docker for non-root use and then re-login before retrying
The helper scripts now fail early with a clearer message when Docker is installed but the daemon socket is not accessible to the current user.
Run one minimal end-to-end pipeline check:
```sh
scripts/docker-run-tlaplus.sh --input example_shader_program/synchronization/cm.comp --out text
```

What this command does:
- builds the main Docker image if needed
- compiles the input shader to SPIR-V
- translates the SPIR-V program into a program-specific TLA+ module
- runs TLC on the generated model
Expected generated outputs under `build/`:
- `MCProgram.tla`
- `output.txt`
- `spirv-asm.txt`
- `example_shader_program/synchronization/cm.comp.spv`
Success criteria:
- the command finishes without a Docker error
- `build/MCProgram.tla` exists
- `build/output.txt` exists and contains TLC output
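These criteria can be checked mechanically. A minimal sketch, where the `check_outputs` helper name is ours rather than part of the artifact, demonstrated on a stub directory rather than a real `build/`:

```sh
# Report whether a build directory contains the expected non-empty outputs.
check_outputs() {
    dir="$1"
    for f in MCProgram.tla output.txt; do
        if [ -s "$dir/$f" ]; then
            echo "OK: $f"
        else
            echo "MISSING: $f"
        fi
    done
}

# Demonstration on a stub directory; a real run would pass ./build instead.
tmp=$(mktemp -d)
echo 'Scheduler == "HSA"' > "$tmp/MCProgram.tla"
echo 'Finished computing initial states' > "$tmp/output.txt"
result=$(check_outputs "$tmp")
echo "$result"
rm -r "$tmp"
```

After a real pipeline run, `check_outputs build` gives the same pass/fail view of the success criteria.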
Notes:
- The first run may take a while because it builds the Docker image.
- `build/MCProgram.tla` is overwritten on each new pipeline run.
The quickest useful inspection path is:

- Confirm that the shader configuration reached the generated model. Open `build/MCProgram.tla` and check the top-level assignments:
  - `Scheduler == ...`
  - `Synchronization == "CM"` for the command above
  - `NumWorkGroups == ...`
  - `WorkGroupSize == ...`
  - `SubgroupSize == ...`

  These values should match the GLSL-side annotations in the shader and show that the frontend emitted a program-specific TLA+ instance rather than a fixed template.
- Confirm that the frontend generated program content, not just parameters. In `build/MCProgram.tla`, inspect:
  - `ThreadInstructions == ...`
  - `ThreadArguments == ...`
  - `Blocks == ...`

  These definitions are the concrete thread instruction stream, operands, and control-flow graph extracted from the input shader.
- Confirm where the generated program connects to the hand-written semantics. Cross-check the generated `build/MCProgram.tla` with:
  - `forward-progress/validation/MCThreads.tla`, where `ExecuteInstruction` dispatches instruction handlers
  - `forward-progress/validation/MCProgressModel.tla`, where `Init`, `Step`, and `Next` assemble the model checked by TLC
- Confirm that TLC actually ran on the generated module. In `build/output.txt`, look for:
  - `Parsing file /workdir/forward-progress/validation/MCProgram.tla`
  - `Finished computing initial states`
  - a completed TLC exploration outcome
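The spot checks above can be scripted with `grep`. A sketch, run here against stub files whose contents are illustrative; after a real run, point it at `build/MCProgram.tla` and `build/output.txt` instead:

```sh
# Spot-check the configuration assignments and the TLC log evidence.
# Stub files stand in for build/MCProgram.tla and build/output.txt.
module=$(mktemp)
log=$(mktemp)
cat > "$module" <<'EOF'
Scheduler == "HSA"
Synchronization == "CM"
NumWorkGroups == 1
EOF
cat > "$log" <<'EOF'
Parsing file /workdir/forward-progress/validation/MCProgram.tla
Finished computing initial states
EOF

# Pull out the top-level configuration assignments.
config=$(grep -E '^(Scheduler|Synchronization|NumWorkGroups|WorkGroupSize|SubgroupSize) ==' "$module")
echo "$config"

# Require both pieces of TLC evidence before declaring success.
verdict="TLC evidence missing"
if grep -q 'Parsing file .*MCProgram.tla' "$log" && grep -q 'Finished computing initial states' "$log"; then
    verdict="TLC ran on the generated module"
fi
echo "$verdict"
rm "$module" "$log"
```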
At this point, a reviewer has completed the basic pipeline-validation path.
If you have a supported Linux GPU environment and want a quick artifact-level experiment after the basic pipeline check, run the evaluator-sized empirical subset:
```sh
scripts/docker-run-empirical-tests.sh
```

Expected runtime:
- about 5-8 minutes with a prebuilt image
- about 12-16 minutes on the first run if the empirical Docker image must be built
Expected generated outputs under `build/empirical_evaluation_results/`:
- `summary.md`
- `summary.csv`
- `all_results.csv`
- `vulkaninfo_summary.txt`
- `<suite>/*.log`
Warnings that are usually safe to ignore in this headless/containerized setting:
- `'DISPLAY' environment variable not set... skipping surface info`
- `error: XDG_RUNTIME_DIR not set in the environment.`
- `terminator_CreateInstance: Failed to CreateInstance in ICD 0. Skipping ICD.`
- `llvmpipe` appearing in `vulkaninfo --summary` alongside a real hardware GPU
These warnings are not fatal by themselves. What matters is that:
- the run produces the output files above
- `vulkaninfo_summary.txt` lists a real hardware GPU
- suite logs are written under `build/empirical_evaluation_results/`
This part is for evaluators and researchers who want to reproduce the artifact's experiments, connect them back to the paper, and inspect the implementation in more depth.
The artifact supports the following paper claims by providing executable models, bundled litmus suites, orientation examples, and reproduction scripts.
| Claim supported by the artifact | Paper connection | How to evaluate it |
|---|---|---|
| The four direct SIMT-Step models implemented in the artifact can be executed as TLA+ semantics and checked with TLC. | Sec. 4, Sec. 5, and Table 1 | Run scripts/docker-run-tlaplus.sh --litmus-tests and inspect build/litmus_tests_result/summary.txt together with the per-test TLC logs under build/litmus_tests_result/. |
| Dynamic blocks, branching, merge handling, and reconvergence are represented explicitly in the artifact. | Sec. 3 and Sec. 4 | Inspect forward-progress/validation/MCProgram.tla and forward-progress/validation/MCThreads.tla, then compare the generated build/MCProgram.tla after the basic pipeline check in Section 3. |
| The direct models CM, SM, SCF, and SSO from Table 1 are encoded in the executable model and exercised by the bundled litmus suite. | Sec. 4, Sec. 5, and Table 1 | Run scripts/docker-run-tlaplus.sh --litmus-tests on the shaders in litmus_tests/ and inspect build/litmus_tests_result/summary.txt. |
| The artifact provides an end-to-end toolchain from GLSL input to SPIR-V, to generated TLA+, to TLC output. | Sec. 5.1 and Fig. 12 | Inspect scripts/docker-run-tlaplus.sh, Dockerfile, and Homunculus/, then reproduce the basic pipeline check from Section 3. |
| The bundled Amber suites exercise the paper-aligned direct-model test families for CM, SM, SCF, and SSO. | Sec. 5.2, Fig. 2, Fig. 3, and Fig. 9 | Run scripts/docker-run-empirical-tests.sh for the evaluation subset or --full for the larger campaign, then inspect the generated summaries and logs in build/. |
The artifact does not by itself establish the following broader paper claims:
- The full cross-device empirical conclusion from Sec. 6.2.
  - The paper's study spans nine GPUs from seven vendors and totals over 700 hours of testing.
  - Running the artifact on a single evaluator machine provides partial reproduction of that evidence, not the full study.
- The weaker `Spec` and `Symb` models from Table 1.
  - Sec. 5 states that the tool suite focuses on the first four direct models.
- Broad empirical portability across all OSes and GPU environments.
  - The documented empirical path is Linux-only and requires a Vulkan-capable GPU environment.
Run:
```sh
scripts/docker-run-tlaplus.sh --input example_shader_program/synchronization/cm.comp --out text
```

This is the smallest end-to-end reproduction of the artifact's main semantics pipeline. It is the recommended getting-started path for confirming that the toolchain works on your machine.
Generated outputs:
- `build/MCProgram.tla`
- `build/output.txt`
- `build/spirv-asm.txt`
- `build/example_shader_program/synchronization/cm.comp.spv`
Interpretation:
- `build/MCProgram.tla` is the generated TLA+ module specialized to the input shader.
- `build/output.txt` is TLC's log for that generated model.
The repository also includes one small orientation shader for each model in the paper:
| Shader | Paper connection | Behavior to inspect |
|---|---|---|
| `example_shader_program/synchronization/cm.comp` | CM / Fig. 3 | collective memory behavior |
| `example_shader_program/synchronization/sm.comp` | SM / Fig. 2 | synchronous memory behavior |
| `example_shader_program/synchronization/scf.comp` | SCF / Fig. 9 without subgroup op | synchronous control-flow behavior |
| `example_shader_program/synchronization/sso.comp` | SSO / Fig. 9 with subgroup op | subgroup-operation synchronization behavior |
Run all four:

```sh
for model in cm sm scf sso; do
    scripts/docker-run-tlaplus.sh --input "example_shader_program/synchronization/${model}.comp" --out text
done
```

If you want the TLC graph dump as well, use:

```sh
scripts/docker-run-tlaplus.sh --input example_shader_program/synchronization/cm.comp --out dot
```

or:

```sh
scripts/docker-run-tlaplus.sh --input example_shader_program/synchronization/cm.comp --out all
```

Additional generated outputs for `dot` or `all`:
- `build/output.dot`
The table below makes the manual inspection steps explicit.
| Claim to inspect | File | What to check |
|---|---|---|
| The shader annotations were propagated into the model instance. | `build/MCProgram.tla` | `Scheduler == ...`, `Synchronization == ...`, `NumWorkGroups == ...`, `WorkGroupSize == ...`, `SubgroupSize == ...` |
| The frontend generated concrete instructions and arguments for this shader. | `build/MCProgram.tla` | `ThreadInstructions == ...` and `ThreadArguments == ...` |
| The frontend generated a control-flow graph. | `build/MCProgram.tla` | `Blocks == ...` and `MergeBlocks == ...` |
| The selected synchronization model changes instruction classification as in Table 1. | `forward-progress/validation/MCProgram.tla` | `CollectiveInstructionSet`, `SynchronousInstructionSet`, `IsCollectiveInstruction`, and `IsSynchronousInstruction` |
| The handwritten operational semantics dispatch on those classifications. | `forward-progress/validation/MCThreads.tla` | `ExecuteInstruction` and the collective/synchronous variants of atomic and branch handlers |
| TLC checked the generated model instance. | `build/output.txt` | `Parsing file ... MCProgram.tla`, `Finished computing initial states`, and a completed TLC outcome |
For the four synchronization examples in Section 3.2, the quickest expected check is:
| Shader | Minimal expected inspection result |
|---|---|
| `cm.comp` | `Synchronization == "CM"` and memory operations are treated as collective by the model classification |
| `sm.comp` | `Synchronization == "SM"` and memory operations are treated as synchronous rather than collective |
| `scf.comp` | `Synchronization == "SCF"` and branch/control-flow instructions are collective while memory operations are not |
| `sso.comp` | `Synchronization == "SSO"` and only subgroup instructions are collective |
The paper-facing TLA+ evaluation path is the bundled litmus suite under litmus_tests/.
This is the main TLA+ path corresponding to the litmus tests discussed in the paper.
Unlike the basic pipeline check in Section 3, this path runs the collection of litmus shaders and compares each TLC outcome against the expectation registered in the runner.
Run:
```sh
scripts/docker-run-tlaplus.sh --litmus-tests
```

Generated outputs:
- `build/litmus_tests_result/summary.txt`
- `build/litmus_tests_result/<test>.txt`

How to interpret the output:
- `summary.txt` records one line per litmus shader plus a final aggregate line.
- `PASS` means the observed TLC outcome matched an expected passing test.
- `XFAIL` means the observed TLC outcome matched an expected failing test.
- `FAIL` or `XPASS` means the observed outcome differed from the registered expectation, and the script exits non-zero.
- Each `<test>.txt` file contains the raw TLC log for that specific litmus shader.
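A status tally over such a summary can be sketched as follows; the stub lines and the `STATUS: test_name` layout are assumptions, so adjust the parsing if the real `summary.txt` format differs:

```sh
# Tally statuses from a litmus summary of the assumed form "STATUS: test_name".
# The stub content below is illustrative; real runs write
# build/litmus_tests_result/summary.txt.
summary=$(mktemp)
cat > "$summary" <<'EOF'
PASS: cm_example
XFAIL: sso_example
PASS: sm_example
EOF
tally=$(cut -d: -f1 "$summary" | sort | uniq -c | sort -rn)
echo "$tally"
rm "$summary"
```

A non-empty `FAIL` or `XPASS` count in such a tally corresponds to the non-zero exit described above.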
If you want a suite that distinguishes the executable Plain and RA memory models, and your checkout includes the optional model-diff litmus suite, run:
```sh
scripts/docker-run-tlaplus.sh --litmus-tests --litmus-suite model-diff --memory-model plain
scripts/docker-run-tlaplus.sh --litmus-tests --litmus-suite model-diff --memory-model ra
```

The paper-aligned Amber suites live under empirical_tests/.
The evaluator-sized subset is in empirical_tests/evaluation_tests/.
Before running the empirical suites, verify that the host and Docker setup expose a real Vulkan-capable GPU. The host-side checks below help identify which supported Linux path the launcher will use:
```sh
docker run hello-world
vulkaninfo --summary
# Typical Intel/AMD DRM path:
ls /dev/dri/renderD*
# Typical NVIDIA path:
nvidia-smi
```

If the NVIDIA path is detected but Docker is not yet configured to expose the GPU to containers, the helper script will report:

```
error: GPU platform 'nvidia' requires Docker to be configured for NVIDIA Container Toolkit; run 'sudo nvidia-ctk runtime configure --runtime=docker' and restart Docker
```

In that case, run:

```sh
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

The launcher defaults to `--gpu-platform auto`:
- it chooses `nvidia` when `nvidia-smi` works on the host
- otherwise it chooses `drm` when `/dev/dri/renderD*` is available
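The auto-detection order can be sketched as a small function. This is our reconstruction of the documented behavior, not the launcher's actual code; the probe results are passed as arguments so the logic is testable without a GPU:

```sh
# Sketch of --gpu-platform auto: prefer nvidia, fall back to drm.
choose_platform() {
    has_nvidia_smi="$1"   # "yes" if nvidia-smi works on the host
    has_render_node="$2"  # "yes" if /dev/dri/renderD* exists
    if [ "$has_nvidia_smi" = yes ]; then
        echo nvidia
    elif [ "$has_render_node" = yes ]; then
        echo drm
    else
        echo none
    fi
}

choose_platform yes no    # prints: nvidia
choose_platform no yes    # prints: drm
choose_platform no no     # prints: none
```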
You can override the choice explicitly:
```sh
scripts/docker-run-empirical-tests.sh --gpu-platform drm
scripts/docker-run-empirical-tests.sh --gpu-platform nvidia
```

Suffix meanings:
- `ww`: write-write race pattern
- `rw`: read-write race pattern
- `wr`: write-read race pattern
| Suite | Paper mapping | Purpose |
|---|---|---|
| `cm_wr` | CM | Tests whether memory operations execute collectively. |
| `sm_ww`, `sm_rw`, `sm_wr` | SM | Tests whether memory operations execute synchronously. |
| `scf_ww`, `scf_rw`, `scf_wr` | SCF | Tests whether threads have synchronous control flow. |
| `sso_ww`, `sso_rw`, `sso_wr` | SSO | Tests whether subgroup operations synchronize with the associated control flow. |
Under the current RA-enabled executable semantics for the currently supported global/shared atomics:
- `scf_wr`, `sm_wr`, and `sso_wr` are peer-visibility witnesses and are expected to pass under `Plain` but fail under `RA`.
- `scf_ww`, `sm_ww`, and `sso_ww` are indexed two-address overwrite witnesses over a two-element storage-buffer array, and are also expected to pass under `Plain` but fail under `RA`.
- `cm_wr` remains a collective-uniformity check rather than an RA-distinguishing witness.
Run:
```sh
scripts/docker-run-empirical-tests.sh
```

Expected runtime:
- about 5-8 minutes with a prebuilt image
- about 12-16 minutes if the empirical image must be built first
Generated outputs:
- `build/empirical_evaluation_results/summary.md`
- `build/empirical_evaluation_results/summary.csv`
- `build/empirical_evaluation_results/all_results.csv`
- `build/empirical_evaluation_results/vulkaninfo_summary.txt`
- `build/empirical_evaluation_results/gpu_platform.txt`
- `build/empirical_evaluation_results/<suite>/*.log`
How to interpret the output:
- `summary.md` gives the per-suite pass/fail counts.
- `all_results.csv` gives one row per Amber file.
- `vulkaninfo_summary.txt` records the Vulkan devices visible in the container.
- `<suite>/*.log` contains the raw Amber output for each test.
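As a sketch of how `all_results.csv` might be sliced per suite, the following uses an assumed `suite,test,result` column layout; the stub rows are illustrative, so adjust the field numbers to the real header:

```sh
# Count results per suite from a CSV of the assumed form: suite,test,result
csv=$(mktemp)
cat > "$csv" <<'EOF'
suite,test,result
sm_ww,sm_ww_0,pass
sm_ww,sm_ww_1,fail
scf_wr,scf_wr_0,pass
EOF
# Skip the header, then count (suite, result) pairs.
per_suite=$(awk -F, 'NR > 1 { n[$1 "," $3]++ } END { for (k in n) print k, n[k] }' "$csv" | sort)
echo "$per_suite"
rm "$csv"
```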
Important environment note:
- The empirical campaign is environment-sensitive.
- The repository documents Linux `drm` and Linux `nvidia` paths only.
- The bundled empirical campaign was curated from Intel Iris Xe runs, so exact outcomes may vary on AMD or NVIDIA hardware.
The full empirical campaign is under empirical_tests/full_tests/.
Run:
```sh
scripts/docker-run-empirical-tests.sh --full
```

Runtime note:
- this run may take days, depending on the GPU and host
Smaller input:
- if you want the same workflow on a manageable scale, use the evaluation subset from Section 5 instead
Generated outputs:
- `build/empirical_full_results/summary.md`
- `build/empirical_full_results/summary.csv`
- `build/empirical_full_results/all_results.csv`
- `build/empirical_full_results/<suite>/*.log`
The main paper-to-artifact connections are:
- Dynamic blocks (Sec. 3).
  - `DynamicBlock` in `forward-progress/validation/MCProgram.tla` defines the record storing the SIS, thread sets, block label, identifier, merge stack, and child blocks.
  - `BranchUpdate` and `BranchConditionalUpdateSubgroup` in `forward-progress/validation/MCProgram.tla` evolve the dynamic execution graph when control flow splits and reconverges.
- Instruction classes (Sec. 4 and Table 1).
  - `IsCollectiveInstruction` and `IsSynchronousInstruction` in `forward-progress/validation/MCProgram.tla` encode the direct-model classification used by the executable semantics.
- Dynamic-block evolution (Sec. 4).
  - Independent branching is handled by `BranchUpdate`, `OpBranch`, and `OpBranchConditional`.
  - Collective control-flow entry and exit are handled by `BranchConditionalUpdateSubgroup`, `OpBranchCollective`, `OpBranchConditionalCollective`, and `OpLabelCollective`.
- Thread-level semantics (Sec. 4).
  - `ExecuteInstruction` in `forward-progress/validation/MCThreads.tla` dispatches to memory, collective control-flow, and subgroup-operation handlers.
- System-level specification (Sec. 4).
  - `forward-progress/validation/MCProgressModel.tla` assembles the program, threads, scheduler, `Init`, `Step`, and `Next` relations used by TLC.
- Initial state.
  - `InitProgram` in `forward-progress/validation/MCProgram.tla` expands to `InitDB /\ InitGPU /\ InitGlobalCounter`.
  - `InitThreads` in `forward-progress/validation/MCThreads.tla` initializes per-thread PCs and local state.
  - `InitScheduler` and `InitState` in `forward-progress/validation/MCProgressModel.tla` initialize the chosen scheduler (`HSA` or `OBE`) and model state.
- Transition relation.
  - `Step` and `Next` in `forward-progress/validation/MCProgressModel.tla` invoke `ExecuteInstruction` and `UpdateFairExecutionSet` to advance execution while enforcing the scheduler and fairness structure.
  - Instruction handlers in `forward-progress/validation/MCThreads.tla` perform the primed assignments, and branching handlers call back into `forward-progress/validation/MCProgram.tla` to evolve dynamic blocks.
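For orientation, the usual TLA+ shape of such a specification is sketched below. This is the generic fair-interleaving pattern, not the artifact's actual definitions; `vars` and `Threads` are placeholder names, and the real `Init`, `Step`, and `Next` live in `forward-progress/validation/MCProgressModel.tla`:

```tla
(* Generic shape of a fair interleaving specification, for orientation only. *)
Next == \E t \in Threads : Step(t)
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
```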
If you want to modify or extend the artifact, the shortest useful mental model is:
- Input shader.
  - Start from a GLSL compute shader under `example_shader_program/` for a basic pipeline check or under `litmus_tests/` for the paper-facing TLA+ evaluation path.
  - The shader carries TLA+-specific annotations such as `#pragma scheduler(...)` and `layout(tla_synchronization_id = ...) in;`.
- Shader compilation and driver script.
  - `scripts/docker-run-tlaplus.sh` is the main user-facing entrypoint.
  - It invokes the modified `glslang` toolchain, emits SPIR-V, runs the Homunculus frontend, and then launches TLC.
- SPIR-V frontend and code generation.
  - `Homunculus/` parses the SPIR-V plus the custom annotations and writes the generated `build/MCProgram.tla`.
  - This is where shader-specific facts such as `Synchronization`, `ThreadInstructions`, `ThreadArguments`, and `Blocks` are emitted.
- Hand-written executable semantics.
  - `forward-progress/validation/MCProgram.tla` contains the dynamic-block data structures, instruction classification, and control-flow update logic.
  - `forward-progress/validation/MCThreads.tla` contains thread-level instruction handlers and the main `ExecuteInstruction` dispatcher.
  - `forward-progress/validation/MCProgressModel.tla` builds the top-level TLA+ model checked by TLC.
- Generated outputs and experiment drivers.
  - `build/` holds generated SPIR-V, generated TLA+, TLC logs, and empirical summaries.
  - `scripts/docker-run-empirical-tests.sh` is the user-facing entrypoint for the Amber campaign.
  - `empirical_tests/` contains the evaluator-sized subset and the larger full campaign.
In practice, the most common modification paths are:
- Add or change a shader input: edit the GLSL file under `example_shader_program/` for a basic pipeline check or under `litmus_tests/` for the paper-facing litmus suite, rerun `scripts/docker-run-tlaplus.sh`, and inspect the regenerated outputs.
- Change the executable semantics of an existing model: edit `forward-progress/validation/MCProgram.tla` and, if instruction behavior changes, `forward-progress/validation/MCThreads.tla`.
- Add a new synchronization model: follow Appendix B below.
The shaders under example_shader_program/ and litmus_tests/ use small GLSL-side annotations that configure the TLA+ model.
`#pragma scheduler(<scheduler name>)`

Supported schedulers:
- `HSA`
- `OBE`

`layout(tla_subgroup_size = <num>) in;`

`<num>` must be a positive integer.

`layout(tla_num_workgroups = <num>) in;`

`<num>` must be a positive integer.
`layout(tla_synchronization_id = <id>) in;`

| id | Label | Collective instructions | Synchronous instructions | Independent instructions |
|---|---|---|---|---|
| 1 | SSO | subgroup ops (`OpGroup*`) | none | all remaining instructions |
| 2 | SCF | subgroup ops plus control flow | none | others |
| 3 | SM | subgroup ops plus control flow | `OpAtomicLoad`, `OpAtomicStore`, `OpAtomicOr` | others |
| 4 | CM | subgroup ops plus control flow plus all memory ops | none | others |
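Putting the annotations together, a minimal annotated compute-shader preamble might look as follows. Only the annotation syntax comes from the descriptions above; the version directive, local size, and empty body are illustrative, and the exact placement accepted by the modified `glslang` may differ:

```glsl
#version 450
#pragma scheduler(HSA)
layout(tla_num_workgroups = 1) in;
layout(tla_subgroup_size = 2) in;
layout(tla_synchronization_id = 4) in;  // id 4 selects the CM model
layout(local_size_x = 2) in;

void main() {
    // shader body under test
}
```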
This appendix is for readers who want to add a fifth direct model or modify how one of the current models classifies instructions.
The shortest safe workflow is:
- Extend the frontend encoding of the model identifier.
  - The GLSL-side annotation is `layout(tla_synchronization_id = <id>) in;`.
  - Homunculus parses that identifier and currently rejects values larger than `4`.
  - Update the parser/checking logic under `Homunculus/compiler/src/codegen/context.rs` so the new identifier is accepted.
- Map the new identifier to a model label in the generated TLA+.
  - `Homunculus/compiler/src/codegen/common.rs` maps the numeric identifier to the emitted `Synchronization == "..."` assignment.
  - Add the new numeric case there so generated `build/MCProgram.tla` files carry a stable label for the new model.
- Decide how the new model classifies instructions.
  - `forward-progress/validation/MCProgram.tla` defines:
    - `CollectiveInstructionSet`
    - `SynchronousInstructionSet`
    - `IndependentInstructionSet`
    - `IsCollectiveInstruction`
    - `IsSynchronousInstruction`
  - For many model extensions, this is the main place that must change.
  - If the new model only reclassifies existing instructions, these definitions may be sufficient.
- Update dynamic-block behavior if the new model changes collective control flow.
  - `forward-progress/validation/MCProgram.tla` also contains the dynamic-block machinery.
  - The key control-flow update operators are:
    - `BranchUpdate`
    - `BranchConditionalUpdateSubgroup`
  - If the new model changes when divergence, reconvergence, or subgroup-wide branching should be collective, inspect these operators carefully.
- Update instruction dispatch if the new model needs new execution behavior.
  - `forward-progress/validation/MCThreads.tla` uses `ExecuteInstruction` to choose between collective, synchronous, and independent handlers.
  - For example, atomic instructions already have separate collective, synchronous, and independent code paths.
  - If the new model requires a genuinely new execution mode rather than a reclassification of existing modes, this file will need new handlers and new dispatch cases.
- Regenerate and inspect a minimal example.
  - Add a small shader under `example_shader_program/` that uses the new synchronization id.
  - Run `scripts/docker-run-tlaplus.sh` on that shader.
  - Confirm in `build/MCProgram.tla` that:
    - `Synchronization == "..."` uses the new label
    - the generated `ThreadInstructions`, `ThreadArguments`, and `Blocks` are sensible
    - the instruction classification in the hand-written modules matches the intended semantics
- Add a regression test.
  - Add at least one minimal example or litmus-style shader that distinguishes the new model from the existing four.
  - Document the expected result in this README so future evaluators do not have to infer it from raw TLC output.
The minimum files to inspect for this extension path are therefore:
- `Homunculus/compiler/src/codegen/context.rs`
- `Homunculus/compiler/src/codegen/common.rs`
- `forward-progress/validation/MCProgram.tla`
- `forward-progress/validation/MCThreads.tla`
- `forward-progress/validation/MCProgressModel.tla`
Supported instructions:
- Variables and memory: `OpVariable`, `OpLoad`, `OpStore`, `OpAtomicLoad`, `OpAtomicStore`
- Control flow: `OpBranch`, `OpBranchConditional`, `OpSwitch`, `OpSelectionMerge`, `OpLoopMerge`, `OpLabel`, `OpReturn`
- Logic and comparisons: `OpLogicalOr`, `OpLogicalAnd`, `OpLogicalEqual`, `OpLogicalNotEqual`, `OpLogicalNot`, `OpEqual`, `OpNotEqual`, `OpLess`, `OpLessOrEqual`, `OpGreater`, `OpGreaterOrEqual`
- Arithmetic and atomics: `OpAdd`, `OpSub`, `OpMul`, `OpAtomicAdd`, `OpAtomicSub`, `OpAtomicExchange`, `OpAtomicCompareExchange`
- Subgroup and synchronization: `OpGroupAll`, `OpGroupAny`, `OpGroupNonUniformAll`, `OpGroupNonUniformAny`, `OpControlBarrier`
Notes:
- Supported scalar types are `int`, `uint`, and `bool`.
- The executable semantics and bundled tests focus on the four direct SIMT-Step models: CM, SM, SCF, and SSO.
- The weaker `Spec` and `Symb` models discussed in Table 1 are not implemented in the artifact toolchain.
- The artifact supports the SPIR-V and GLSL subset needed by the included examples and tests; it is not intended as a complete SPIR-V implementation.
- The model uses sequentially consistent-style reasoning in the executable semantics, while the empirical Vulkan tests necessarily use the strongest portable Vulkan atomics available to Amber and GLSL.