
Conversation

@PaliC (Contributor) commented Sep 30, 2025

Here we introduce the model backend. The idea is to start codifying the ideas from jiannanWang/BackendBenchExamples. Specifically, this PR sets up the infra around this suite (there is more to add, which I'll add later). Also, let me know if I should break this up (a lot of the code is for testing / examples).

Basic Design
This suite leverages the existing infra we already have from the torchbench suite. In theory these are fairly useful op-level correctness / performance tests which we have a path to expanding, so this seems reasonable. The unique thing this suite does is that it loads PyTorch models and then runs those models using the ops we are testing (from a directory bench). We then report various results about their implementation back to the user.

Model registration
This PR creates a way of adding models to the suite and automatically validates them through CI. The way these models are added is detailed in this readme. The tl;dr is that we use a format similar to kernelbench and SakanaAI/robust-kbench, where we pair model code with a config. Importantly, the configs contain initialization code, forward-pass arguments (both in a format similar to torchbench), and a list of ops in the forward and backward passes. These ops are fairly important, as they are what we want to point out to the researcher when they are optimizing a model.
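
For illustration only, a single model's config might look roughly like the Python dict below. The field names (model_name, init, test_cases, forward_ops, backward_ops) and the input serialization are assumptions for this sketch; the readme linked above is the authoritative schema.

# Hypothetical sketch of a model config; field names are illustrative, not the real schema.
EXAMPLE_CONFIG = {
    "model_name": "SmokeTestModel",
    # Initialization code used to construct the model instance.
    "init": "SmokeTestModel(dim=64)",
    # Forward-pass arguments per test case, serialized similarly to torchbench inputs.
    "test_cases": {
        "small_batch": "((T([2, 64], f32),), {})",
        "medium_batch": "((T([8, 64], f32),), {})",
    },
    # Ops expected in the forward and backward passes; these are what we surface
    # to the researcher optimizing the model.
    "forward_ops": ["aten.mm.default"],
    "backward_ops": ["aten.mm.default"],
}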

We further verify that these registrations are correct through CI. Specifically, test/test_model_ops_configs.py ensures the configs are formatted correctly and that the operators they mention are testable, while test/test_model_ops_coverage.py makes sure that all of the ops specified in a config are actually exercised by the model (and, in turn, ensures the models + configs are runnable).
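
As a rough sketch of the coverage idea only (not the actual test code), one way to check that every op listed in a config is actually dispatched is to run the model under a TorchDispatchMode and record the ops that fire. OpRecorder and check_coverage below are hypothetical names.

import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpRecorder(TorchDispatchMode):
    """Records every aten op dispatched while the mode is active."""
    def __init__(self):
        super().__init__()
        self.seen = set()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.seen.add(str(func))  # e.g. "aten.mm.default"
        return func(*args, **(kwargs or {}))

def check_coverage(model, args, declared_ops):
    # Run a forward + backward pass and confirm every declared op was dispatched.
    recorder = OpRecorder()
    with recorder:
        model(*args).sum().backward()
    missing = set(declared_ops) - recorder.seen
    assert not missing, f"ops declared in the config but never dispatched: {missing}"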

How the suite works
Effectively, this suite looks at the registered models and grabs all of the ops from their configs. It then runs the torchbench suite tests against these ops (therefore, it supports --topn like torchbench). Op filtering is effectively done by the model config files themselves, so instead this suite offers a "model-filter" option to filter by model. Afterwards, it leverages directorybench to run the models themselves. Right now only correctness testing is supported.
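
A minimal sketch of that flow, with assumed names (collect_ops, forward_ops/backward_ops keys) rather than the suite's real internals: collect the ops declared by each registered (and optionally filtered) model config, then hand them to the torchbench-style op tests.

def collect_ops(model_configs, model_filter=None):
    # model_configs: {model_name: config dict}; config keys are illustrative.
    ops = set()
    for name, config in model_configs.items():
        if model_filter and name not in model_filter:
            continue
        ops.update(config.get("forward_ops", []))
        ops.update(config.get("backward_ops", []))
    return sorted(ops)

# The resulting op list is what the torchbench-style tests are run against;
# the models themselves are then run end to end via directorybench.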

Added models
I added 2 models here as examples.

SmokeTestModel - This is a simple model that uses torch.ops.aten.mm, as we can implement a correct version of this op (a hedged sketch of what such a model could look like follows this list).
ToyCoreOpsModel - This is a model which explicitly exercises backward passes of ops that are covered by both torchbench and core.
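
A hedged sketch of what a SmokeTestModel-style module could look like; the committed example under BackendBench/suites/models/ is the source of truth, and the class name and dimensions here are made up.

import torch
import torch.nn as nn

class SmokeTestModelSketch(nn.Module):
    """Tiny model whose forward pass is dominated by aten.mm."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # torch.mm dispatches to aten.mm.default, the op under test.
        return torch.mm(x, self.weight)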

Notes for the reviewer
This is a fairly lengthy PR, as it adds a brand-new feature (with examples). So here's a breakdown of what each file does:
BackendBench/scripts/main.py - CLI changes to support the new suite
BackendBench/suites/models/* - example models SmokeTestModel and ToyCoreOpsModel
BackendBench/suite/models/README.md - Readme linked above
BackendBench/suite/model.py - the actual suite code. Primarily this suite just piggybacks off of the torchbench suite. The two things it adds are testing individual models for correctness and printing out the results of the above
BackendBench/eval_model.py - the actual logic for testing correctness (a rough sketch of the idea appears after this list)
test/test_model_ops_configs.py - described above in model registration
test/test_model_ops_coverage.py - described above in model registration

test/test_model_suite.py - model suite unit tests
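
For context, a rough sketch of the comparison idea in eval_model.py, with assumed helper names and tolerances: run the model once in eager mode and once with the directory backend enabled, then compare the outputs and per-parameter gradients.

import torch

def outputs_and_grads_match(reference, candidate, rtol=1e-4, atol=1e-4):
    # reference / candidate: (output, [gradients]) from the eager run and from
    # the run with the custom kernels enabled, respectively.
    ref_out, ref_grads = reference
    cand_out, cand_grads = candidate
    if not torch.allclose(ref_out, cand_out, rtol=rtol, atol=atol):
        return False
    if len(ref_grads) != len(cand_grads):
        return False
    return all(
        torch.allclose(r, c, rtol=rtol, atol=atol)
        for r, c in zip(ref_grads, cand_grads)
    )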

Testing
If you run something like the following, with a working mm implementation and everything else relevant being a watermarked implementation:

uv run python BackendBench/scripts/main.py --suite model --backend directory --topn 1
You should get output like the following (with the two models we have), which makes sense:

...
[2025-10-01 05:59:53][INFO][main.py] ============================================================
[2025-10-01 05:59:53][INFO][main.py] MODEL EVALUATION RESULTS
[2025-10-01 05:59:53][INFO][main.py] ============================================================
[2025-10-01 05:59:53][INFO][model.py]
Model: ToyCoreOpsModel
[2025-10-01 05:59:53][INFO][model.py] Status: ✗ Failed (0/3 tests)
[2025-10-01 05:59:53][INFO][model.py] ✗ small_batch
[2025-10-01 05:59:53][INFO][model.py] Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8
[2025-10-01 05:59:53][INFO][model.py] ✗ medium_batch
[2025-10-01 05:59:53][INFO][model.py] Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8
[2025-10-01 05:59:53][INFO][model.py] ✗ large_input
[2025-10-01 05:59:53][INFO][model.py] Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8
[2025-10-01 05:59:53][INFO][model.py]
Model: SmokeTestModel
[2025-10-01 05:59:53][INFO][model.py] Status: ✓ Passed (3/3 tests)
[2025-10-01 05:59:53][INFO][model.py] ✓ small_batch
[2025-10-01 05:59:53][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients)
[2025-10-01 05:59:53][INFO][model.py] ✓ medium_batch
[2025-10-01 05:59:53][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients)
[2025-10-01 05:59:53][INFO][model.py] ✓ large_batch
[2025-10-01 05:59:53][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients)
[2025-10-01 05:59:53][INFO][main.py] ============================================================
[2025-10-01 05:59:53][INFO][output.py] Full results saved to generated_kernels/full_results.json
[2025-10-01 05:59:53][INFO][output.py] Operator summary CSV saved to generated_kernels/operator_summary.csv
[2025-10-01 05:59:53][INFO][output.py] Failed operations log saved to generated_kernels/failed_tests.json
[2025-10-01 05:59:53][INFO][output.py] Overall summary saved to generated_kernels/OVERALL_SUMMARY.md
[2025-10-01 05:59:53][INFO][output.py] Results saved to directory: /home/dev/repos/BackendBench/generated_kernels
Things to Support later
Numeric stability support over some set of iterations (i.e. compare a kernelized model against eager for 1000 iterations of training; a rough sketch follows this list)
Performance testing (jiannanWang/BackendBenchExamples is a great reference point)
Allowing folks to run custom ops for the model and just testing end-to-end correctness / performance (we will need to think this through more thoroughly to ensure correctness). Effectively, this is how we'll end up handling fusions later on.
More models. The two main ones are one that uses all of the backward passes we currently support, and a second that focuses on norms.
Better integrate correctness results with our logging
Add a simple profiling script where, given the model + args, it autogenerates the forward and backward ops for the config.
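
A rough illustration of the numeric-stability idea from the first bullet above; all names here (train_losses, make_batch, enable_backend) are hypothetical, and it assumes make_batch is deterministic so both runs see the same data.

import contextlib
import copy
import torch

def train_losses(model, make_batch, steps=1000, lr=1e-3, backend_ctx=None):
    # backend_ctx: a context-manager factory that enables the custom kernels
    # (e.g. the directory backend); None means plain eager execution.
    m = copy.deepcopy(model)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    ctx = backend_ctx if backend_ctx is not None else contextlib.nullcontext
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        with ctx():
            loss = m(make_batch()).sum()
            loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# eager = train_losses(model, make_batch)
# kernelized = train_losses(model, make_batch, backend_ctx=enable_backend)
# drift = [abs(a - b) for a, b in zip(eager, kernelized)]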

The meta-cla bot added the CLA Signed label on Sep 30, 2025
PaliC requested review from jiannanWang and msaroufim on October 1, 2025 05:49
PaliC marked this pull request as ready for review on October 1, 2025 05:49
@jiannanWang (Contributor) commented Oct 1, 2025

Hi Sahan, great work! I feel this PR is quite large and hard to review. Would it be possible to split it into smaller, more manageable chunks? For example, you could extract the main functional part and submit that first, then add the examples and the result-reporting part in separate PRs.

At a high level, I feel this PR combines several components, which makes it a bit complex. For the model backend, I recommend focusing on end-to-end testing only, so the test suite remains simpler and more maintainable.

As a follow-up idea (not for this PR), I noticed that the current model backend requires the model definition code. Would it be possible to support loading models directly from Hugging Face? This would make it easier to test with real models and broaden the scope of testing.

return models


class ModelSuite(TorchBenchTestSuite):
Contributor:

Do we need to stick to torchbench? Consider making this an argument to choose from different test suites.

Contributor:

BTW, why do we need torchbench here? For the design, I would expect the model backend to only test end-to-end correctness and performance, so the test suites would not overlap with each other.

@PaliC (Contributor, Author) commented Oct 2, 2025:

Testing models end to end is a very different product than testing ops. At the moment, the model suite is designed with the intention that, given kernels for this set of ops, we'll benchmark a model for you using your implementations of the PyTorch ops, in addition to benchmarking the ops themselves. Therefore, we would want to benchmark the ops.

As for why do them together: IMO, design-wise it's better to just run one thing to get all of your outputs if possible. The main benefit of doing only end-to-end model testing, IMO, is that it is a bit easier to adapt to fusions.

As for using torchbench specifically, it's mostly for simplicity. Inputs differ among the suites, and they test different sets of things. If we expect op coverage for a specific model config, a researcher shouldn't have to rely on the loose guarantee of "it's covered somewhere among all suites"; it should be the tighter guarantee of "run this suite and it benchmarks all of the ops you need for this model". Torchbench is the best suite here, as we can expand it pretty easily.

Fortunately, the torchbench integration is pretty easy to break off into a separate PR, so we can continue discussion there!

if backend_enabled:
with BackendBench.BackendBench.enable(kernel_dir=kernel_dir):
output = model(*args, **kwargs)
loss = output.sum()
Contributor:

In my experience this might be troublesome if the add or sum kernels are incorrect or do not consider all input dtypes.

@PaliC (Contributor, Author):

I think that should be fine atm as that's how directorybench is structured.

If we change directorybench to work with overloads, this should still work as ops are defined by their overload in the configs.

# Collect gradients
grads = _collect_gradients(model, args, kwargs)

return output.detach(), grads
Contributor:

I can see that this function tests the forward pass (output) and the backward pass (grad). What's missing here is the parameter update (optim.step). Is there a reason to not test that? E.g., is it out of the scope of our kernel registration?

@PaliC (Contributor, Author) commented Oct 2, 2025:

Thinking a bit about this, I think it's out of scope for kernel registration: assuming the gradients and output are correct, the optimizer should behave the same, and once we add numerical-correctness testing over training, it will be accounted for.

model_config,
test_args,
backend_enabled=False,
kernel_dir=kernel_dir,
Contributor:

Do we need kernel_dir here given backend_enabled is False?

@PaliC (Contributor, Author):

Fair, it's a bit bug-prone.

@PaliC (Contributor, Author) commented Oct 2, 2025

> As a follow-up idea (not for this PR), I noticed that the current model backend requires the model definition code. Would it be possible to support loading models directly from Hugging Face? This would make it easier to test with real models and broaden the scope of testing.

I think this would require some design changes to support (i.e., how we handle ops and the loaders), but yeah, I agree. The important first bit is the ability to load and test arbitrary models, though.

PaliC changed the title from "Add model backend" to "Add model backend, add tests" on Oct 2, 2025
PaliC mentioned this pull request on Oct 2, 2025
PaliC marked this pull request as draft on October 2, 2025 03:50
@PaliC (Contributor, Author) commented Oct 2, 2025

Broke everything out into this stack here for ease of reviewing #183

PaliC closed this on Oct 2, 2025