
Conversation

@PaliC (Contributor) commented Sep 30, 2025

Here we introduce the model backend. The idea is to start codifying the ideas from jiannanWang/BackendBenchExamples. Specifically, this PR sets up the infra around this suite (there is more to add, which I'll add later). Also, let me know if I should break this up (a lot of the code is for testing / examples).

Basic Design
This suite leverages the existing infra we already have from the torchbench suite. In theory these are fairly useful op-level correctness / performance tests which we have a path to expanding, so this seems reasonable. The unique thing this suite does is that it loads PyTorch models and then runs those models using the ops we are testing (from a directory bench). We then report various results about their implementation back to the user.

Model registration
This PR creates a way of adding models to the suite and automatically validates them through CI. The way these models are added is detailed in this readme. The tl;dr is that we use a format similar to kernelbench and SakanaAI/robust-kbench, where we pair model code with a config. Importantly, the configs contain initialization code, forward-pass arguments (both in a format similar to torchbench), and a list of ops in the forward and backward passes. These ops are fairly important, as they are what we want to point out to the researcher when they are optimizing a model.
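
For illustration only, a single model's config might look roughly like the Python dict below. The field names (model_name, init, test_cases, forward_ops, backward_ops) and the input serialization are assumptions for this sketch; the readme linked above is the authoritative schema.

# Hypothetical sketch of a model config; field names are illustrative, not the real schema.
EXAMPLE_CONFIG = {
    "model_name": "SmokeTestModel",
    # Initialization code used to construct the model instance.
    "init": "SmokeTestModel(dim=64)",
    # Forward-pass arguments per test case, serialized similarly to torchbench inputs.
    "test_cases": {
        "small_batch": "((T([2, 64], f32),), {})",
        "medium_batch": "((T([8, 64], f32),), {})",
    },
    # Ops expected in the forward and backward passes; these are what we surface
    # to the researcher optimizing the model.
    "forward_ops": ["aten.mm.default"],
    "backward_ops": ["aten.mm.default"],
}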

We further verify that these registrations are correct through CI. Specifically, test/test_model_ops_configs.py ensures the configs are formatted correctly and that the operators they mention are testable, while test/test_model_ops_coverage.py makes sure that all of the ops specified in a config are actually exercised by the model (and, in turn, ensures the models + configs are runnable).
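
As a rough sketch of the coverage idea only (not the actual test code), one way to check that every op listed in a config is actually dispatched is to run the model under a TorchDispatchMode and record the ops that fire. OpRecorder and check_coverage below are hypothetical names.

import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpRecorder(TorchDispatchMode):
    """Records every aten op dispatched while the mode is active."""
    def __init__(self):
        super().__init__()
        self.seen = set()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.seen.add(str(func))  # e.g. "aten.mm.default"
        return func(*args, **(kwargs or {}))

def check_coverage(model, args, declared_ops):
    # Run a forward + backward pass and confirm every declared op was dispatched.
    recorder = OpRecorder()
    with recorder:
        model(*args).sum().backward()
    missing = set(declared_ops) - recorder.seen
    assert not missing, f"ops declared in the config but never dispatched: {missing}"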

How the suite works
Effectively, this suite looks at the registered models and grabs all of the ops from their configs. It then runs the torchbench suite tests against these ops (therefore, it supports --topn like torchbench). Op filtering is effectively done by the model config files themselves, so instead this suite offers a "model-filter" option to filter by model. Afterwards, it leverages directorybench to run the models themselves. Right now only correctness testing is supported.
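
A minimal sketch of that flow, with assumed names (collect_ops, forward_ops/backward_ops keys) rather than the suite's real internals: collect the ops declared by each registered (and optionally filtered) model config, then hand them to the torchbench-style op tests.

def collect_ops(model_configs, model_filter=None):
    # model_configs: {model_name: config dict}; config keys are illustrative.
    ops = set()
    for name, config in model_configs.items():
        if model_filter and name not in model_filter:
            continue
        ops.update(config.get("forward_ops", []))
        ops.update(config.get("backward_ops", []))
    return sorted(ops)

# The resulting op list is what the torchbench-style tests are run against;
# the models themselves are then run end to end via directorybench.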

Added models
I added 2 models here as examples.

SmokeTestModel - This is a simple model that uses torch.ops.aten.mm, as we can implement a correct version of this op (a hedged sketch of what such a model could look like follows this list).
ToyCoreOpsModel - This is a model which explicitly exercises backward passes of ops that are covered by both torchbench and core.
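
A hedged sketch of what a SmokeTestModel-style module could look like; the committed example under BackendBench/suites/models/ is the source of truth, and the class name and dimensions here are made up.

import torch
import torch.nn as nn

class SmokeTestModelSketch(nn.Module):
    """Tiny model whose forward pass is dominated by aten.mm."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # torch.mm dispatches to aten.mm.default, the op under test.
        return torch.mm(x, self.weight)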

Notes for the reviewer
This is a fairly lengthy PR, as it adds a brand-new feature (with examples). So here's a breakdown of what each file does:
BackendBench/scripts/main.py - CLI changes to support the new suite
BackendBench/suites/models/* - example models SmokeTestModel and ToyCoreOpsModel
BackendBench/suite/models/README.md - Readme linked above
BackendBench/suite/model.py - the actual suite code. Primarily this suite just piggybacks off of the torchbench suite. The two things it adds are testing individual models for correctness and printing out the results of the above
BackendBench/eval_model.py - the actual logic for testing correctness (a rough sketch of the idea appears after this list)
test/test_model_ops_configs.py - described above in model registration
test/test_model_ops_coverage.py - described above in model registration

test/test_model_suite.py - model suite unit tests
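
For context, a rough sketch of the comparison idea in eval_model.py, with assumed helper names and tolerances: run the model once in eager mode and once with the directory backend enabled, then compare the outputs and per-parameter gradients.

import torch

def outputs_and_grads_match(reference, candidate, rtol=1e-4, atol=1e-4):
    # reference / candidate: (output, [gradients]) from the eager run and from
    # the run with the custom kernels enabled, respectively.
    ref_out, ref_grads = reference
    cand_out, cand_grads = candidate
    if not torch.allclose(ref_out, cand_out, rtol=rtol, atol=atol):
        return False
    if len(ref_grads) != len(cand_grads):
        return False
    return all(
        torch.allclose(r, c, rtol=rtol, atol=atol)
        for r, c in zip(ref_grads, cand_grads)
    )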

Testing
If you run something like the following, with a working mm implementation and everything else relevant being a watermarked implementation:

uv run python BackendBench/scripts/main.py --suite model --backend directory --topn 1
You should get output like the following (with the two models we have), which makes sense:

...
[2025-10-01 05:59:53][INFO][main.py] ============================================================
[2025-10-01 05:59:53][INFO][main.py] MODEL EVALUATION RESULTS
[2025-10-01 05:59:53][INFO][main.py] ============================================================
[2025-10-01 05:59:53][INFO][model.py]
Model: ToyCoreOpsModel
[2025-10-01 05:59:53][INFO][model.py] Status: ✗ Failed (0/3 tests)
[2025-10-01 05:59:53][INFO][model.py] ✗ small_batch
[2025-10-01 05:59:53][INFO][model.py] Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8
[2025-10-01 05:59:53][INFO][model.py] ✗ medium_batch
[2025-10-01 05:59:53][INFO][model.py] Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8
[2025-10-01 05:59:53][INFO][model.py] ✗ large_input
[2025-10-01 05:59:53][INFO][model.py] Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8
[2025-10-01 05:59:53][INFO][model.py]
Model: SmokeTestModel
[2025-10-01 05:59:53][INFO][model.py] Status: ✓ Passed (3/3 tests)
[2025-10-01 05:59:53][INFO][model.py] ✓ small_batch
[2025-10-01 05:59:53][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients)
[2025-10-01 05:59:53][INFO][model.py] ✓ medium_batch
[2025-10-01 05:59:53][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients)
[2025-10-01 05:59:53][INFO][model.py] ✓ large_batch
[2025-10-01 05:59:53][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients)
[2025-10-01 05:59:53][INFO][main.py] ============================================================
[2025-10-01 05:59:53][INFO][output.py] Full results saved to generated_kernels/full_results.json
[2025-10-01 05:59:53][INFO][output.py] Operator summary CSV saved to generated_kernels/operator_summary.csv
[2025-10-01 05:59:53][INFO][output.py] Failed operations log saved to generated_kernels/failed_tests.json
[2025-10-01 05:59:53][INFO][output.py] Overall summary saved to generated_kernels/OVERALL_SUMMARY.md
[2025-10-01 05:59:53][INFO][output.py] Results saved to directory: /home/dev/repos/BackendBench/generated_kernels
Things to Support later
Numeric stability support over some set of iterations (i.e. compare a kernelized model against eager for 1000 iterations of training; a rough sketch follows this list)
Performance testing (jiannanWang/BackendBenchExamples is a great reference point)
Allowing folks to run custom ops for the model and just testing end-to-end correctness / performance (we will need to think this through more thoroughly to ensure correctness). Effectively, this is how we'll end up handling fusions later on.
More models. The two main ones are one that uses all of the backward passes we currently support, and a second that focuses on norms.
Better integrate correctness results with our logging
Add a simple profiling script where, given the model + args, it autogenerates the forward and backward ops for the config.
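
A rough illustration of the numeric-stability idea from the first bullet above; all names here (train_losses, make_batch, enable_backend) are hypothetical, and it assumes make_batch is deterministic so both runs see the same data.

import contextlib
import copy
import torch

def train_losses(model, make_batch, steps=1000, lr=1e-3, backend_ctx=None):
    # backend_ctx: a context-manager factory that enables the custom kernels
    # (e.g. the directory backend); None means plain eager execution.
    m = copy.deepcopy(model)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    ctx = backend_ctx if backend_ctx is not None else contextlib.nullcontext
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        with ctx():
            loss = m(make_batch()).sum()
            loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# eager = train_losses(model, make_batch)
# kernelized = train_losses(model, make_batch, backend_ctx=enable_backend)
# drift = [abs(a - b) for a, b in zip(eager, kernelized)]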

The meta-cla bot added the CLA Signed label on Sep 30, 2025
PaliC requested review from jiannanWang and msaroufim on October 1, 2025 05:49
PaliC marked this pull request as ready for review on October 1, 2025 05:49
@jiannanWang (Contributor) commented Oct 1, 2025

Hi Sahan, great work! I feel this PR is quite large and hard to review. Would it be possible to split it into smaller, more manageable chunks? For example, you could extract the main functional part and submit that first, then add the examples and the result-reporting part in separate PRs.

At a high level, I feel this PR combines several components, which makes it a bit complex. For the model backend, I recommend focusing on end-to-end testing only, so the test suite remains simpler and more maintainable.

As a follow-up idea (not for this PR), I noticed that the current model backend requires the model definition code. Would it be possible to support loading models directly from Hugging Face? This would make it easier to test with real models and broaden the scope of testing.

return models


class ModelSuite(TorchBenchTestSuite):
Contributor:

Do we need to stick to torchbench? Consider making this an argument to choose from different test suites.

Contributor:

BTW, why do we need torchbench here? For the design, I would expect the model backend to only test end-to-end correctness and performance, so the test suites would not overlap with each other.

@PaliC (Contributor, Author) commented Oct 2, 2025:

Testing models end to end is a very different product than testing ops. At the moment, the model suite is designed with the intention that, given kernels for this set of ops, we'll benchmark a model for you using your implementations of the PyTorch ops, in addition to benchmarking the ops themselves. Therefore, we would want to benchmark the ops.

As for why do them together: IMO, design-wise it's better to just run one thing to get all of your outputs if possible. The main benefit of doing only end-to-end model testing, IMO, is that it is a bit easier to adapt to fusions.

As for using torchbench specifically, it's mostly for simplicity. Inputs differ among the suites, and they test different sets of things. If we expect op coverage for a specific model config, a researcher shouldn't have to rely on the loose guarantee of "it's covered somewhere among all suites"; it should be the tighter guarantee of "run this suite and it benchmarks all of the ops you need for this model". Torchbench is the best suite here, as we can expand it pretty easily.

Fortunately, the torchbench integration is pretty easy to break off into a separate PR, so we can continue discussion there!

if backend_enabled:
with BackendBench.BackendBench.enable(kernel_dir=kernel_dir):
output = model(*args, **kwargs)
loss = output.sum()
Contributor:

In my experience this might be troublesome if the add or sum kernels are incorrect or do not consider all input dtypes.

@PaliC (Contributor, Author):

I think that should be fine atm as that's how directorybench is structured.

If we change directorybench to work with overloads, this should still work as ops are defined by their overload in the configs.

# Collect gradients
grads = _collect_gradients(model, args, kwargs)

return output.detach(), grads
Contributor:

I can see that this function tests the forward pass (output) and the backward pass (grad). What's missing here is the parameter update (optim.step). Is there a reason to not test that? E.g., is it out of the scope of our kernel registration?

@PaliC (Contributor, Author) commented Oct 2, 2025:

Thinking a bit about this, I think it's out of scope for kernel registration: assuming the gradients and output are correct, the optimizer should behave the same, and once we add numerical-correctness testing over training, it will be accounted for.

model_config,
test_args,
backend_enabled=False,
kernel_dir=kernel_dir,
Contributor:

Do we need kernel_dir here given backend_enabled is False?

@PaliC (Contributor, Author):

Fair, it's a bit bug-prone.

@PaliC (Contributor, Author) commented Oct 2, 2025

> As a follow-up idea (not for this PR), I noticed that the current model backend requires the model definition code. Would it be possible to support loading models directly from Hugging Face? This would make it easier to test with real models and broaden the scope of testing.

I think this would require some design changes to support (i.e., how we handle ops and the loaders), but yeah, I agree. The important first bit is the ability to load and test arbitrary models, though.

PaliC changed the title from "Add model backend" to "Add model backend, add tests" on Oct 2, 2025
PaliC mentioned this pull request on Oct 2, 2025
PaliC marked this pull request as draft on October 2, 2025 03:50
@PaliC (Contributor, Author) commented Oct 2, 2025

Broke everything out into this stack here for ease of reviewing #183

PaliC closed this on Oct 2, 2025