Skip to content

Commit dda5832

Browse files
committed
[ModelSuite] Refactor TorchBench for ModelSuite inheritance
This PR integrates operator benchmarking into the Model Suite by having it inherit from TorchBenchTestSuite. The suite now extracts operator lists from model configs and benchmarks those operators using TorchBench data before running end-to-end model tests. This approach aligns with the core goal of BackendBench: testing operators. The Model Suite is designed with the assumption that for a given set of ops, users can provide kernel implementations, and the suite will benchmark both the individual ops and the full model using those implementations. The long-term vision is to make this process seamless—allowing users to run both operator and model benchmarking with a single command. TorchBench is used here because it provides the strongest guarantee that running the suite benchmarks all operators required for a specific model configuration. Its dataset is easily extensible and includes realistic tensor shapes derived from actual models. The main design drawback is that this integration makes supporting kernel fusions with models more complex. However, it is preferable to handle kernel fusions in a separate suite regardless. 
### Testing Running `uv run python BackendBench/scripts/main.py --suite model --backend directory --topn 1` with a working mm kernel and other kernels being watermarked yields the expected result (below) ```bash Successfully registered 36 custom operators [2025-10-02 07:21:23][INFO][main.py] ============================================================ [2025-10-02 07:21:23][INFO][main.py] MODEL EVALUATION RESULTS [2025-10-02 07:21:23][INFO][main.py] ============================================================ [2025-10-02 07:21:23][INFO][model.py] Model: ToyCoreOpsModel [2025-10-02 07:21:23][INFO][model.py] Status: ✗ Failed (0/3 tests) [2025-10-02 07:21:23][INFO][model.py] ✗ small_batch [2025-10-02 07:21:23][INFO][model.py] Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8 [2025-10-02 07:21:23][INFO][model.py] ✗ medium_batch [2025-10-02 07:21:23][INFO][model.py] Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8 [2025-10-02 07:21:23][INFO][model.py] ✗ large_input [2025-10-02 07:21:23][INFO][model.py] Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8 [2025-10-02 07:21:23][INFO][model.py] Model: SmokeTestModel [2025-10-02 07:21:23][INFO][model.py] Status: ✓ Passed (3/3 tests) [2025-10-02 07:21:23][INFO][model.py] ✓ small_batch [2025-10-02 07:21:23][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients) [2025-10-02 07:21:23][INFO][model.py] ✓ medium_batch [2025-10-02 07:21:23][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients) [2025-10-02 07:21:23][INFO][model.py] ✓ large_batch [2025-10-02 07:21:23][INFO][model.py] Output match: ✓ Gradients match: ✓ (4 gradients) [2025-10-02 
07:21:23][INFO][main.py] ============================================================ [2025-10-02 07:21:23][INFO][output.py] Full results saved to generated_kernels/full_results.json [2025-10-02 07:21:23][INFO][output.py] Operator summary CSV saved to generated_kernels/operator_summary.csv [2025-10-02 07:21:23][INFO][output.py] Failed operations log saved to generated_kernels/failed_tests.json [2025-10-02 07:21:23][INFO][output.py] Overall summary saved to generated_kernels/OVERALL_SUMMARY.md [2025-10-02 07:21:23][INFO][output.py] Results saved to directory: /home/dev/sapling_repos/BackendBench/generated_kernels Results saved to directory: /home/dev/sapling_repos/BackendBench/generated_kernels Overall summary saved to: /home/dev/sapling_repos/BackendBench/generated_kernels/OVERALL_SUMMARY.md ``` ### Future work with Model Suite #181
1 parent 4409ee2 commit dda5832

File tree

3 files changed

+51
-14
lines changed

3 files changed

+51
-14
lines changed

BackendBench/scripts/main.py

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -184,8 +184,6 @@ def cli(
184184
p,
185185
):
186186
if suite != "torchbench":
187-
if topn_inputs is not None:
188-
raise ValueError("topn-inputs is only supported for torchbench suite")
189187
if check_overhead_dominated_ops:
190188
raise ValueError("check-overhead-dominated-ops is only supported for torchbench suite")
191189

@@ -198,6 +196,10 @@ def cli(
198196
if suite != "model" and model_filter is not None:
199197
raise ValueError("--model-filter is only supported for model suite")
200198

199+
if suite != "model" and suite != "torchbench":
200+
if topn_inputs is not None:
201+
raise ValueError("topn-inputs is only supported for torchbench suite")
202+
201203
setup_logging(log_level)
202204
if ops:
203205
ops = ops.split(",")
@@ -225,7 +227,7 @@ def cli(
225227
torch.bfloat16,
226228
filter=ops,
227229
),
228-
"model": lambda: ModelSuite(filter=model_filter),
230+
"model": lambda: ModelSuite(filter=model_filter, topn=topn_inputs),
229231
}[suite]()
230232

231233
backend_name = backend
@@ -259,11 +261,6 @@ def cli(
259261
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
260262
log_dir = f"backendbench_output_{timestamp}"
261263

262-
if suite.name == "model":
263-
_test_full_models(suite, backend)
264-
# currently model suite does not support op testing so now we're done
265-
return
266-
267264
overall_correctness = []
268265
overall_performance = []
269266
all_correctness_results = []
@@ -332,6 +329,9 @@ def cli(
332329
f"perf@p score (rate of correct samples with a speedup greater than p, p={p}): {perf_at_p_score:.2f}"
333330
)
334331

332+
if suite.name == "model":
333+
_test_full_models(suite, backend)
334+
335335
command = "python -m BackendBench.scripts.main " + " ".join(sys.argv[1:])
336336

337337
# Save results if not disabled

BackendBench/suite/model.py

Lines changed: 35 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,11 @@
55
# LICENSE file in the root directory of this source tree.
66

77
"""
8-
Model Suite for testing models defined in configs.
8+
Model Suite for testing operators defined in toy model configs.
9+
10+
This suite extends TorchBenchTestSuite by reading operator lists from
11+
model configs, validating they exist in the TorchBench dataset, then
12+
filtering to include only those operators.
913
"""
1014

1115
import importlib.util
@@ -16,6 +20,8 @@
1620

1721
from BackendBench.eval_model import eval_model_correctness_test
1822

23+
from .torchbench import TorchBenchTestSuite
24+
1925
logger = logging.getLogger(__name__)
2026

2127

@@ -89,29 +95,52 @@ def load_models(
8995
return models
9096

9197

92-
class ModelSuite:
93-
"""Model Suite for end-to-end model testing."""
98+
class ModelSuite(TorchBenchTestSuite):
99+
"""Model Suite that filters TorchBench operators based on model configs.
100+
101+
This suite reads operator lists from model configs, validates they exist
102+
in the TorchBench dataset, then creates a filtered suite containing only
103+
those operators.
104+
"""
94105

95106
def __init__(
96107
self,
97108
name: str = "model",
98109
filter: Optional[List[str]] = None,
110+
topn: Optional[int] = None,
99111
):
100112
"""Initialize ModelSuite.
101113
102114
Args:
103115
name: Suite name (default: "model")
104116
filter: Optional list of model names to load
117+
topn: Optional limit on number of tests per operator
105118
"""
106119
models_dir = os.path.join(os.path.dirname(__file__), "models")
107120

108121
# Load models
109122
models = load_models(models_dir=models_dir, filter=filter)
110123
logger.info(f"ModelSuite: Loaded {len(models)} models from {models_dir}")
111-
112-
# Store loaded models
124+
model_ops = self.get_model_ops(models)
125+
filter = list(model_ops)
126+
# Store loaded models for evaluation
113127
self.models = models
114-
self.name = name
128+
129+
self._initialize_torchbench_suite(name, None, filter, topn, False)
130+
131+
def get_model_ops(self, models: List[Dict[str, Any]]) -> List[str]:
132+
# Extract operators from model configs
133+
model_ops = set()
134+
for model in models:
135+
config_ops = model["config"]["ops"]
136+
ops_list = config_ops["forward"]
137+
ops_list.extend(config_ops["backward"])
138+
139+
model_ops.update(ops_list)
140+
logger.info(f"Model {model['name']}: {len(ops_list)} operators defined in config")
141+
142+
logger.info(f"ModelSuite: Total {len(model_ops)} unique operators across all models")
143+
return model_ops
115144

116145
def eval_model(self, model_dict: Dict[str, Any], backend) -> Dict[str, Any]:
117146
"""Run evaluation on a single model.

BackendBench/suite/torchbench.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,13 @@ def __init__(
7878
filter=None,
7979
topn=None,
8080
check_overhead_dominated_ops=False,
81+
):
82+
self._initialize_torchbench_suite(
83+
name, filename, filter, topn, check_overhead_dominated_ops
84+
)
85+
86+
def _initialize_torchbench_suite(
87+
self, name, filename, filter, topn, check_overhead_dominated_ops
8188
):
8289
self.name = name
8390
self.topn = topn
@@ -87,6 +94,7 @@ def __init__(
8794
format="auto", # Auto-detect based on file extension
8895
filter=filter,
8996
)
97+
9098
if check_overhead_dominated_ops:
9199
# Only include ops which are overhead dominated (this is useful as a performance canary)
92100
ops_list = [op for op in ops_list if op.get("is_overhead_dominated_op", False)]

0 commit comments

Comments
 (0)