This document summarizes the benchmarking approach for the MIND compiler and runtime. The goal is to track regressions and communicate baseline performance characteristics.
- Target Hardware – Default runs target x86-64 AVX2 hosts with 32 GB RAM; GPU runs target CUDA-capable cards when `mlir-exec` is enabled.
- Datasets – Synthetic workloads (matrix multiplications, convolutions) plus representative ML kernels sourced from `benchmarks/`.
- Execution Modes – Interpreter (`cpu-exec`), ahead-of-time MLIR (`mlir-build`), and JIT (`mlir-exec`).
- Warmup & Repetitions – Each benchmark performs 3 warmup runs followed by 10 measured iterations; results report the median and 95th percentile.
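The warmup-and-measure protocol above can be sketched as a small harness. This is an illustrative sketch, not the project's actual harness; the `bench` function and its signature are hypothetical.

```rust
use std::time::Instant;

// Hypothetical harness sketch: run `warmup` discarded iterations, then
// `iters` measured iterations, and report (median, p95) in milliseconds.
fn bench<F: FnMut()>(mut work: F, warmup: usize, iters: usize) -> (f64, f64) {
    for _ in 0..warmup {
        work(); // warmup runs execute but are not recorded
    }
    let mut samples: Vec<f64> = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        work();
        samples.push(start.elapsed().as_secs_f64() * 1e3); // ms
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = samples[samples.len() / 2];
    let p95 = samples[((samples.len() as f64) * 0.95).ceil() as usize - 1];
    (median, p95)
}

fn main() {
    // A trivial workload stands in for a compiled MIND kernel.
    let (median_ms, p95_ms) =
        bench(|| { std::hint::black_box((0..1_000).sum::<u64>()); }, 3, 10);
    println!("median = {:.3} ms, p95 = {:.3} ms", median_ms, p95_ms);
}
```

With 10 measured iterations, the p95 reported here is simply the slowest sample; `black_box` keeps the optimizer from eliding the workload.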
The following baselines were collected on reference hardware (Intel Core i7-5930K @ 3.50 GHz, 64 GB DDR4, RTX 3080 10 GB, Ubuntu 24.04 LTS) using Rust 1.84 stable.
| Operation | Input Size | Time (median) | Memory |
|---|---|---|---|
| Parse | 1K LOC | 2.1 ms | 12 MB |
| Parse | 10K LOC | 18 ms | 45 MB |
| Type check | 1K LOC | 4.3 ms | 18 MB |
| Type check | 10K LOC | 38 ms | 85 MB |
| IR lower | 1K LOC | 1.8 ms | 8 MB |
| IR lower | 10K LOC | 15 ms | 42 MB |
| MLIR emit | 1K ops | 3.2 ms | 15 MB |
| MLIR emit | 10K ops | 28 ms | 95 MB |
Broadcast overhead by tensor rank and number of broadcast dimensions:

| Tensor Rank | Broadcast Dims | Time |
|---|---|---|
| 2D | 0 | 0.8 μs |
| 2D | 2 | 1.2 μs |
| 4D | 0 | 1.5 μs |
| 4D | 4 | 2.8 μs |
| 8D | 4 | 5.1 μs |
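To make the "Broadcast Dims" column concrete, the shape-compatibility rule can be sketched with NumPy-style broadcasting: shapes align from the trailing dimension, and size-1 dimensions stretch to match. This is a generic illustration, not the runtime's actual API.

```rust
// Compute the broadcast result shape of two tensor shapes, or None if
// they are incompatible. Illustrative only; names are hypothetical.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        // Align from the trailing dimension; missing dims act as size 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // incompatible shapes
        }
    }
    out.reverse();
    Some(out)
}

fn main() {
    // A 4D shape with 2 broadcast (size-1) dimensions:
    assert_eq!(broadcast_shape(&[8, 1, 4, 1], &[8, 3, 4, 5]),
               Some(vec![8, 3, 4, 5]));
    // Incompatible leading dimensions:
    assert_eq!(broadcast_shape(&[2, 3], &[4, 3]), None);
    println!("ok");
}
```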
Gradient generation time by forward-graph size:

| Function Complexity | Forward Ops | Gradient Generation Time |
|---|---|---|
| Simple (add/mul) | 10 | 0.4 ms |
| Medium (matmul chain) | 100 | 3.2 ms |
| Complex (conv + reduce) | 1000 | 28 ms |
Test-suite runtimes:

| Category | Test Count | Total Time |
|---|---|---|
| Unit tests | 80 | ~0.2 s |
| Integration tests | 89 | ~0.5 s |
| Full suite | 169+ | ~1 s |
Each benchmark run records the following metrics:

| Metric | Description |
|---|---|
| Latency | Execution time per run (ms) |
| Throughput | Ops or samples per second |
| Memory usage | Peak RSS collected via procfs helpers |
| Compile time | IR → MLIR → executable duration |
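On Linux, peak RSS of the kind the procfs helpers collect can be read from the process's own status file. This sketch assumes a Linux host and reads the `VmHWM` (high-water mark) line; the runtime's actual helpers may differ.

```rust
use std::fs;

// Read peak resident set size (KiB) from /proc/self/status.
// Returns None on non-Linux hosts or if the field is absent.
fn peak_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|l| l.starts_with("VmHWM:"))       // e.g. "VmHWM:  12345 kB"
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

fn main() {
    match peak_rss_kib() {
        Some(kib) => println!("peak RSS: {} KiB", kib),
        None => println!("peak RSS unavailable (non-Linux host?)"),
    }
}
```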
Results are exported as JSON into `benchmarks/results/*.json` and visualized with the CLI (`mind bench report`).
Continuous integration runs a smoke subset on every pull request. Nightly jobs execute the full suite and compare against the rolling baseline stored in `benchmarks/baselines/`.
When a regression exceeds the configured thresholds:
- CI marks the run unstable and attaches artifacts.
- Engineers inspect IR/MLIR dumps to identify passes responsible for the change.
- A follow-up issue documents the root cause and mitigation plan.
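The baseline comparison that triggers this process can be sketched as a relative-threshold check. The 10% threshold below is an example value, not the project's configured one.

```rust
// A run regresses when its median exceeds the stored baseline median
// by more than the relative threshold. Illustrative sketch only.
fn is_regression(baseline_ms: f64, current_ms: f64, threshold: f64) -> bool {
    current_ms > baseline_ms * (1.0 + threshold)
}

fn main() {
    // e.g. baseline 18 ms, current 20 ms, 10% threshold: 20 > 19.8 → regression
    assert!(is_regression(18.0, 20.0, 0.10));
    // 19 ms stays within the 10% budget
    assert!(!is_regression(18.0, 19.0, 0.10));
    println!("ok");
}
```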
Planned next steps:

- GPU benchmark coverage for the runtime plugin API
- Automated comparison against PyTorch/XLA baselines
- Visualization dashboards for long-term trends
See the roadmap for scheduling details.