
Throughput experiment #2

Open
dianacord wants to merge 3 commits into main from throughput_experiment

Conversation

@dianacord (Collaborator) commented Mar 27, 2026

Summary by CodeRabbit

  • New Features
    • Added a throughput-saturation benchmark to measure performance across varying agent counts.
    • Automatically extracts per-invocation timing, throughput, percentiles, and cache-hit summaries.
    • Produces four plots: throughput vs agents, overhead breakdown, overhead distribution, and overhead fraction.
    • Integrated the experiment into the benchmark runner for automated execution and incremental result storage.
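The timing extraction and summary statistics described above can be sketched as a small aggregation helper. This is an illustrative assumption, not the experiment's actual schema: the `summarize` function and the `t_start`/`t_end` field names are made up for the example.

```python
import statistics

def summarize(invocations):
    """Aggregate per-invocation timings into makespan, throughput, and percentiles.

    Each invocation is a dict with hypothetical t_start/t_end timestamps (seconds).
    """
    durations = sorted(inv["t_end"] - inv["t_start"] for inv in invocations)
    start = min(inv["t_start"] for inv in invocations)
    end = max(inv["t_end"] for inv in invocations)
    makespan = end - start
    # Throughput = completed invocations divided by total wall-clock span.
    throughput = len(invocations) / makespan if makespan > 0 else 0.0
    q = statistics.quantiles(durations, n=100)  # needs at least 2 invocations
    return {
        "throughput": throughput,
        "mean_latency": statistics.mean(durations),
        "p50": q[49],
        "p95": q[94],
    }
```

With two invocations spanning 0–1 s and 0–2 s, the makespan is 2 s and the throughput is 1 invocation/s.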

@dianacord dianacord requested a review from javidsegura March 27, 2026 11:39
@coderabbitai

coderabbitai bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf218a9d-98ff-494d-81a5-53e9434977b6

📥 Commits

Reviewing files that changed from the base of the PR and between d27c689 and b6197c6.

📒 Files selected for processing (1)
  • benchmark/data_generation/run_experiments.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • benchmark/data_generation/run_experiments.py

📝 Walkthrough

Adds a new throughput-saturation benchmark: experiment orchestration via LangraphWorkload, per-invocation metric extraction from event logs, incremental result persistence, and matplotlib plot generation for throughput and overhead summaries across agent sweeps.

Changes

Core Experiment Implementation
benchmark/data_generation/experiments/throughput_saturation/main.py
  New ThroughputSaturation class (extends BaseExperiment) and an extract_invocation_metrics(events) function. Loads the sweep config, runs async workloads (LangraphWorkload), computes per-invocation timings (resolve/backend/collect/overhead/end-to-end), makespan/throughput, percentiles/means, and cache-hit counts, and persists incremental results.

Plotting Utilities
benchmark/data_generation/experiments/throughput_saturation/utils/plots.py
  New ThroughputSaturationPlotter (subclass of BasePlotter) producing four matplotlib figures: throughput vs agents (log scale), grouped resolve-vs-collect boxplots, an overhead boxplot, and overhead fraction vs agents (log scale). Validates required box-stat fields and saves figures to plots_dir.

Integration / Registration
benchmark/data_generation/run_experiments.py
  Imports and registers the ThroughputSaturation experiment under the harness key "throughput_saturation", alongside the existing experiments.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Benchmark Runner
    participant Exp as ThroughputSaturation
    participant Config as Config Loader
    participant Workload as LangraphWorkload
    participant Metrics as Metric Extractor
    participant Storage as Data Storage
    participant Plotter as ThroughputSaturationPlotter

    Runner->>Exp: instantiate(config, data_dir, plots_dir)
    Exp->>Config: _load_experiment_config()
    Config-->>Exp: sweep params (n_agents, calls_per_agent, slots, duration)

    loop for each n_agents in sweep
        Exp->>Workload: run_workload(WorkloadConfig)
        Workload-->>Exp: workload_result (events, total_makespan)
        Exp->>Metrics: extract_invocation_metrics(events)
        Metrics-->>Exp: per-invocation timings, throughput, percentiles, cache-hits
        Exp->>Storage: store_data_to_disk(record)
        Storage-->>Exp: persisted
    end

    Runner->>Exp: generate_plots(data)
    Exp->>Plotter: plot_results(data["throughput_saturation"])
    Plotter-->>Runner: saved figures (4)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • mturilli

Poem

🐰
I hopped through logs at break of dawn,
Timings counted, makespans drawn,
Agents scaled and throughputs found,
Four bright plots now hop around,
A carrot-shaped benchmark — metrics on! 🥕

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Throughput experiment' is vague and generic. While it broadly relates to the throughput-saturation experiment being added, it lacks specificity about what is being introduced or changed. Consider a more descriptive title such as 'Add throughput-saturation benchmark experiment' or 'Implement ThroughputSaturation workload with metrics and plots'.

✅ Passed checks (1 passed)

  • Description Check: Check skipped because CodeRabbit's high-level summary is enabled.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces the ThroughputSaturation experiment to measure coordination throughput and overhead under concurrent agent loads. The implementation includes a new experiment class, metric extraction utilities, and a plotter for performance visualization. Review feedback focuses on improving the robustness of data validation in the plotter to prevent potential KeyError exceptions, removing an unused variable, refactoring repetitive plotting boilerplate, and adhering to PEP 8 standards by moving an inline import to the top level.


def _plot_overhead_breakdown(self, records: List[Dict[str, Any]]) -> None:
    """Boxplot of D_resolve and D_collect side-by-side per n_agents."""
    has_box = all(r.get("d_resolve_box") for r in records)

Severity: high

The check for box-plot data is incomplete. It only checks for d_resolve_box, but the function also uses d_collect_box on line 96. If d_collect_box is missing from a record where d_resolve_box is present, it will raise a KeyError. The check should cover both fields.

Suggested change:
has_box = all(r.get("d_resolve_box") and r.get("d_collect_box") for r in records)

d_total_list = []
invoke_starts = []
collect_ends = []
cache_hits = []

Severity: medium

The cache_hits list is initialized but never used within the function. It can be removed to improve code clarity.

Comment on lines +55 to +80:

def _plot_throughput(self, records: List[Dict[str, Any]]) -> None:
    """Throughput (inv/s) vs n_agents."""
    fig, ax = plt.subplots(figsize=(10, 6))

    n_agents = [r["n_agents"] for r in records]
    throughputs = [r.get("throughput", 0) for r in records]

    ax.plot(n_agents, throughputs, marker="o", linewidth=2, markersize=8, color="steelblue")

    ax.set_xscale("log", base=2)
    ax.set_xlabel("Number of Agents (concurrent load)", fontsize=13)
    ax.set_ylabel("Throughput (invocations/s)", fontsize=13)
    ax.set_title(
        "FlowGentic Coordination Throughput vs Concurrent Agents\n"
        "(noop tools — D_backend ≈ 0 for FlowGentic isolation)",
        fontsize=13,
    )
    ax.grid(True, alpha=0.3)
    ax.set_ylim(bottom=0)

    plt.tight_layout()
    if self.plots_dir:
        path = self.plots_dir / "throughput_vs_agents.png"
        fig.savefig(path, dpi=150, bbox_inches="tight")
        logger.info(f"Saved: {path}")
    plt.close(fig)

Severity: medium

There is significant boilerplate for creating, saving, and closing figures, repeated in each plotting method (_plot_throughput, _plot_overhead_breakdown, etc.). This can be refactored to improve maintainability and reduce code duplication.

Consider extracting this logic into a helper method or a context manager. For example:

from contextlib import contextmanager

@contextmanager
def _managed_plot(self, filename: str, **subplots_kwargs):
    fig, ax = plt.subplots(**subplots_kwargs)
    try:
        yield fig, ax
    finally:
        plt.tight_layout()
        if self.plots_dir:
            path = self.plots_dir / filename
            fig.savefig(path, dpi=150, bbox_inches="tight")
            logger.info(f"Saved: {path}")
        plt.close(fig)

This would make the plotting methods much cleaner.
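To show the suggested context manager in use, here is a minimal self-contained sketch. The Plotter class and its plot_throughput method are illustrative stand-ins, not the real ThroughputSaturationPlotter.

```python
import logging
from contextlib import contextmanager
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

logger = logging.getLogger(__name__)

class Plotter:
    """Stand-in plotter demonstrating the save/close boilerplate extraction."""

    def __init__(self, plots_dir=None):
        self.plots_dir = Path(plots_dir) if plots_dir else None

    @contextmanager
    def _managed_plot(self, filename, **subplots_kwargs):
        fig, ax = plt.subplots(**subplots_kwargs)
        try:
            yield fig, ax
        finally:
            # Layout, save, and close run once, no matter how the body exits.
            plt.tight_layout()
            if self.plots_dir:
                path = self.plots_dir / filename
                fig.savefig(path, dpi=150, bbox_inches="tight")
                logger.info(f"Saved: {path}")
            plt.close(fig)

    def plot_throughput(self, n_agents, throughputs):
        with self._managed_plot("throughput_vs_agents.png", figsize=(10, 6)) as (_fig, ax):
            ax.plot(n_agents, throughputs, marker="o")
            ax.set_xscale("log", base=2)
            ax.set_ylim(bottom=0)
```

Each plotting method then shrinks to just the axes-specific calls inside the `with` block.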

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/data_generation/experiments/throughput_saturation/main.py`:
- Around line 153-156: The experiment is reopening a hard-coded
Path("config.yml") instead of using the already-resolved benchmark config;
update the code in ThroughputSaturationExperiment (and the
_load_experiment_config logic) to use the existing self.benchmark_config (or
accept and use a resolved config_path passed down from
FlowGenticBenchmarkManager) when reading sweep values and settings rather than
opening Path("config.yml"); ensure any logic at the block referenced (lines
~158-167) sources values from the benchmark_config object or the passed
config_path so custom config files and working directories remain consistent
across the benchmark run.
- Around line 58-66: The current check uses truthiness and will drop valid zero
timestamps; update the completeness check to use explicit None comparisons:
replace the if not all([ts_invoke_start, ts_resolve_end, ts_collect_start,
ts_collect_end]): continue with a check such as if not all(v is not None for v
in (ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end)):
continue (or equivalently if any(v is None for v in (...)): continue) so by_id
loop keeps invocations with timestamp 0.0.

In `@benchmark/data_generation/experiments/throughput_saturation/utils/plots.py`:
- Around line 84-87: The current code treats any missing per-record fields as
global absence (has_box check on records using "d_resolve_box"), which causes
sparse sweep points to produce zero/undefined values and disables entire
boxplots; update the plotting functions (the block using has_box and the
functions that call _plot_overhead_fraction) to filter records per-plot by
checking for the specific fields each plot needs (e.g., require "d_resolve_box"
for the boxplot, require the particular overhead fields used by
_plot_overhead_fraction) and pass only those filtered records into the plotting
logic; apply the same per-plot presence checks and filtering where similar
existence checks occur (the other occurrences around lines noted) so sparse runs
are skipped for a plot instead of being treated as zero-valued.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7efd4827-99dc-4f87-843a-ae2d0860c828

📥 Commits

Reviewing files that changed from the base of the PR and between ba6481f and d27c689.

📒 Files selected for processing (3)
  • benchmark/data_generation/experiments/throughput_saturation/main.py
  • benchmark/data_generation/experiments/throughput_saturation/utils/plots.py
  • benchmark/data_generation/run_experiments.py

Comment on lines +58 to +66:

for inv_id, ts in by_id.items():
    ts_invoke_start = ts.get("tool_invoke_start")
    ts_resolve_end = ts.get("tool_resolve_end")
    ts_collect_start = ts.get("tool_collect_start")
    ts_collect_end = ts.get("tool_invoke_end")  # tool_invoke_end == Ts_collect_end

    # Only include complete invocations
    if not all([ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end]):
        continue

⚠️ Potential issue | 🟠 Major

Use explicit None checks for timestamp presence.

A valid timestamp of 0.0 is falsy, so normalized traces can drop complete invocations here and undercount both throughput and latency metrics. Compare against None instead of relying on truthiness.

🛠️ Suggested fix
-		if not all([ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end]):
+		if any(
+			ts is None
+			for ts in (
+				ts_invoke_start,
+				ts_resolve_end,
+				ts_collect_start,
+				ts_collect_end,
+			)
+		):
 			continue
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 58-58: Loop control variable inv_id not used within loop body

(B007)


Comment on lines +153 to +156
self.benchmark_config = benchmark_config
self.plotter = ThroughputSaturationPlotter(plots_dir=plots_dir)
self.results: Dict[str, Any] = {}
self._load_experiment_config()

⚠️ Potential issue | 🟠 Major

Don't bypass the already-loaded benchmark config.

FlowGenticBenchmarkManager already builds benchmark_config from its configured path, but this experiment then reopens Path("config.yml") directly. That makes custom config files and different working directories diverge from the rest of the benchmark run. Please pass the resolved config path down, or source these sweep values from the already-loaded configuration object instead.

Also applies to: 158-167

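A hedged sketch of that suggestion: the experiment takes the harness's already-loaded config object instead of reopening Path("config.yml"). The config keys and defaults below are assumptions for illustration, not the project's actual schema.

```python
class ThroughputSaturation:
    """Sketch: sweep values sourced from the config the harness already loaded."""

    def __init__(self, benchmark_config):
        # benchmark_config is the dict FlowGenticBenchmarkManager resolved;
        # no second read of a hard-coded config.yml path.
        self.benchmark_config = benchmark_config
        self._load_experiment_config()

    def _load_experiment_config(self):
        # Hypothetical keys; defaults stand in for whatever the real schema defines.
        exp = self.benchmark_config.get("experiments", {}).get("throughput_saturation", {})
        self.n_agents_sweep = exp.get("n_agents", [1, 2, 4, 8])
        self.calls_per_agent = exp.get("calls_per_agent", 1)
```

This keeps custom config files and working directories consistent with the rest of the benchmark run.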

Comment on lines +84 to +87:

has_box = all(r.get("d_resolve_box") for r in records)
if not has_box:
    logger.warning("No box stats found — re-run experiment to generate boxplots.")
    return

⚠️ Potential issue | 🟠 Major

Skip sparse runs per plot instead of turning them into missing/zero-valued overhead plots.

extract_invocation_metrics() in benchmark/data_generation/experiments/throughput_saturation/main.py can emit records with only n_completions when a sweep point has no complete invocations. In that case, one bad point disables both boxplot figures here, and _plot_overhead_fraction() renders an undefined metric as 0%. Filter each plot to the records that actually contain the fields it needs.

Also applies to: 133-136, 171-172

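One possible shape for the per-plot filtering suggested above; the records_with helper and the example records are hypothetical, with field names mirroring those mentioned in this review.

```python
def records_with(records, *fields):
    """Keep only records that carry every field the current plot needs."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]

records = [
    {"n_agents": 2, "d_resolve_box": {"med": 0.8}, "d_collect_box": {"med": 0.3}},
    {"n_agents": 64, "n_completions": 0},  # sparse sweep point: no timing fields
]

# Each plot filters to the fields it actually uses, so one sparse point
# skips that point only, rather than disabling the whole figure.
boxable = records_with(records, "d_resolve_box", "d_collect_box")
```

Here `boxable` keeps only the n_agents=2 record; the sparse n_agents=64 point is dropped for the boxplot instead of being rendered as zero.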
