
Throughput experiment #2

Open
dianacord wants to merge 3 commits into main from throughput_experiment

Conversation

@dianacord (Collaborator) commented Mar 27, 2026

Summary by CodeRabbit

  • New Features
    • Added a throughput-saturation benchmark to measure performance across varying agent counts.
    • Automatically extracts per-invocation timing, throughput, percentiles, and cache-hit summaries.
    • Produces four plots: throughput vs agents, overhead breakdown, overhead distribution, and overhead fraction.
    • Integrated the experiment into the benchmark runner for automated execution and incremental result storage.
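The timing extraction and summary statistics described above can be sketched as a small aggregation helper. This is an illustrative assumption, not the experiment's actual schema: the `summarize` function and the `t_start`/`t_end` field names are made up for the example.

```python
import statistics

def summarize(invocations):
    """Aggregate per-invocation timings into makespan, throughput, and percentiles.

    Each invocation is a dict with hypothetical t_start/t_end timestamps (seconds).
    """
    durations = sorted(inv["t_end"] - inv["t_start"] for inv in invocations)
    start = min(inv["t_start"] for inv in invocations)
    end = max(inv["t_end"] for inv in invocations)
    makespan = end - start
    # Throughput = completed invocations divided by total wall-clock span.
    throughput = len(invocations) / makespan if makespan > 0 else 0.0
    q = statistics.quantiles(durations, n=100)  # needs at least 2 invocations
    return {
        "throughput": throughput,
        "mean_latency": statistics.mean(durations),
        "p50": q[49],
        "p95": q[94],
    }
```

With two invocations spanning 0–1 s and 0–2 s, the makespan is 2 s and the throughput is 1 invocation/s.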

@dianacord dianacord requested a review from javidsegura March 27, 2026 11:39
@coderabbitai

coderabbitai bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf218a9d-98ff-494d-81a5-53e9434977b6

📥 Commits

Reviewing files that changed from the base of the PR and between d27c689 and b6197c6.

📒 Files selected for processing (1)
  • benchmark/data_generation/run_experiments.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • benchmark/data_generation/run_experiments.py

📝 Walkthrough

Adds a new throughput-saturation benchmark: experiment orchestration via LangraphWorkload, per-invocation metric extraction from event logs, incremental result persistence, and matplotlib plot generation for throughput and overhead summaries across agent sweeps.

Changes

Core Experiment Implementation
benchmark/data_generation/experiments/throughput_saturation/main.py
  New ThroughputSaturation class (extends BaseExperiment) and an extract_invocation_metrics(events) function. Loads the sweep config, runs async workloads (LangraphWorkload), computes per-invocation timings (resolve/backend/collect/overhead/end-to-end), makespan/throughput, percentiles/means, and cache-hit counts, and persists incremental results.

Plotting Utilities
benchmark/data_generation/experiments/throughput_saturation/utils/plots.py
  New ThroughputSaturationPlotter (subclass of BasePlotter) producing four matplotlib figures: throughput vs agents (log scale), grouped resolve-vs-collect boxplots, an overhead boxplot, and overhead fraction vs agents (log scale). Validates required box-stat fields and saves figures to plots_dir.

Integration / Registration
benchmark/data_generation/run_experiments.py
  Imports and registers the ThroughputSaturation experiment under the harness key "throughput_saturation", alongside the existing experiments.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Benchmark Runner
    participant Exp as ThroughputSaturation
    participant Config as Config Loader
    participant Workload as LangraphWorkload
    participant Metrics as Metric Extractor
    participant Storage as Data Storage
    participant Plotter as ThroughputSaturationPlotter

    Runner->>Exp: instantiate(config, data_dir, plots_dir)
    Exp->>Config: _load_experiment_config()
    Config-->>Exp: sweep params (n_agents, calls_per_agent, slots, duration)

    loop for each n_agents in sweep
        Exp->>Workload: run_workload(WorkloadConfig)
        Workload-->>Exp: workload_result (events, total_makespan)
        Exp->>Metrics: extract_invocation_metrics(events)
        Metrics-->>Exp: per-invocation timings, throughput, percentiles, cache-hits
        Exp->>Storage: store_data_to_disk(record)
        Storage-->>Exp: persisted
    end

    Runner->>Exp: generate_plots(data)
    Exp->>Plotter: plot_results(data["throughput_saturation"])
    Plotter-->>Runner: saved figures (4)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • mturilli

Poem

🐰
I hopped through logs at break of dawn,
Timings counted, makespans drawn,
Agents scaled and throughputs found,
Four bright plots now hop around,
A carrot-shaped benchmark — metrics on! 🥕

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Throughput experiment' is vague and generic. While it broadly relates to the throughput-saturation experiment being added, it lacks specificity about what is being introduced or changed. Consider a more descriptive title such as 'Add throughput-saturation benchmark experiment' or 'Implement ThroughputSaturation workload with metrics and plots'.

✅ Passed checks (1 passed)

  • Description Check: Check skipped because CodeRabbit's high-level summary is enabled.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces the ThroughputSaturation experiment to measure coordination throughput and overhead under concurrent agent loads. The implementation includes a new experiment class, metric extraction utilities, and a plotter for performance visualization. Review feedback focuses on improving the robustness of data validation in the plotter to prevent potential KeyError exceptions, removing an unused variable, refactoring repetitive plotting boilerplate, and adhering to PEP 8 standards by moving an inline import to the top level.


def _plot_overhead_breakdown(self, records: List[Dict[str, Any]]) -> None:
    """Boxplot of D_resolve and D_collect side-by-side per n_agents."""
    has_box = all(r.get("d_resolve_box") for r in records)

Severity: high

The check for box-plot data is incomplete. It only checks for d_resolve_box, but the function also uses d_collect_box on line 96. If d_collect_box is missing from a record where d_resolve_box is present, it will raise a KeyError. The check should cover both fields.

Suggested change:
has_box = all(r.get("d_resolve_box") and r.get("d_collect_box") for r in records)

d_total_list = []
invoke_starts = []
collect_ends = []
cache_hits = []

Severity: medium

The cache_hits list is initialized but never used within the function. It can be removed to improve code clarity.

Comment on lines +55 to +80:

def _plot_throughput(self, records: List[Dict[str, Any]]) -> None:
    """Throughput (inv/s) vs n_agents."""
    fig, ax = plt.subplots(figsize=(10, 6))

    n_agents = [r["n_agents"] for r in records]
    throughputs = [r.get("throughput", 0) for r in records]

    ax.plot(n_agents, throughputs, marker="o", linewidth=2, markersize=8, color="steelblue")

    ax.set_xscale("log", base=2)
    ax.set_xlabel("Number of Agents (concurrent load)", fontsize=13)
    ax.set_ylabel("Throughput (invocations/s)", fontsize=13)
    ax.set_title(
        "FlowGentic Coordination Throughput vs Concurrent Agents\n"
        "(noop tools — D_backend ≈ 0 for FlowGentic isolation)",
        fontsize=13,
    )
    ax.grid(True, alpha=0.3)
    ax.set_ylim(bottom=0)

    plt.tight_layout()
    if self.plots_dir:
        path = self.plots_dir / "throughput_vs_agents.png"
        fig.savefig(path, dpi=150, bbox_inches="tight")
        logger.info(f"Saved: {path}")
    plt.close(fig)

Severity: medium

There is significant boilerplate for creating, saving, and closing figures, repeated in each plotting method (_plot_throughput, _plot_overhead_breakdown, etc.). This can be refactored to improve maintainability and reduce code duplication.

Consider extracting this logic into a helper method or a context manager. For example:

from contextlib import contextmanager

@contextmanager
def _managed_plot(self, filename: str, **subplots_kwargs):
    fig, ax = plt.subplots(**subplots_kwargs)
    try:
        yield fig, ax
    finally:
        plt.tight_layout()
        if self.plots_dir:
            path = self.plots_dir / filename
            fig.savefig(path, dpi=150, bbox_inches="tight")
            logger.info(f"Saved: {path}")
        plt.close(fig)

This would make the plotting methods much cleaner.
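To show the suggested context manager in use, here is a minimal self-contained sketch. The Plotter class and its plot_throughput method are illustrative stand-ins, not the real ThroughputSaturationPlotter.

```python
import logging
from contextlib import contextmanager
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

logger = logging.getLogger(__name__)

class Plotter:
    """Stand-in plotter demonstrating the save/close boilerplate extraction."""

    def __init__(self, plots_dir=None):
        self.plots_dir = Path(plots_dir) if plots_dir else None

    @contextmanager
    def _managed_plot(self, filename, **subplots_kwargs):
        fig, ax = plt.subplots(**subplots_kwargs)
        try:
            yield fig, ax
        finally:
            # Layout, save, and close run once, no matter how the body exits.
            plt.tight_layout()
            if self.plots_dir:
                path = self.plots_dir / filename
                fig.savefig(path, dpi=150, bbox_inches="tight")
                logger.info(f"Saved: {path}")
            plt.close(fig)

    def plot_throughput(self, n_agents, throughputs):
        with self._managed_plot("throughput_vs_agents.png", figsize=(10, 6)) as (_fig, ax):
            ax.plot(n_agents, throughputs, marker="o")
            ax.set_xscale("log", base=2)
            ax.set_ylim(bottom=0)
```

Each plotting method then shrinks to just the axes-specific calls inside the `with` block.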

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/data_generation/experiments/throughput_saturation/main.py`:
- Around line 153-156: The experiment is reopening a hard-coded
Path("config.yml") instead of using the already-resolved benchmark config;
update the code in ThroughputSaturationExperiment (and the
_load_experiment_config logic) to use the existing self.benchmark_config (or
accept and use a resolved config_path passed down from
FlowGenticBenchmarkManager) when reading sweep values and settings rather than
opening Path("config.yml"); ensure any logic at the block referenced (lines
~158-167) sources values from the benchmark_config object or the passed
config_path so custom config files and working directories remain consistent
across the benchmark run.
- Around line 58-66: The current check uses truthiness and will drop valid zero
timestamps; update the completeness check to use explicit None comparisons:
replace the if not all([ts_invoke_start, ts_resolve_end, ts_collect_start,
ts_collect_end]): continue with a check such as if not all(v is not None for v
in (ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end)):
continue (or equivalently if any(v is None for v in (...)): continue) so by_id
loop keeps invocations with timestamp 0.0.

In `@benchmark/data_generation/experiments/throughput_saturation/utils/plots.py`:
- Around line 84-87: The current code treats any missing per-record fields as
global absence (has_box check on records using "d_resolve_box"), which causes
sparse sweep points to produce zero/undefined values and disables entire
boxplots; update the plotting functions (the block using has_box and the
functions that call _plot_overhead_fraction) to filter records per-plot by
checking for the specific fields each plot needs (e.g., require "d_resolve_box"
for the boxplot, require the particular overhead fields used by
_plot_overhead_fraction) and pass only those filtered records into the plotting
logic; apply the same per-plot presence checks and filtering where similar
existence checks occur (the other occurrences around lines noted) so sparse runs
are skipped for a plot instead of being treated as zero-valued.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7efd4827-99dc-4f87-843a-ae2d0860c828

📥 Commits

Reviewing files that changed from the base of the PR and between ba6481f and d27c689.

📒 Files selected for processing (3)
  • benchmark/data_generation/experiments/throughput_saturation/main.py
  • benchmark/data_generation/experiments/throughput_saturation/utils/plots.py
  • benchmark/data_generation/run_experiments.py

Comment on lines +58 to +66:

for inv_id, ts in by_id.items():
    ts_invoke_start = ts.get("tool_invoke_start")
    ts_resolve_end = ts.get("tool_resolve_end")
    ts_collect_start = ts.get("tool_collect_start")
    ts_collect_end = ts.get("tool_invoke_end")  # tool_invoke_end == Ts_collect_end

    # Only include complete invocations
    if not all([ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end]):
        continue

⚠️ Potential issue | 🟠 Major

Use explicit None checks for timestamp presence.

A valid timestamp of 0.0 is falsy, so normalized traces can drop complete invocations here and undercount both throughput and latency metrics. Compare against None instead of relying on truthiness.

🛠️ Suggested fix
-		if not all([ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end]):
+		if any(
+			ts is None
+			for ts in (
+				ts_invoke_start,
+				ts_resolve_end,
+				ts_collect_start,
+				ts_collect_end,
+			)
+		):
 			continue
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 58-58: Loop control variable inv_id not used within loop body

(B007)


Comment on lines +153 to +156
self.benchmark_config = benchmark_config
self.plotter = ThroughputSaturationPlotter(plots_dir=plots_dir)
self.results: Dict[str, Any] = {}
self._load_experiment_config()

⚠️ Potential issue | 🟠 Major

Don't bypass the already-loaded benchmark config.

FlowGenticBenchmarkManager already builds benchmark_config from its configured path, but this experiment then reopens Path("config.yml") directly. That makes custom config files and different working directories diverge from the rest of the benchmark run. Please pass the resolved config path down, or source these sweep values from the already-loaded configuration object instead.

Also applies to: 158-167

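A hedged sketch of that suggestion: the experiment takes the harness's already-loaded config object instead of reopening Path("config.yml"). The config keys and defaults below are assumptions for illustration, not the project's actual schema.

```python
class ThroughputSaturation:
    """Sketch: sweep values sourced from the config the harness already loaded."""

    def __init__(self, benchmark_config):
        # benchmark_config is the dict FlowGenticBenchmarkManager resolved;
        # no second read of a hard-coded config.yml path.
        self.benchmark_config = benchmark_config
        self._load_experiment_config()

    def _load_experiment_config(self):
        # Hypothetical keys; defaults stand in for whatever the real schema defines.
        exp = self.benchmark_config.get("experiments", {}).get("throughput_saturation", {})
        self.n_agents_sweep = exp.get("n_agents", [1, 2, 4, 8])
        self.calls_per_agent = exp.get("calls_per_agent", 1)
```

This keeps custom config files and working directories consistent with the rest of the benchmark run.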

Comment on lines +84 to +87:

has_box = all(r.get("d_resolve_box") for r in records)
if not has_box:
    logger.warning("No box stats found — re-run experiment to generate boxplots.")
    return

⚠️ Potential issue | 🟠 Major

Skip sparse runs per plot instead of turning them into missing/zero-valued overhead plots.

extract_invocation_metrics() in benchmark/data_generation/experiments/throughput_saturation/main.py can emit records with only n_completions when a sweep point has no complete invocations. In that case, one bad point disables both boxplot figures here, and _plot_overhead_fraction() renders an undefined metric as 0%. Filter each plot to the records that actually contain the fields it needs.

Also applies to: 133-136, 171-172

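One possible shape for the per-plot filtering suggested above; the records_with helper and the example records are hypothetical, with field names mirroring those mentioned in this review.

```python
def records_with(records, *fields):
    """Keep only records that carry every field the current plot needs."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]

records = [
    {"n_agents": 2, "d_resolve_box": {"med": 0.8}, "d_collect_box": {"med": 0.3}},
    {"n_agents": 64, "n_completions": 0},  # sparse sweep point: no timing fields
]

# Each plot filters to the fields it actually uses, so one sparse point
# skips that point only, rather than disabling the whole figure.
boxable = records_with(records, "d_resolve_box", "d_collect_box")
```

Here `boxable` keeps only the n_agents=2 record; the sparse n_agents=64 point is dropped for the boxplot instead of being rendered as zero.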
