Conversation
📝 Walkthrough

Adds a new throughput-saturation benchmark: experiment orchestration via LangraphWorkload, per-invocation metric extraction from event logs, incremental result persistence, and matplotlib plot generation for throughput and overhead summaries across agent sweeps.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Benchmark Runner
    participant Exp as ThroughputSaturation
    participant Config as Config Loader
    participant Workload as LangraphWorkload
    participant Metrics as Metric Extractor
    participant Storage as Data Storage
    participant Plotter as ThroughputSaturationPlotter
    Runner->>Exp: instantiate(config, data_dir, plots_dir)
    Exp->>Config: _load_experiment_config()
    Config-->>Exp: sweep params (n_agents, calls_per_agent, slots, duration)
    loop for each n_agents in sweep
        Exp->>Workload: run_workload(WorkloadConfig)
        Workload-->>Exp: workload_result (events, total_makespan)
        Exp->>Metrics: extract_invocation_metrics(events)
        Metrics-->>Exp: per-invocation timings, throughput, percentiles, cache-hits
        Exp->>Storage: store_data_to_disk(record)
        Storage-->>Exp: persisted
    end
    Runner->>Exp: generate_plots(data)
    Exp->>Plotter: plot_results(data["throughput_saturation"])
    Plotter-->>Runner: saved figures (4)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Code Review
This pull request introduces the ThroughputSaturation experiment to measure coordination throughput and overhead under concurrent agent loads. The implementation includes a new experiment class, metric extraction utilities, and a plotter for performance visualization. Review feedback focuses on improving the robustness of data validation in the plotter to prevent potential KeyError exceptions, removing an unused variable, refactoring repetitive plotting boilerplate, and adhering to PEP 8 standards by moving an inline import to the top level.
```python
def _plot_overhead_breakdown(self, records: List[Dict[str, Any]]) -> None:
    """Boxplot of D_resolve and D_collect side-by-side per n_agents."""
    has_box = all(r.get("d_resolve_box") for r in records)
```
The check for box plot data is incomplete. It only checks for d_resolve_box, but the function also uses d_collect_box on line 96. If d_collect_box is missing from a record where d_resolve_box is present, it will raise a KeyError. The check should be made more robust to handle this case.
```diff
-    has_box = all(r.get("d_resolve_box") for r in records)
+    has_box = all(r.get("d_resolve_box") and r.get("d_collect_box") for r in records)
```
```python
d_total_list = []
invoke_starts = []
collect_ends = []
cache_hits = []
```
```python
def _plot_throughput(self, records: List[Dict[str, Any]]) -> None:
    """Throughput (inv/s) vs n_agents."""
    fig, ax = plt.subplots(figsize=(10, 6))

    n_agents = [r["n_agents"] for r in records]
    throughputs = [r.get("throughput", 0) for r in records]

    ax.plot(n_agents, throughputs, marker="o", linewidth=2, markersize=8, color="steelblue")

    ax.set_xscale("log", base=2)
    ax.set_xlabel("Number of Agents (concurrent load)", fontsize=13)
    ax.set_ylabel("Throughput (invocations/s)", fontsize=13)
    ax.set_title(
        "FlowGentic Coordination Throughput vs Concurrent Agents\n"
        "(noop tools — D_backend ≈ 0 for FlowGentic isolation)",
        fontsize=13,
    )
    ax.grid(True, alpha=0.3)
    ax.set_ylim(bottom=0)

    plt.tight_layout()
    if self.plots_dir:
        path = self.plots_dir / "throughput_vs_agents.png"
        fig.savefig(path, dpi=150, bbox_inches="tight")
        logger.info(f"Saved: {path}")
    plt.close(fig)
```
There is significant boilerplate code for creating, saving, and closing figures, which is repeated in each plotting method (_plot_throughput, _plot_overhead_breakdown, etc.). This can be refactored to improve maintainability and reduce code duplication.
Consider extracting this logic into a helper method or a context manager. For example:
```python
from contextlib import contextmanager

@contextmanager
def _managed_plot(self, filename: str, **subplots_kwargs):
    fig, ax = plt.subplots(**subplots_kwargs)
    try:
        yield fig, ax
    finally:
        plt.tight_layout()
        if self.plots_dir:
            path = self.plots_dir / filename
            fig.savefig(path, dpi=150, bbox_inches="tight")
            logger.info(f"Saved: {path}")
        plt.close(fig)
```

This would make the plotting methods much cleaner.
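As a concrete illustration of the suggested pattern, here is a self-contained sketch: the `DemoPlotter` class, its records, and the temp directory are hypothetical stand-ins for illustration only, not the PR's actual classes.

```python
from contextlib import contextmanager
from pathlib import Path
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


class DemoPlotter:
    """Hypothetical stand-in for the PR's plotter class."""

    def __init__(self, plots_dir: Path):
        self.plots_dir = plots_dir

    @contextmanager
    def _managed_plot(self, filename: str, **subplots_kwargs):
        fig, ax = plt.subplots(**subplots_kwargs)
        try:
            yield fig, ax
        finally:
            # Save/close boilerplate lives in one place for every plot method.
            plt.tight_layout()
            if self.plots_dir:
                fig.savefig(self.plots_dir / filename, dpi=150, bbox_inches="tight")
            plt.close(fig)

    def plot_throughput(self, records):
        # The plot method shrinks to pure plotting logic.
        with self._managed_plot("throughput_vs_agents.png", figsize=(10, 6)) as (fig, ax):
            ax.plot(
                [r["n_agents"] for r in records],
                [r.get("throughput", 0) for r in records],
                marker="o",
            )
            ax.set_xscale("log", base=2)


plots_dir = Path(tempfile.mkdtemp())
DemoPlotter(plots_dir).plot_throughput(
    [{"n_agents": 2, "throughput": 10.0}, {"n_agents": 4, "throughput": 18.0}]
)
print((plots_dir / "throughput_vs_agents.png").exists())
```

The `finally` clause guarantees the figure is saved and closed even if a plot method raises mid-draw, which the repeated inline boilerplate does not.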
benchmark/data_generation/experiments/throughput_saturation/utils/plots.py
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7efd4827-99dc-4f87-843a-ae2d0860c828
📒 Files selected for processing (3)
- benchmark/data_generation/experiments/throughput_saturation/main.py
- benchmark/data_generation/experiments/throughput_saturation/utils/plots.py
- benchmark/data_generation/run_experiments.py
```python
for inv_id, ts in by_id.items():
    ts_invoke_start = ts.get("tool_invoke_start")
    ts_resolve_end = ts.get("tool_resolve_end")
    ts_collect_start = ts.get("tool_collect_start")
    ts_collect_end = ts.get("tool_invoke_end")  # tool_invoke_end == Ts_collect_end

    # Only include complete invocations
    if not all([ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end]):
        continue
```
Use explicit None checks for timestamp presence.
A valid timestamp of 0.0 is falsy, so normalized traces can drop complete invocations here and undercount both throughput and latency metrics. Compare against None instead of relying on truthiness.
🛠️ Suggested fix

```diff
-        if not all([ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end]):
+        if any(
+            ts is None
+            for ts in (
+                ts_invoke_start,
+                ts_resolve_end,
+                ts_collect_start,
+                ts_collect_end,
+            )
+        ):
             continue
```

🧰 Tools
🪛 Ruff (0.15.7)
[warning] 58-58: Loop control variable inv_id not used within loop body
(B007)
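A quick check confirms the pitfall this comment describes: a normalized trace whose first event lands at t = 0.0 is falsy, so the truthiness-based check discards a complete invocation while an explicit `None` comparison keeps it. The timestamp values below are made up for illustration.

```python
# Hypothetical timestamps for one complete invocation in a normalized trace:
# the first event sits exactly at t = 0.0.
timestamps = (0.0, 0.12, 0.30, 0.45)

kept_by_truthiness = all(timestamps)                         # 0.0 is falsy
kept_by_none_check = all(t is not None for t in timestamps)  # presence, not truth

print(kept_by_truthiness, kept_by_none_check)  # → False True
```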
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/data_generation/experiments/throughput_saturation/main.py` around
lines 58 - 66, The current check uses truthiness and will drop valid zero
timestamps; update the completeness check to use explicit None comparisons:
replace the if not all([ts_invoke_start, ts_resolve_end, ts_collect_start,
ts_collect_end]): continue with a check such as if not all(v is not None for v
in (ts_invoke_start, ts_resolve_end, ts_collect_start, ts_collect_end)):
continue (or equivalently if any(v is None for v in (...)): continue) so by_id
loop keeps invocations with timestamp 0.0.
```python
self.benchmark_config = benchmark_config
self.plotter = ThroughputSaturationPlotter(plots_dir=plots_dir)
self.results: Dict[str, Any] = {}
self._load_experiment_config()
```
Don't bypass the already-loaded benchmark config.
FlowGenticBenchmarkManager already builds benchmark_config from its configured path, but this experiment then reopens Path("config.yml") directly. That makes custom config files and different working directories diverge from the rest of the benchmark run. Please pass the resolved config path down, or source these sweep values from the already-loaded configuration object instead.
Also applies to: 158-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/data_generation/experiments/throughput_saturation/main.py` around
lines 153 - 156, The experiment is reopening a hard-coded Path("config.yml")
instead of using the already-resolved benchmark config; update the code in
ThroughputSaturationExperiment (and the _load_experiment_config logic) to use
the existing self.benchmark_config (or accept and use a resolved config_path
passed down from FlowGenticBenchmarkManager) when reading sweep values and
settings rather than opening Path("config.yml"); ensure any logic at the block
referenced (lines ~158-167) sources values from the benchmark_config object or
the passed config_path so custom config files and working directories remain
consistent across the benchmark run.
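One way to honor this, sketched under assumptions (the `throughput_saturation` key, the dict-shaped config, and the attribute names are illustrative, not the PR's actual schema), is to read sweep values from the config object the manager already resolved:

```python
from typing import Any, Dict


class ThroughputSaturationExperimentSketch:
    """Hypothetical sketch; the real class also takes data_dir/plots_dir."""

    def __init__(self, benchmark_config: Dict[str, Any]):
        # Use the already-resolved config instead of reopening Path("config.yml"),
        # so custom config files and working directories stay consistent.
        self.benchmark_config = benchmark_config
        self._load_experiment_config()

    def _load_experiment_config(self) -> None:
        exp = self.benchmark_config.get("throughput_saturation", {})
        self.n_agents_sweep = exp.get("n_agents_sweep", [1, 2, 4, 8])
        self.calls_per_agent = exp.get("calls_per_agent", 4)


cfg = {"throughput_saturation": {"n_agents_sweep": [2, 4], "calls_per_agent": 3}}
exp = ThroughputSaturationExperimentSketch(cfg)
print(exp.n_agents_sweep, exp.calls_per_agent)  # → [2, 4] 3
```

Passing the resolved config object (or its path) down keeps a single source of truth, so a `--config custom.yml` run cannot silently diverge between the manager and this experiment.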
```python
has_box = all(r.get("d_resolve_box") for r in records)
if not has_box:
    logger.warning("No box stats found — re-run experiment to generate boxplots.")
    return
```
Skip sparse runs per plot instead of turning them into missing/zero-valued overhead plots.
extract_invocation_metrics() in benchmark/data_generation/experiments/throughput_saturation/main.py can emit records with only n_completions when a sweep point has no complete invocations. In that case, one bad point disables both boxplot figures here, and _plot_overhead_fraction() renders an undefined metric as 0%. Filter each plot to the records that actually contain the fields it needs.
Also applies to: 133-136, 171-172
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/data_generation/experiments/throughput_saturation/utils/plots.py`
around lines 84 - 87, The current code treats any missing per-record fields as
global absence (has_box check on records using "d_resolve_box"), which causes
sparse sweep points to produce zero/undefined values and disables entire
boxplots; update the plotting functions (the block using has_box and the
functions that call _plot_overhead_fraction) to filter records per-plot by
checking for the specific fields each plot needs (e.g., require "d_resolve_box"
for the boxplot, require the particular overhead fields used by
_plot_overhead_fraction) and pass only those filtered records into the plotting
logic; apply the same per-plot presence checks and filtering where similar
existence checks occur (the other occurrences around lines noted) so sparse runs
are skipped for a plot instead of being treated as zero-valued.
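A minimal sketch of that per-plot filtering (the helper name and the record fields shown are illustrative, not the PR's actual data layout):

```python
from typing import Any, Dict, List


def records_with_fields(records: List[Dict[str, Any]], *fields: str) -> List[Dict[str, Any]]:
    """Keep only records that carry every field a given plot needs."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


records = [
    {"n_agents": 2, "d_resolve_box": [0.1], "d_collect_box": [0.2]},
    {"n_agents": 4, "n_completions": 0},  # sparse sweep point: no complete invocations
]

# The boxplot uses only records that have both box-stat fields; the sparse
# point is skipped for this plot instead of disabling it or rendering zeros.
boxplot_records = records_with_fields(records, "d_resolve_box", "d_collect_box")
print([r["n_agents"] for r in boxplot_records])  # → [2]
```

Each plotting method would then filter for exactly the fields it consumes, so one sparse sweep point degrades a single figure at most.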