
Experiment1a #94 (Draft)

dianacord wants to merge 10 commits into hotfix/benchmarking_redesign from experiment1a

Conversation

@dianacord (Collaborator)

No description provided.

coderabbitai bot commented Feb 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @dianacord, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new benchmark experiment, 'Experiment 1a', designed to analyze the coordination throughput of the system under varying invocation rates and ensemble sizes. The experiment aims to generate saturation curves, providing insights into how the system performs under increasing load and with different backend slot configurations. This addition enhances the benchmarking suite by offering a specific test case for understanding system scalability and performance bottlenecks related to coordination.

Highlights

  • New Benchmark Experiment (Experiment 1a): Introduced a new benchmark experiment, 'Experiment 1a', to analyze coordination throughput versus invocation rate, focusing on saturation curves for different ensemble sizes.
  • Configuration Updates: Added a dedicated configuration section for 'experiment_1a' in config.yml and temp-experiment/config/config.yml, specifying ensemble_sizes, tool_invocations_sweep, and tool_execution_duration_time.
  • Experiment Logic Implementation: Implemented the core logic for 'Experiment 1a' in main.py, including parameter loading, workload execution using LangraphWorkload, and calculation of offered load and throughput.
  • Plotting Utility: Added a new plotting utility in plots.py specifically for 'Experiment 1a' to generate and save saturation curve plots, visualizing throughput against offered load for various ensemble sizes.
  • Benchmark Integration: Integrated 'Experiment 1a' into the benchmark runner by importing and registering it with the FlowGenticBenchmarkManager.


Changelog
  • tests/benchmark/config.yml
    • Added a new experiment_1a section to define parameters for the new benchmark, including ensemble_sizes, tool_invocations_sweep, and tool_execution_duration_time.
  • tests/benchmark/data_generation/experiments/experiment_1a/main.py
    • New file added, implementing the Experiment1a class which inherits from BaseExperiment.
    • Defines the logic for running the saturation curve experiment, including iterating through ensemble sizes and tool invocation sweeps.
    • Calculates offered_load and throughput based on workload results.
    • Loads experiment-specific configuration from config.yml.
  • tests/benchmark/data_generation/experiments/experiment_1a/utils/plots.py
    • New file added, containing the Experiment1aPlotter class which inherits from BasePlotter.
    • Provides functionality to generate and save a saturation curve plot (fig1a_saturation_curve.png) based on the experiment data.
  • tests/benchmark/data_generation/run_experiments.py
    • Imported the newly created Experiment1a class.
    • Registered Experiment1a with the FlowGenticBenchmarkManager under the key 'experiment_1a', making it available for execution.
  • tests/benchmark/data_generation/workload/utils/engine.py
    • Removed an unused import statement for ThreadPoolExecutor from autogen.code_utils.
  • tests/benchmark/results/temp-experiment/config/config.yml
    • Added the same experiment_1a configuration as tests/benchmark/config.yml, likely for templating or example purposes.
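Based on the parameter names listed in the changelog, the new `experiment_1a` section of `config.yml` presumably looks something like the sketch below. The field names come from this PR's summary; the values shown are purely illustrative, not taken from the actual diff.

```yaml
# Hypothetical sketch of the experiment_1a section in tests/benchmark/config.yml.
# Field names are from the PR summary; all values here are made up.
experiment_1a:
  ensemble_sizes: [1, 2, 4, 8]           # ensemble sizes to sweep (illustrative)
  tool_invocations_sweep: [10, 50, 100]  # total tool invocations per run (illustrative)
  tool_execution_duration_time: 0.5      # simulated per-tool latency in seconds (illustrative)
```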
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Supported commands:

  • Code Review — /gemini review: Performs a code review for the current pull request in its current state.
  • Pull Request Summary — /gemini summary: Provides a summary of the current pull request in its current state.
  • Comment — @gemini-code-assist: Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help — /gemini help: Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmark, Experiment1a, to measure coordination throughput against invocation rate. The changes include the core experiment logic, plotting utilities, and necessary configuration updates. My review focuses on the new Python code, where I've identified several opportunities to improve robustness and prevent potential runtime errors. I've suggested adding checks to handle edge cases like division by zero and empty data series, which could otherwise crash the experiment. Additionally, I've pointed out an incorrect type hint and a risky assumption about input parameters, providing suggestions to make the code more reliable and maintainable.

Commented code:

    offered_load = total_invocations / self.tool_duration
    throughput = total_invocations / workload_result.total_makespan

Severity: high

If workload_result.total_makespan is zero, this calculation will raise a ZeroDivisionError, crashing the experiment. While unlikely, it's safer to handle this edge case. A makespan of zero implies infinite throughput.

    throughput = (
        total_invocations / workload_result.total_makespan
        if workload_result.total_makespan > 0
        else float("inf")
    )
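As a standalone illustration of the guarded division suggested above (the function name and arguments here are stand-ins for the PR's variables, not its actual API):

```python
# Minimal sketch of the guarded throughput calculation.
# compute_throughput is an illustrative stand-in, not a function from this PR.
def compute_throughput(total_invocations: int, total_makespan: float) -> float:
    """Return invocations per second, treating a zero makespan as infinite throughput."""
    return total_invocations / total_makespan if total_makespan > 0 else float("inf")

print(compute_throughput(100, 4.0))  # 25.0
print(compute_throughput(100, 0.0))  # inf, instead of raising ZeroDivisionError
```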

Comment on lines +33 to +35
sorted_series = sorted(
data.items(), key=lambda item: item[1][0]["ensemble_size"]
)
Severity: high

The sorting logic assumes that every series in data has at least one record (item[1] is not empty). If a series has no records, item[1][0] will raise an IndexError, causing the plotting to fail. It's safer to filter out empty series before sorting.

Suggested change:

    # before
    sorted_series = sorted(
        data.items(), key=lambda item: item[1][0]["ensemble_size"]
    )

    # after
    sorted_series = sorted(
        (item for item in data.items() if item[1]),
        key=lambda item: item[1][0]["ensemble_size"],
    )
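The filter-then-sort pattern recommended above can be demonstrated on a toy data dict (the series names and record fields below are illustrative, not the PR's actual data):

```python
# Each value is a list of per-run records; empty series are skipped so that
# item[1][0] is always safe to index.
data = {
    "series_b": [{"ensemble_size": 4, "throughput": 9.5}],
    "series_a": [{"ensemble_size": 2, "throughput": 5.1}],
    "series_c": [],  # an empty series that would crash the original code with IndexError
}

sorted_series = sorted(
    (item for item in data.items() if item[1]),
    key=lambda item: item[1][0]["ensemble_size"],
)
print([name for name, _ in sorted_series])  # ['series_a', 'series_b']
```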

"""

def __init__(
self, benchmark_config: BenchmarkConfig, data_dir: str, plots_dir: str
Severity: medium

The type hints for data_dir and plots_dir are specified as str, but they are initialized as pathlib.Path objects in run_experiments.py and are expected to be Path objects by Experiment1aPlotter. To ensure type consistency and prevent potential errors, these should be typed as Path.

Suggested change:

    # before
    self, benchmark_config: BenchmarkConfig, data_dir: str, plots_dir: str

    # after
    self, benchmark_config: BenchmarkConfig, data_dir: Path, plots_dir: Path
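A minimal sketch of what the `Path`-typed constructor might look like in context. `BenchmarkConfig` is stubbed here as a dataclass since the real class lives in the benchmark package, and normalising with `Path(...)` inside `__init__` is an extra defensive choice not present in the PR:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class BenchmarkConfig:  # stand-in for the real benchmark config object
    name: str

class Experiment1a:
    def __init__(self, benchmark_config: BenchmarkConfig, data_dir: Path, plots_dir: Path):
        self.benchmark_config = benchmark_config
        # Path(...) accepts both str and Path, so this also tolerates callers
        # that still pass strings.
        self.data_dir = Path(data_dir)
        self.plots_dir = Path(plots_dir)

exp = Experiment1a(BenchmarkConfig("demo"), Path("data"), Path("plots"))
print(exp.data_dir)  # data
```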

# Derive calls_per_tool from total invocations
# total_invocations = n_agents * calls_per_tool * N_TOOLS
# With n_agents=1: calls_per_tool = total_invocations / N_TOOLS
calls_per_tool = total_invocations // N_TOOLS
Severity: medium

The calculation for calls_per_tool uses integer division (//), which will truncate the result if total_invocations is not evenly divisible by N_TOOLS. This could lead to a discrepancy between the intended number of invocations and the actual number executed, making performance metrics like throughput misleading. Adding an assertion will ensure this condition is met and prevent silent errors.

    assert total_invocations % N_TOOLS == 0, (
        f"total_invocations ({total_invocations}) must be divisible by N_TOOLS ({N_TOOLS})"
    )
    calls_per_tool = total_invocations // N_TOOLS
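The divisibility check suggested above can be exercised in isolation (the `N_TOOLS` value and function name below are illustrative, not taken from the PR):

```python
# Illustrates the assertion guarding the integer division, so that
# truncation cannot silently drop invocations from the sweep.
N_TOOLS = 4  # illustrative value

def derive_calls_per_tool(total_invocations: int) -> int:
    assert total_invocations % N_TOOLS == 0, (
        f"total_invocations ({total_invocations}) must be divisible by N_TOOLS ({N_TOOLS})"
    )
    return total_invocations // N_TOOLS

print(derive_calls_per_tool(100))  # 25
# derive_calls_per_tool(10) would raise AssertionError instead of silently truncating to 2
```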

@javidsegura (Contributor)

Make this a draft PR and open the actual PR when you have the experiment1a code stable.

@dianacord dianacord marked this pull request as draft February 23, 2026 10:57
