
Experiment1a #94 (Draft)

dianacord wants to merge 10 commits into hotfix/benchmarking_redesign from experiment1a

Conversation

@dianacord (Collaborator)

No description provided.

coderabbitai bot commented Feb 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @dianacord, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new benchmark experiment, 'Experiment 1a', designed to analyze the coordination throughput of the system under varying invocation rates and ensemble sizes. The experiment aims to generate saturation curves, providing insights into how the system performs under increasing load and with different backend slot configurations. This addition enhances the benchmarking suite by offering a specific test case for understanding system scalability and performance bottlenecks related to coordination.

Highlights

  • New Benchmark Experiment (Experiment 1a): Introduced a new benchmark experiment, 'Experiment 1a', to analyze coordination throughput versus invocation rate, focusing on saturation curves for different ensemble sizes.
  • Configuration Updates: Added a dedicated configuration section for 'experiment_1a' in config.yml and temp-experiment/config/config.yml, specifying ensemble_sizes, tool_invocations_sweep, and tool_execution_duration_time.
  • Experiment Logic Implementation: Implemented the core logic for 'Experiment 1a' in main.py, including parameter loading, workload execution using LangraphWorkload, and calculation of offered load and throughput.
  • Plotting Utility: Added a new plotting utility in plots.py specifically for 'Experiment 1a' to generate and save saturation curve plots, visualizing throughput against offered load for various ensemble sizes.
  • Benchmark Integration: Integrated 'Experiment 1a' into the benchmark runner by importing and registering it with the FlowGenticBenchmarkManager.


Changelog
  • tests/benchmark/config.yml
    • Added a new experiment_1a section to define parameters for the new benchmark, including ensemble_sizes, tool_invocations_sweep, and tool_execution_duration_time.
  • tests/benchmark/data_generation/experiments/experiment_1a/main.py
    • New file added, implementing the Experiment1a class which inherits from BaseExperiment.
    • Defines the logic for running the saturation curve experiment, including iterating through ensemble sizes and tool invocation sweeps.
    • Calculates offered_load and throughput based on workload results.
    • Loads experiment-specific configuration from config.yml.
  • tests/benchmark/data_generation/experiments/experiment_1a/utils/plots.py
    • New file added, containing the Experiment1aPlotter class which inherits from BasePlotter.
    • Provides functionality to generate and save a saturation curve plot (fig1a_saturation_curve.png) based on the experiment data.
  • tests/benchmark/data_generation/run_experiments.py
    • Imported the newly created Experiment1a class.
    • Registered Experiment1a with the FlowGenticBenchmarkManager under the key 'experiment_1a', making it available for execution.
  • tests/benchmark/data_generation/workload/utils/engine.py
    • Removed an unused import statement for ThreadPoolExecutor from autogen.code_utils.
  • tests/benchmark/results/temp-experiment/config/config.yml
    • Added the same experiment_1a configuration as tests/benchmark/config.yml, likely for templating or example purposes.
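Based on the parameter names listed in the changelog, the new `experiment_1a` section of `config.yml` presumably looks something like the sketch below. The field names come from this PR's summary; the values shown are purely illustrative, not taken from the actual diff.

```yaml
# Hypothetical sketch of the experiment_1a section in tests/benchmark/config.yml.
# Field names are from the PR summary; all values here are made up.
experiment_1a:
  ensemble_sizes: [1, 2, 4, 8]           # ensemble sizes to sweep (illustrative)
  tool_invocations_sweep: [10, 50, 100]  # total tool invocations per run (illustrative)
  tool_execution_duration_time: 0.5      # simulated per-tool latency in seconds (illustrative)
```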
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Supported commands:

  • Code Review — /gemini review: Performs a code review for the current pull request in its current state.
  • Pull Request Summary — /gemini summary: Provides a summary of the current pull request in its current state.
  • Comment — @gemini-code-assist: Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help — /gemini help: Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmark, Experiment1a, to measure coordination throughput against invocation rate. The changes include the core experiment logic, plotting utilities, and necessary configuration updates. My review focuses on the new Python code, where I've identified several opportunities to improve robustness and prevent potential runtime errors. I've suggested adding checks to handle edge cases like division by zero and empty data series, which could otherwise crash the experiment. Additionally, I've pointed out an incorrect type hint and a risky assumption about input parameters, providing suggestions to make the code more reliable and maintainable.

Commented code:

    offered_load = total_invocations / self.tool_duration
    throughput = total_invocations / workload_result.total_makespan

Severity: high

If workload_result.total_makespan is zero, this calculation will raise a ZeroDivisionError, crashing the experiment. While unlikely, it's safer to handle this edge case. A makespan of zero implies infinite throughput.

    throughput = (
        total_invocations / workload_result.total_makespan
        if workload_result.total_makespan > 0
        else float("inf")
    )
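As a standalone illustration of the guarded division suggested above (the function name and arguments here are stand-ins for the PR's variables, not its actual API):

```python
# Minimal sketch of the guarded throughput calculation.
# compute_throughput is an illustrative stand-in, not a function from this PR.
def compute_throughput(total_invocations: int, total_makespan: float) -> float:
    """Return invocations per second, treating a zero makespan as infinite throughput."""
    return total_invocations / total_makespan if total_makespan > 0 else float("inf")

print(compute_throughput(100, 4.0))  # 25.0
print(compute_throughput(100, 0.0))  # inf, instead of raising ZeroDivisionError
```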

Comment on lines +33 to +35
sorted_series = sorted(
data.items(), key=lambda item: item[1][0]["ensemble_size"]
)
Severity: high

The sorting logic assumes that every series in data has at least one record (item[1] is not empty). If a series has no records, item[1][0] will raise an IndexError, causing the plotting to fail. It's safer to filter out empty series before sorting.

Suggested change:

    # before
    sorted_series = sorted(
        data.items(), key=lambda item: item[1][0]["ensemble_size"]
    )

    # after
    sorted_series = sorted(
        (item for item in data.items() if item[1]),
        key=lambda item: item[1][0]["ensemble_size"],
    )
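The filter-then-sort pattern recommended above can be demonstrated on a toy data dict (the series names and record fields below are illustrative, not the PR's actual data):

```python
# Each value is a list of per-run records; empty series are skipped so that
# item[1][0] is always safe to index.
data = {
    "series_b": [{"ensemble_size": 4, "throughput": 9.5}],
    "series_a": [{"ensemble_size": 2, "throughput": 5.1}],
    "series_c": [],  # an empty series that would crash the original code with IndexError
}

sorted_series = sorted(
    (item for item in data.items() if item[1]),
    key=lambda item: item[1][0]["ensemble_size"],
)
print([name for name, _ in sorted_series])  # ['series_a', 'series_b']
```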

"""

def __init__(
self, benchmark_config: BenchmarkConfig, data_dir: str, plots_dir: str
Severity: medium

The type hints for data_dir and plots_dir are specified as str, but they are initialized as pathlib.Path objects in run_experiments.py and are expected to be Path objects by Experiment1aPlotter. To ensure type consistency and prevent potential errors, these should be typed as Path.

Suggested change:

    # before
    self, benchmark_config: BenchmarkConfig, data_dir: str, plots_dir: str

    # after
    self, benchmark_config: BenchmarkConfig, data_dir: Path, plots_dir: Path
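A minimal sketch of what the `Path`-typed constructor might look like in context. `BenchmarkConfig` is stubbed here as a dataclass since the real class lives in the benchmark package, and normalising with `Path(...)` inside `__init__` is an extra defensive choice not present in the PR:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class BenchmarkConfig:  # stand-in for the real benchmark config object
    name: str

class Experiment1a:
    def __init__(self, benchmark_config: BenchmarkConfig, data_dir: Path, plots_dir: Path):
        self.benchmark_config = benchmark_config
        # Path(...) accepts both str and Path, so this also tolerates callers
        # that still pass strings.
        self.data_dir = Path(data_dir)
        self.plots_dir = Path(plots_dir)

exp = Experiment1a(BenchmarkConfig("demo"), Path("data"), Path("plots"))
print(exp.data_dir)  # data
```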

# Derive calls_per_tool from total invocations
# total_invocations = n_agents * calls_per_tool * N_TOOLS
# With n_agents=1: calls_per_tool = total_invocations / N_TOOLS
calls_per_tool = total_invocations // N_TOOLS
Severity: medium

The calculation for calls_per_tool uses integer division (//), which will truncate the result if total_invocations is not evenly divisible by N_TOOLS. This could lead to a discrepancy between the intended number of invocations and the actual number executed, making performance metrics like throughput misleading. Adding an assertion will ensure this condition is met and prevent silent errors.

    assert total_invocations % N_TOOLS == 0, (
        f"total_invocations ({total_invocations}) must be divisible by N_TOOLS ({N_TOOLS})"
    )
    calls_per_tool = total_invocations // N_TOOLS
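The divisibility check suggested above can be exercised in isolation (the `N_TOOLS` value and function name below are illustrative, not taken from the PR):

```python
# Illustrates the assertion guarding the integer division, so that
# truncation cannot silently drop invocations from the sweep.
N_TOOLS = 4  # illustrative value

def derive_calls_per_tool(total_invocations: int) -> int:
    assert total_invocations % N_TOOLS == 0, (
        f"total_invocations ({total_invocations}) must be divisible by N_TOOLS ({N_TOOLS})"
    )
    return total_invocations // N_TOOLS

print(derive_calls_per_tool(100))  # 25
# derive_calls_per_tool(10) would raise AssertionError instead of silently truncating to 2
```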

@javidsegura (Contributor)

Make this a draft PR and open the actual PR when you have the experiment1a code stable.

@dianacord dianacord marked this pull request as draft February 23, 2026 10:57
