
Hotfix/benchmarking redesign#90

Draft
javidsegura wants to merge 52 commits into main from hotfix/benchmarking_redesign

Conversation

@javidsegura
Contributor

@javidsegura javidsegura commented Jan 19, 2026

initial set of plots for laptop config for strong scaling only

See the plots at: results/FIRST RUN/plots

Contains:

  1. New data generation script (framework agnostic, integrates new agentic abstractions)
  2. Configuration file for easy modification of experiment variables
  3. Analyze module that plots for efficiency and speedup
  4. General modular refactoring of the benchmarking dir

@coderabbitai

coderabbitai bot commented Jan 19, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e8d65bb7-90af-406a-b5f1-08fac3cfaf4b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's reviews stricter and more nitpicky by using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to make CodeRabbit nitpick more issues in your PRs.

@gemini-code-assist
Contributor

Summary of Changes

Hello @javidsegura, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant redesign of the project's agent architecture and establishes a robust benchmarking framework. The core changes involve creating a more modular and extensible agent system by abstracting its reasoning, memory, and execution components. A key part of this is the integration of 'radical.asyncflow' as a high-performance backend for concurrent tool execution. Concurrently, a new benchmarking suite has been added to systematically measure and analyze the performance characteristics of the agent, particularly focusing on strong scaling, with initial results already generated. This refactor aims to improve the maintainability, testability, and performance analysis capabilities of the agent system.

Highlights

  • New Agent Core Architecture: Introduced a modular 'Agent' class with distinct 'Reasoner', 'ModelProvider', 'MemoryManager', and 'BaseEngine' components for improved clarity and extensibility in agent design.
  • Radical AsyncFlow Integration: Implemented 'AsyncFlowEngine' to leverage 'radical.asyncflow' for efficient, concurrent execution of agent tools, abstracting the backend logic and enabling high-performance operations.
  • Comprehensive Benchmarking Framework: Added a new 'tests/benchmark' directory with tools for running strong scaling experiments, collecting performance data, and generating speedup/efficiency plots to systematically analyze agent performance.
  • Major Code Refactoring and Cleanup: Old 'langGraph' and 'settings' related modules have been moved to an 'old/' directory, signifying a shift towards the new 'core' architecture and streamlining the codebase.
  • Dependency Management Update: Switched to 'uv' for virtual environment and package management, and updated 'pyproject.toml' to include new dependencies like 'matplotlib' and 'cloudpickle' for benchmarking and data visualization.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant redesign, moving old code to an old directory and establishing a new, more abstract core for agents, reasoners, and backend engines. A comprehensive benchmarking suite has also been added to test scaling, which is a great addition. My review focuses on improving type consistency, fixing a few bugs, and offering suggestions to enhance maintainability and configuration.

I am having trouble creating individual review comments. Click here to see my feedback.

src/flowgentic/backend_engines/base.py (8)

high

There is a type mismatch for the tools_to_use parameter between the base class and its implementation. Here in BaseEngine, it's defined as List[Tuple[str, Dict]], but in AsyncFlowEngine, it's List[ChatCompletionMessageFunctionToolCall]. This violates the Liskov Substitution Principle and can lead to type-related errors. Please unify the type hint across the base class and its implementations.
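One way to unify the hint is a single alias shared by the base class and every implementation. The sketch below is illustrative only: the method name `execute_tools` and the `ToolCall` alias are assumptions, not the PR's actual API, and the concrete type would need to match whichever representation the codebase settles on.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple

# Shared alias used by the base class and every implementation
# (the concrete shape is an assumption for illustration).
ToolCall = Tuple[str, Dict[str, Any]]  # (tool_name, arguments)

class BaseEngine(ABC):
    @abstractmethod
    async def execute_tools(self, tools_to_use: List[ToolCall]) -> List[Any]: ...

class AsyncFlowEngine(BaseEngine):
    async def execute_tools(self, tools_to_use: List[ToolCall]) -> List[Any]:
        # Identical signature to the base class, so substitution is safe.
        return [name for name, _args in tools_to_use]

print(asyncio.run(AsyncFlowEngine().execute_tools([("search", {"q": "x"})])))
# ['search']
```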

src/flowgentic/core/agent.py (37)

high

A PromptInput object is being passed to state.add_user_input(), but this method expects a str. This will raise a TypeError at runtime. You should pass the user_input attribute of the prompt_input object instead.

state.add_user_input(prompt_input.user_input)

.gitignore (44-48)

medium

Generated files, such as benchmark results, plots, and temporary files from examples, should be ignored by git. The current changes remove im-working.txt from being ignored, which is created by an example. Additionally, the new benchmark results under tests/benchmark/results/ are being tracked.

I suggest adding these paths to your .gitignore to keep the repository clean from generated artifacts.
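The suggested entries might look like the following sketch, limited to the paths the review names:

```gitignore
# Generated benchmark artifacts
tests/benchmark/results/

# Temporary file created by an example
im-working.txt
```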

Makefile (42-62)

medium

Several useful developer commands such as format, lint, and tests have been removed from the Makefile. While the installation process has been updated to use uv, removing these commands might hinder development workflow and CI/CD processes. Was this removal intentional? If not, I recommend re-adding them.
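Re-adding the removed targets on top of the uv-based setup could look like this sketch; the specific tools (ruff, pytest) are illustrative assumptions, not confirmed choices of this repository:

```make
# Sketch: restore developer targets via uv-managed tooling.
format:
	uv run ruff format .

lint:
	uv run ruff check .

tests:
	uv run pytest
```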

examples/langgraph-integration/design_patterns/chatbot/toy.py (75)

medium

There is a typo in the function name deterministic_task_intetrnal. It should be deterministic_task_internal.

async def deterministic_task_internal(state: WorkflowState):

examples/langgraph_asyncflow/main.py (48)

medium

There's a typo in the model_id. It should be dummy_model instead of dummy_moodel.

model_id="dummy/dummy_model",

pyproject.toml (98)

medium

For better reproducibility, it's recommended to pin git dependencies to a specific commit hash or tag instead of a branch like main. This ensures that everyone uses the exact same version of the dependency, avoiding potential issues from future changes on the branch.

radical-asyncflow = { git = "https://github.com/radical-cybertools/radical.asyncflow.git", rev = "<COMMIT_HASH_OR_TAG>" }

src/flowgentic/__init__.py (5)

medium

The logger level is hardcoded to "debug". While this is useful for development, it's not ideal for production or other environments. It would be better to make this configurable, for example, by reading from an environment variable. This would allow changing the log level without modifying the code.

colorful_output=True, logger_level=os.getenv("LOG_LEVEL", "debug")

src/flowgentic/core/models/model_provider.py (3)

medium

This import statement is a duplicate of the one on line 1. It should be removed.

src/flowgentic/core/tool/tool.py (21)

medium

The get_schema method does a good job for simple types. However, it doesn't fully support complex types like List[T]. For a parameter annotated as List[str], it will generate {"type": "array"} but will miss the "items": {"type": "string"} part, which is important for schema validation and for the LLM to understand the expected structure. Consider enhancing this method to handle nested types.
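A minimal sketch of resolving nested container annotations into JSON Schema. The helper name and primitive mapping below are illustrative assumptions, not the PR's actual get_schema implementation:

```python
from typing import List, get_args, get_origin

# Assumed mapping from Python primitives to JSON Schema type names.
_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def annotation_to_schema(annotation) -> dict:
    if annotation in _PRIMITIVES:
        return {"type": _PRIMITIVES[annotation]}
    if get_origin(annotation) is list:
        (item_type,) = get_args(annotation) or (str,)
        # Emit the "items" entry the review notes is currently missing.
        return {"type": "array", "items": annotation_to_schema(item_type)}
    # Fallback for anything not handled above.
    return {"type": "object"}

print(annotation_to_schema(List[str]))
# {'type': 'array', 'items': {'type': 'string'}}
```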

tests/benchmark/data_generation/workload/manager.py (30)

medium

A print statement is used here for debugging. It's better practice to use the logging framework (logger.debug(...)) for such messages. This allows for consistent log formatting and the ability to control log verbosity through configuration.

logger.debug(f"NUMBER OF AGENTS {self.n_of_agents}")

tests/benchmark/data_generation/workload/langgraph_asyncflow.py (36)

medium

The workload simulation uses ProcessPoolExecutor. For tasks that are I/O-bound, like the asyncio.sleep in your tool, ThreadPoolExecutor is generally more efficient. ProcessPoolExecutor incurs higher overhead due to inter-process communication and data serialization (pickling), which might not be necessary here and could skew benchmark results. If the goal is to simulate CPU-bound work, this is appropriate, but for I/O-bound tasks, consider switching to ThreadPoolExecutor.
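A minimal sketch of the suggested alternative, fanning I/O-bound work out through a ThreadPoolExecutor from asyncio; the function names here are hypothetical and the blocking sleep stands in for the tool's I/O:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_tool(delay: float) -> float:
    # Stand-in for a blocking I/O call inside a tool.
    time.sleep(delay)
    return delay

async def run_with_threads(n: int, delay: float) -> float:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        await asyncio.gather(
            *(loop.run_in_executor(pool, io_bound_tool, delay) for _ in range(n))
        )
    return time.perf_counter() - start

# Threads overlap the sleeps, so n tasks finish in roughly one delay period,
# without the pickling overhead a ProcessPoolExecutor would add.
elapsed = asyncio.run(run_with_threads(8, 0.1))
print(f"{elapsed:.2f}s")
```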

@@ -72,7 +72,7 @@ async def traffic_extractor(city: str):
 @agents_manager.execution_wrappers.asyncflow(
 	flow_type=AsyncFlowType.EXECUTION_BLOCK
 )
-async def deterministic_task_internal(state: WorkflowState):
+async def deterministic_task_intetrnal(state: WorkflowState):
Contributor


is this meant to be like this? looks like a typo.
