feat: terminal bench #3118

Saedbhati · 2025-09-10T07:29:08Z

Description

Describe your changes in detail (optional if the linked issue already contains a detailed description of the changes).
fixes #2464

Does not work on Windows and requires Python 3.12.

Checklist

Go over all the following points, and put an x in all the boxes that apply.

I have read the CONTRIBUTION guide (required)
I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
I have updated the tests accordingly (required for a bug fix or a new feature)
I have updated the documentation if needed:
I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

coderabbitai · 2025-09-10T07:29:15Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


hesamsheikh · 2025-09-10T16:15:10Z

What is this PR about? What issue does it solve?

Saedbhati · 2025-09-10T16:30:26Z

@hesamsheikh Sorry, I forgot to mention the issue. This PR solves #2464.

waleedalzarooni

Great work @Saedbhati !

A very comprehensive integration. I have only included a couple of minor comments to help improve it slightly.

waleedalzarooni · 2025-09-12T09:14:18Z

camel/benchmarks/tbench_camel_agent.py

+            if os.path.exists(
+                f"{self.logging_dir}/{container_name}_run{run_id:02d}"
+            ):
+                run_id += 1


I think this may cause race conditions when multiple processes are running: two processes can both see that a directory does not exist and pick the same run_id. Perhaps consider switching to timestamp-based IDs or some atomic directory creation equivalent!
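To illustrate the atomic-directory suggestion, here is a minimal sketch. It assumes the run directories follow the `{container_name}_run{run_id:02d}` pattern from the snippet above. The function name `allocate_run_dir`, the `max_runs` limit, and starting the counter at 0 are hypothetical choices for this example, not part of the PR. The idea is to let `mkdir` itself act as the existence check, so checking and claiming the directory happen in one step.

```python
from pathlib import Path


def allocate_run_dir(
    logging_dir: Path, container_name: str, max_runs: int = 100
) -> Path:
    """Claim the first free `<container_name>_runNN` directory.

    Hypothetical helper: Path.mkdir() raises FileExistsError when the
    directory already exists, so a process that loses the race simply
    moves on to the next run_id instead of reusing the same directory.
    """
    logging_dir.mkdir(parents=True, exist_ok=True)
    for run_id in range(max_runs):
        candidate = logging_dir / f"{container_name}_run{run_id:02d}"
        try:
            # Creation and the existence check are a single operation here.
            candidate.mkdir()
            return candidate
        except FileExistsError:
            continue
    raise RuntimeError(
        f"No free run directory for {container_name!r} in {logging_dir}"
    )
```

Compared with `os.path.exists` followed by a later write, this leaves no window between the check and the creation in which another process could pick the same ID.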

waleedalzarooni · 2025-09-12T11:03:59Z

camel/toolkits/terminal_toolkit_docker.py

+        """
+        pb = self._previous_buffer.strip()
+        if pb in current_buffer:
+            idx = current_buffer.index(pb)


I think there may be an index mix-up here. The method takes an index computed on the previous buffer and uses it as if it were an index into the current buffer: pb.rfind("\n") returns a position within the previous buffer string, but that position is then used to slice the current buffer.

Example:
Previous: "hello\nworld" ← idx=5 points to \n here
Current: "hello\nworld\nnew" ← but we use idx=5 here
Result: "world\nnew" ← wrong! includes "world"
Should be: "new" ← correct! just the new part
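One possible way to avoid mixing the two index spaces is sketched below. It assumes the goal is to return only the text that appears after the previous buffer inside the current one. `extract_new_output` is a hypothetical standalone function written for this comment, not the toolkit's actual method, and falling back to the whole current buffer is an assumption about the desired behavior.

```python
def extract_new_output(previous_buffer: str, current_buffer: str) -> str:
    """Return the part of current_buffer that follows previous_buffer.

    Hypothetical sketch: every index used for slicing is computed on
    current_buffer itself, never on previous_buffer.
    """
    pb = previous_buffer.strip()
    if pb and pb in current_buffer:
        # Locate pb inside the current buffer, then skip its full length.
        start = current_buffer.index(pb) + len(pb)
        return current_buffer[start:].lstrip("\n")
    # Previous content not found (or empty): treat everything as new.
    return current_buffer


print(extract_new_output("hello\nworld", "hello\nworld\nnew"))  # prints "new"
```

With the buffers from the example above, the slice starts after the whole of "hello\nworld", so "world" is no longer included in the result.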

hesamsheikh · 2025-09-15T18:30:54Z

camel/benchmarks/tbench.py

+if TYPE_CHECKING:
+    from terminal_bench.agents.agent_name import (  # type: ignore[import-untyped]
+        AgentName,
+    )
+    from terminal_bench.agents.base_agent import (  # type: ignore[import-untyped]
+        AgentResult,
+        BaseAgent,
+    )
+    from terminal_bench.handlers.trial_handler import (  # type: ignore[import-untyped]
+        TrialHandler,
+    )
+    from terminal_bench.harness import Harness  # type: ignore[import-untyped]
+    from terminal_bench.terminal.tmux_session import (  # type: ignore[import-untyped]
+        TmuxSession,


Let's add terminal_bench to pyproject.toml

camel/benchmarks/tbench_camel_agent.py

hesamsheikh · 2025-09-15T18:46:11Z

Thanks for the PR @Saedbhati !
I left a few comments above.

Also, have you been able to run the example? I got the error:

NameError: name 'Harness' is not defined

This is probably because Harness is imported inside a TYPE_CHECKING block, so it is only available during static type checking, not at runtime. The TerminalBench class, however, inherits from Harness at runtime, hence the NameError.
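The behavior can be reproduced without terminal_bench installed. The sketch below uses `collections.OrderedDict` purely as a stand-in for `Harness`; the function names are hypothetical. It shows that a quoted annotation tolerates a TYPE_CHECKING-only import, while a base class does not, because base classes are evaluated when the class statement runs.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime.
    from collections import OrderedDict  # stand-in for Harness


def describe(mapping: "OrderedDict") -> str:
    # A quoted annotation is not evaluated at runtime, so this works.
    return type(mapping).__name__


def define_subclass() -> str:
    try:
        # The base class expression is evaluated immediately, so the
        # name must exist at runtime. Here it does not.
        class MyDict(OrderedDict):  # noqa: F821
            pass
    except NameError as exc:
        return str(exc)
    return "ok"


print(describe({}))
print(define_subclass())
```

One way out is to import the base class at runtime, either at module level outside the TYPE_CHECKING guard or lazily inside the function that defines the subclass, and keep the guard only for names used in annotations.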

waleedalzarooni

No major changes needed, so I am happy with this version once the comments are addressed.


Saedbhati · 2025-09-16T09:21:54Z

@hesamsheikh I have fixed the example, you can give it a try.

hesamsheikh

Thanks for the fix @Saedbhati
I can run the example. Here are some tiny fixes required.

hesamsheikh · 2025-09-16T12:44:50Z

camel/benchmarks/tbench.py

+        self,
+        output_path: Path,
+        run_id: str,
+        ChatAgent: ChatAgent,


Is there any specific reason this parameter is not snake_case?

hesamsheikh · 2025-09-16T12:44:57Z

examples/benchmarks/tbench.py

+    save_to="tbench_results",
+    processes=1,
+)
+print(TBench_instance.run(camel_agent=camel_agent, subset=2))


This must be:

print(TBench_instance.run(agent=camel_agent, subset=2))

hesamsheikh · 2025-09-16T12:45:42Z

camel/benchmarks/tbench.py

+
+        Args:
+            agent: The ChatAgent to use for the benchmark.
+            on: The data split to run the benchmark on.


This must be removed.

hesamsheikh · 2025-09-16T12:46:12Z

camel/benchmarks/tbench_camel_agent.py

+            TerminalToolkit(**terminal_toolkit_kwargs).get_tools()  # type: ignore[arg-type]
+        )
+
+        usr_msg = f"{instruction}\n"


Unused variable.

zjrwtx

Thanks @Saedbhati, the advice from @hesamsheikh is good. Should we also update pyproject.toml? In addition, directory ID generation may conflict in a multi-process environment.

zjrwtx · 2025-09-16T13:45:50Z

camel/benchmarks/tbench_camel_agent.py

+        total_output_tokens = response.info['usage']['completion_tokens']
+
+        memory_list = (
+            camel_agent._memory._chat_history_block.storage.memory_list  # type: ignore[attr-defined]


Directly accessing multiple layers of private attributes violates encapsulation and is prone to breaking when the internals change.

zjrwtx

Let's add more docstrings to the core classes and functions below.

zjrwtx · 2025-09-16T13:48:06Z

camel/benchmarks/tbench_camel_agent.py

+        camel_agent: "ChatAgent",
+        logging_dir: Path | None = None,
+    ) -> "AgentResult":
+        """Execute a task using the Terminal Bench harness.


zjrwtx · 2025-09-16T13:48:21Z

camel/benchmarks/tbench_camel_agent.py

+        def create_timestamped_marker_from_memory(
+            records: List[dict],
+        ) -> List[Tuple[float, str]]:
+            """Create a timestamped marker from memory records."""


Saedbhati · 2025-09-16T14:21:39Z

@zjrwbx Thanks a lot for the review. Great to see you back!

enhance pr @3099

a9b54db

Saedbhati changed the base branch from terminal-bench to master September 10, 2025 07:34

Saedbhati changed the title ~~Enhance terminal bench~~ feat: terminal bench Sep 10, 2025

Saedbhati added 2 commits September 10, 2025 13:10

terminal updates

be50be3

update docs string

47b8d92

Wendong-Fan requested review from hesamsheikh, waleedalzarooni and a7m-1st September 10, 2025 08:00

Wendong-Fan assigned Saedbhati Sep 10, 2025

Wendong-Fan added the Review Required PR need to be reviewed label Sep 10, 2025

Merge branch 'master' into enhance-terminal-bench

127a9c9

waleedalzarooni reviewed Sep 12, 2025


Saedbhati and others added 4 commits September 15, 2025 16:48

Merge branch 'master' into enhance-terminal-bench

5c84eca

update docstring + comment

9b9ce45

update docstring + comment

463e1f2

precommit fix

e02f1da

Saedbhati requested a review from waleedalzarooni September 15, 2025 13:46

Merge branch 'master' into enhance-terminal-bench

15691b7

hesamsheikh reviewed Sep 15, 2025


waleedalzarooni approved these changes Sep 16, 2025


Saedbhati added 2 commits September 16, 2025 14:47

fix example

30dcd27

Merge branch 'enhance-terminal-bench' of https://github.com/camel-ai/…

c0ca938

…camel into enhance-terminal-bench

minor fix

4d5377c

Saedbhati requested a review from hesamsheikh September 16, 2025 09:30

hesamsheikh reviewed Sep 16, 2025


zjrwtx reviewed Sep 16, 2025


Saedbhati and others added 3 commits September 16, 2025 19:52

Merge branch 'master' into enhance-terminal-bench

6e09c6a

Merge branch 'master' into enhance-terminal-bench

36d1f33

update

f8d1a48

Saedbhati requested review from hesamsheikh and zjrwtx September 18, 2025 08:07
