feat: terminal bench #3118
What is this PR about? What issue does it solve?
@hesamsheikh Sorry, I forgot to mention the issue. This PR solves #2464.
Great work @Saedbhati !
A very comprehensive integration. I have only included a couple of minor comments to help improve it slightly.
if os.path.exists(
    f"{self.logging_dir}/{container_name}_run{run_id:02d}"
):
    run_id += 1
I think this may cause race conditions when multiple processes are running. Perhaps consider switching to timestamp-based IDs or atomic directory creation.
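One way to avoid the check-then-increment race is to let the filesystem arbitrate: try to create the directory, and move to the next ID only when creation fails because the directory already exists. The sketch below is an illustration, not the PR's code; `claim_run_dir` is a hypothetical helper, and starting the counter at 0 is an assumption. Only the `{container_name}_run{run_id:02d}` naming comes from the snippet above.

```python
import os
import tempfile
from pathlib import Path


def claim_run_dir(
    logging_dir: Path, container_name: str, max_runs: int = 1000
) -> Path:
    """Atomically claim the first free `<container_name>_runNN` directory.

    Hypothetical helper for illustration only.
    """
    for run_id in range(max_runs):
        candidate = logging_dir / f"{container_name}_run{run_id:02d}"
        try:
            # os.mkdir raises FileExistsError if another process created the
            # directory first, so the existence check and the creation happen
            # in a single step instead of two racy ones.
            os.mkdir(candidate)
            return candidate
        except FileExistsError:
            continue
    raise RuntimeError(f"no free run directory under {logging_dir}")


# Usage: two consecutive claims never return the same directory.
with tempfile.TemporaryDirectory() as tmp:
    first = claim_run_dir(Path(tmp), "container")
    second = claim_run_dir(Path(tmp), "container")
    claimed_names = (first.name, second.name)
```

A timestamp-based ID would also work, but two processes started within the same clock tick could still collide unless directory creation itself is the arbiter.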
""" | ||
pb = self._previous_buffer.strip()
if pb in current_buffer:
    idx = current_buffer.index(pb)
I think there may be an index mix-up here. The method uses an index from the previous buffer as if it were an index into the current buffer. pb.rfind("\n") can return a position within the previous buffer string, but that position is then used to slice the current buffer.
Example:
Previous: "hello\nworld" ← idx=5 points to \n here
Current: "hello\nworld\nnew" ← but we use idx=5 here
Result: "world\nnew" ← wrong! includes "world"
Should be: "new" ← correct! just the new part
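The intended behavior in the example above can be sketched as follows. This is a minimal illustration under assumptions, not the PR's method: `new_output` is a hypothetical standalone function, and it assumes the previous buffer appears verbatim inside the current one. The key point is that every offset is computed against the current buffer.

```python
def new_output(previous_buffer: str, current_buffer: str) -> str:
    """Return only the text that appeared after the previous buffer.

    Hypothetical helper for illustration only.
    """
    pb = previous_buffer.strip()
    if pb and pb in current_buffer:
        # Locate the previous buffer inside the *current* buffer, then skip
        # past all of it, so no offset from the old string is reused.
        idx = current_buffer.index(pb)
        return current_buffer[idx + len(pb):].lstrip("\n")
    # Previous content is gone (e.g. the screen was cleared): return everything.
    return current_buffer


print(new_output("hello\nworld", "hello\nworld\nnew"))  # prints "new"
```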
camel/benchmarks/tbench.py
Outdated
if TYPE_CHECKING:
    from terminal_bench.agents.agent_name import (  # type: ignore[import-untyped]
        AgentName,
    )
    from terminal_bench.agents.base_agent import (  # type: ignore[import-untyped]
        AgentResult,
        BaseAgent,
    )
    from terminal_bench.handlers.trial_handler import (  # type: ignore[import-untyped]
        TrialHandler,
    )
    from terminal_bench.harness import Harness  # type: ignore[import-untyped]
    from terminal_bench.terminal.tmux_session import (  # type: ignore[import-untyped]
        TmuxSession,
Let's add terminal_bench to pyproject.toml.
Thanks for the PR @Saedbhati! Also, have you been able to run the example? I got an error.
It's probably because Harness is imported inside a TYPE_CHECKING block, which means it's only available during static type checking, not at runtime. But the TerminalBench class tries to inherit from Harness at runtime, hence the NameError.
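The failure mode described here can be reproduced without terminal_bench installed. The sketch below is an illustration only: it uses the standard library's `OrderedDict` as a stand-in for `Harness`, and `define_subclass` is a hypothetical function name.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # This branch is seen by static type checkers but never executed,
    # so the name below does not exist at runtime.
    from collections import OrderedDict


def define_subclass() -> str:
    try:
        # Base classes are evaluated when the class statement runs,
        # so the missing runtime import surfaces here.
        class MyDict(OrderedDict):
            pass
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


result = define_subclass()
```

Names used only in annotations can stay under TYPE_CHECKING (with string annotations), but anything used as a base class or called at runtime needs a real import, possibly deferred into the function or guarded for an optional dependency.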
No major changes needed, so I'm happy with this version once the comments are addressed.
…camel into enhance-terminal-bench
@hesamsheikh I have fixed the example; you can give it a try.
Thanks for the fix, @Saedbhati.
I can run the example now. A few small fixes are still required.
camel/benchmarks/tbench.py
Outdated
self,
output_path: Path,
run_id: str,
ChatAgent: ChatAgent,
Is there a specific reason this is not snake_case?
examples/benchmarks/tbench.py
Outdated
    save_to="tbench_results",
    processes=1,
)
print(TBench_instance.run(camel_agent=camel_agent, subset=2))
This must be:
print(TBench_instance.run(agent=camel_agent, subset=2))
camel/benchmarks/tbench.py
Outdated
Args:
    agent: The ChatAgent to use for the benchmark.
    on: The data split to run the benchmark on.
This must be removed.
    TerminalToolkit(**terminal_toolkit_kwargs).get_tools()  # type: ignore[arg-type]
)

usr_msg = f"{instruction}\n"
Unused variable.
Thanks @Saedbhati. The advice from @hesamsheikh is good. Should we also update pyproject.toml? In addition, directory ID generation may conflict in a multi-process environment.
total_output_tokens = response.info['usage']['completion_tokens']

memory_list = (
    camel_agent._memory._chat_history_block.storage.memory_list  # type: ignore[attr-defined]
Directly accessing nested private attributes breaks encapsulation and is likely to fail when the internal implementation changes.
Let's add more docstrings to the core classes and functions.
    camel_agent: "ChatAgent",
    logging_dir: Path | None = None,
) -> "AgentResult":
    """Execute a task using the Terminal Bench harness.
r""""""
def create_timestamped_marker_from_memory(
    records: List[dict],
) -> List[Tuple[float, str]]:
    """Create a timestamped marker from memory records."""
r""""""
@zjrwbx Thanks a lot for the review. Great to see you back!
Description
Fixes #2464.
Does not work on Windows and requires Python 3.12.
Checklist
Go over all the following points, and put an x in all the boxes that apply.
Fixes #issue-number in the PR description (required)
pyproject.toml and uv lock
If you are unsure about any of these, don't hesitate to ask. We are here to help!