Integration tests or benchmarks? #3

Open
AidanTilgner opened this issue Aug 26, 2024 · 0 comments

Comments

@AidanTilgner

In reference to this PR and the mention of integration tests to track updates and performance, I was wondering whether integration tests for this kind of project should look more like benchmarks.

There's always the basic sort of pass/fail test that can be run to tell us whether something still works. However, with a non-deterministic system such as an agent, testing how well the system "works" needs to be a bit broader. Essentially, I'm thinking of a test suite containing a series of pass/fail tests, each designed to test a specific criterion. Then, whenever changes are made to UC (or other agents, for that matter), they could be run against the benchmark to see whether the update improved performance or not.
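
To make that concrete, here's a minimal sketch of what one such criterion test and a suite runner could look like. All names here are hypothetical, nothing from UC today:

```python
# Hypothetical sketch of a pass/fail benchmark case per criterion.
# None of these names exist in UC; they're just to make the idea concrete.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkCase:
    criterion: str                 # e.g. "tool-selection"
    task_prompt: str               # the task handed to the agent
    passes: Callable[[str], bool]  # judges the agent's transcript/output

def run_suite(run_agent: Callable[[str], str], cases: list[BenchmarkCase]) -> dict[str, float]:
    """Run every case and return the pass rate per criterion (0.0 to 1.0)."""
    results: dict[str, list[bool]] = {}
    for case in cases:
        output = run_agent(case.task_prompt)
        results.setdefault(case.criterion, []).append(case.passes(output))
    return {crit: sum(outcomes) / len(outcomes) for crit, outcomes in results.items()}
```

Comparing those pass rates before and after a change (say, an update to the file editing tool) would show whether the change helped or hurt.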

There are definitely benchmarks that exist already, and I'm going to look around some more. However, I haven't seen very comprehensive ones, and UC would need an adapter to actually use an existing one anyway. Either way, I'm planning to get UC hooked up to some benchmarking suite, so that something like an update to the file editing tool shows up as either a performance improvement or a regression.

I think I'm going to try to work out a manual benchmark first, just getting things down conceptually to start. Then, I can see about automating it.

So far I'm thinking about covering these key areas (a rough sketch of how they could roll up into scores follows the list):

  • Tools
    • Tool Selection: How well does the agent select which tool to use in a given situation?
    • Tool Use: How well does the agent use a tool for maximum effectiveness?
      • Editing: How well does it use editing tools?
      • Commands: How well does it execute commands fitting the task?
  • Reasoning
    • Plan Formation: How well does it form specific plans?
    • Steps Required: How many steps between task initialization and completion?
    • Task Understanding: How well does it understand instructions, versus misinterpreting them?
    • Redundancy: How often are actions repeated when they don't need to be?
    • Multi-Step Operations: How well are multi-step operations performed?
    • Assumptions: Does the model make assumptions about tasks that could lead to accuracy deficits?
      • Assumption Rate: Given a task, how many assumptions are made about the nature of the task?
      • Assumption Accuracy: Given a task, how accurate are the assumptions that are made?
  • Code
    • Quality: How good is the code, based on several metrics?
    • Optimization: How well can it optimize existing code?
    • Testing: How well does the agent test its code?
      • Test Accuracy: How often do the tests fail when they shouldn't?
  • Robustness
    • Inter-model Ability: How well does the agent perform when running on different underlying models?
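
To keep results comparable between runs, the areas above could be rolled up into one weighted score per category, something like the sketch below. The weights and criterion names are placeholders for discussion, not settled values:

```python
# Placeholder rubric mapping the key areas above to weighted criteria.
# Weights are arbitrary examples, not a proposal of final values.
RUBRIC: dict[str, dict[str, float]] = {
    "tools": {"tool_selection": 0.5, "tool_use": 0.5},
    "reasoning": {
        "plan_formation": 0.2, "steps_required": 0.15, "task_understanding": 0.2,
        "redundancy": 0.1, "multi_step_operations": 0.2, "assumptions": 0.15,
    },
    "code": {"quality": 0.4, "optimization": 0.3, "testing": 0.3},
    "robustness": {"inter_model_ability": 1.0},
}

def area_scores(criterion_scores: dict[str, float]) -> dict[str, float]:
    """Collapse per-criterion pass rates (0..1) into one weighted score per key area."""
    return {
        area: sum(w * criterion_scores.get(name, 0.0) for name, w in weights.items())
        for area, weights in RUBRIC.items()
    }
```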

If memory gets added:

  • Memory
    • Memory Creation: How well does it decide which memories to create?
    • Memory Recall: How accurately does it pull old memories into working memory?
    • Memory Management: How well is the line between proposed and working memory handled?

This is all a rough draft, though. I'm opening this issue as a reference point and a starting point for discussion around this topic!
