-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Labels
bugSomething isn't workingSomething isn't workingquestionFurther information is requestedFurther information is requested
Description
When running a reduced version of our tool-caller agent workflow I noticed what seems like an error in the state handling between tool calls. The run behavior is like this:
- the router works (with manual routing logic), the ignored LLM output looks fine,
- the first tool call run_mpnn appears to work
- The following tool step -- score_mpnn -- fails to update the state. As a result, score_mpnn runs again for some or all pipelines, maybe looping on that tool a time or two and crashing.
That all happens in about two minutes on a single A100 after which the job hangs.
I can't tell exactly where is the disconnect between the successful tool run and the failed state update, but thought maybe you could take a look? My code, Anvil slurm script, and example slurm output are attached.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't workingquestionFurther information is requestedFurther information is requested