Tasks retry after failures (API Rate Limit / Model Throttling) #3233
base: main
Conversation
src/crewai/task.py
Outdated
```python
    raise e  # Re-raise the exception after emitting the event
if self.number_of_retries_remaining_after_failure > 0:
    # Retrying Task execution after failure
    time.sleep(self.max_delay_after_failure)
```
`time.sleep` is a blocking call and may degrade performance in CrewAI's parallel executions, since it holds a worker thread idle for the whole delay. Consider replacing it with a non-blocking retry or scheduling mechanism.
Fixed this. I used a threading timer instead of the blocking sleep call.
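For illustration, here is a minimal sketch of the timer-based approach being discussed, using `threading.Timer` to schedule the retry on a timer thread rather than blocking the caller with `time.sleep`. The helper name `execute_with_retry` and the demo `flaky` function are hypothetical, not CrewAI's actual code; the `timer.join()` is only there to make the demo deterministic, and real non-blocking code would return immediately and let the timer fire asynchronously.

```python
import threading

def execute_with_retry(task_fn, retries_remaining, delay, results):
    # Run task_fn; on failure, schedule a retry via threading.Timer
    # instead of sleeping in the current thread.
    try:
        results.append(task_fn())
    except Exception:
        if retries_remaining > 0:
            timer = threading.Timer(
                delay,
                execute_with_retry,
                args=(task_fn, retries_remaining - 1, delay, results),
            )
            timer.start()
            timer.join()  # demo only: real code would not block here
        else:
            raise

attempts = {"n": 0}

def flaky():
    # Fails twice (simulating rate-limit errors), then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

out = []
execute_with_retry(flaky, retries_remaining=5, delay=0.01, results=out)
print(out[0])  # -> ok
```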
src/crewai/task.py
Outdated
```python
number_of_retries_remaining_after_failure: int = Field(
    default=max_retries_after_failure.default,
    description="Number of retries remaining after a Task failure",
)
```
This might generate a runtime execution error, wouldn't it?
Fixed. I set a fixed default value (5).
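A short sketch of the fix under discussion: referencing another field's `FieldInfo.default` at class-definition time is fragile, while a fixed literal default is safe. The `Task` model below is a simplified stand-in for the real one, assuming Pydantic is available.

```python
from pydantic import BaseModel, Field

class Task(BaseModel):
    # Both fields use a fixed literal default (5) rather than one field
    # reading the other's FieldInfo.default during class construction.
    max_retries_after_failure: int = Field(
        default=5, description="Maximum retries after a Task failure"
    )
    number_of_retries_remaining_after_failure: int = Field(
        default=5, description="Number of retries remaining after a Task failure"
    )

t = Task()
print(t.number_of_retries_remaining_after_failure)  # -> 5
```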
src/crewai/task.py
Outdated
```python
    self.number_of_retries_remaining_after_failure -= 1
    return self._execute_core(agent, context, tools)
else:
    crewai_event_bus.emit(
        self,
        TaskCompletedEvent(
            output=TaskOutput(description="Task failed", agent=self.agent.role),
            task=self,
        ),
    )
```
We should not emit `TaskCompletedEvent` here. Consider raising an error instead.
Raising the error would just cause the crew to hang, since it waits for all tasks to complete. I changed this to return a TaskOutput describing the error instead of emitting the TaskCompletedEvent.
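To make the trade-off concrete, here is a minimal sketch of the pattern described above: when retries are exhausted, return an error-describing output instead of raising, so a crew that waits on every task still receives a result. The `TaskOutput` dataclass and `execute_core` helper are simplified stand-ins, not the actual CrewAI implementations.

```python
from dataclasses import dataclass

@dataclass
class TaskOutput:
    description: str
    agent: str

def execute_core(run, agent_role, retries_remaining):
    # Retry on failure; once retries are exhausted, return a TaskOutput
    # describing the error rather than re-raising, so the caller that is
    # waiting for all tasks to finish is never left hanging.
    try:
        return run()
    except Exception as e:
        if retries_remaining > 0:
            return execute_core(run, agent_role, retries_remaining - 1)
        return TaskOutput(description=f"Task failed: {e}", agent=agent_role)

def always_fails():
    raise RuntimeError("API rate limit")

result = execute_core(always_fails, "Researcher", retries_remaining=2)
print(result.description)  # -> Task failed: API rate limit
```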
This feature would help avoid the crew getting stuck on a failing task caused by an API rate-limit error or a model-throttling error.