
Tasks retry after failures (API Rate Limit / Model Throttling) #3233

Open · T4n17 wants to merge 6 commits into main

Conversation

@T4n17 commented Jul 29, 2025

This feature helps the crew avoid getting stuck after a failing task caused by an API rate limit or model throttling error.
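
For context, a minimal usage sketch of what this PR enables. The two new fields `max_retries_after_failure` and `max_delay_after_failure` are taken from the diff below; the surrounding Agent/Crew setup and the exact field semantics (retry budget and delay in seconds) are assumptions:

```python
from crewai import Agent, Crew, Task

researcher = Agent(
    role="Researcher",
    goal="Collect results from a rate-limit-prone API",
    backstory="An agent that frequently hits provider throttling.",
)

# Hypothetical usage of the fields added in this PR: retry a failed task
# up to 5 times, waiting 30 seconds between attempts.
task = Task(
    description="Summarize the latest provider status reports",
    expected_output="A short summary",
    agent=researcher,
    max_retries_after_failure=5,
    max_delay_after_failure=30,
)

crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
```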

The relevant change from the diff:

```python
raise e  # Re-raise the exception after emitting the event
if self.number_of_retries_remaining_after_failure > 0:
    # Retrying Task execution after failure
    time.sleep(self.max_delay_after_failure)
```
Contributor commented:

time.sleep is a blocking call in multithreaded systems and may degrade performance in CrewAI's parallel executions. Consider replacing it with a non-blocking retry or scheduling mechanism.

Author (@T4n17) replied:

Fixed this: I used a threading.Timer instead of the blocking sleep call.
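
A minimal sketch of the mechanism (not the PR's exact code): `threading.Timer` invokes the retry callback on a separate timer thread after the delay, so the caller is not blocked for the whole wait. In the PR, the callback would presumably decrement `number_of_retries_remaining_after_failure` and re-invoke `_execute_core`; the helper below is a hypothetical simplification:

```python
import threading
from typing import Callable


def schedule_retry(delay_seconds: float, attempt: Callable[[], None]) -> threading.Timer:
    """Run `attempt` on a timer thread after `delay_seconds` without
    blocking the calling thread (unlike time.sleep)."""
    timer = threading.Timer(delay_seconds, attempt)
    timer.daemon = True  # don't keep the process alive just for a retry
    timer.start()
    return timer


# Usage sketch: schedule one retry attempt 30 seconds from now.
schedule_retry(30.0, lambda: print("retrying task execution"))
```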

Comment on lines 164 to 166
```python
number_of_retries_remaining_after_failure: int = Field(
    default=max_retries_after_failure.default,
    description="Number of retries remaining after a Task failure",
)
```
Contributor commented:

Might this generate a runtime execution error?

Author (@T4n17) replied Aug 6, 2025:

Fixed. I set a fixed default value of 5.
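
A sketch of the corrected declaration, assuming both fields now carry the literal default of 5 instead of one field reading the other's `FieldInfo.default` at class-definition time. The first field's description is an assumption; it does not appear in the diff:

```python
from pydantic import BaseModel, Field


class Task(BaseModel):
    # Sketch: only the two retry-related fields are shown.
    max_retries_after_failure: int = Field(
        default=5, description="Maximum number of retries after a Task failure"
    )
    number_of_retries_remaining_after_failure: int = Field(
        default=5, description="Number of retries remaining after a Task failure"
    )
```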

Later in the same diff:

```python
    self.number_of_retries_remaining_after_failure -= 1
    return self._execute_core(agent, context, tools)
else:
    crewai_event_bus.emit(
        self,
        TaskCompletedEvent(
            output=TaskOutput(description="Task failed", agent=self.agent.role),
            task=self,
        ),
    )
```
Contributor commented:

We should not emit TaskCompletedEvent here. Consider raising an error instead.

Author (@T4n17) replied:

Raising the error would just cause the crew to hang, since it waits for all tasks to complete. I changed this to return a TaskOutput describing the error instead of emitting the TaskCompletedEvent.
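
A sketch of the resulting final-failure branch inside the task's exception handler (a fragment: `agent`, `context`, `tools`, and the caught exception `e` come from the surrounding method; the exact message and use of `TaskOutput`'s `raw` field are assumptions):

```python
# Inside the except block of Task._execute_core (sketch):
if self.number_of_retries_remaining_after_failure > 0:
    self.number_of_retries_remaining_after_failure -= 1
    return self._execute_core(agent, context, tools)
else:
    # Final failure: return an error-describing output so the crew's
    # wait-for-all-tasks loop completes, instead of emitting
    # TaskCompletedEvent or re-raising.
    return TaskOutput(
        description=f"Task failed after all retries: {e}",
        agent=self.agent.role,
        raw=str(e),
    )
```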

@T4n17 requested a review from lucasgomide on August 6, 2025 at 14:29.