Mains celery worker and RabbitMQ FIFO behavior is not fault-tolerant when plugin instance status checks are not working #558

Open
3 tasks
jennydaman opened this issue Jun 5, 2024 · 0 comments


CUBE polls pfcon for the status of plugin instances. However, in certain failure modes the poll itself cannot succeed. For example:

  • pfcon is unreachable (temporary failure)
  • the job for the plugin instance was deleted from Kubernetes (permanent failure)

CUBE keeps retrying these polls. The problem is that some of these failures are permanent, so CUBE retries the failing poll indefinitely.

CUBE polls plugin instances in first-in, first-out (FIFO) order. This means that if too many plugin instances are stuck in an errored state, newer plugin instances that are working will never be polled.
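To illustrate the starvation, here is a toy model, not CUBE's actual code. It assumes each polling cycle checks at most `batch_size` of the oldest unfinished plugin instances, that a healthy instance finishes as soon as it is polled, and that a stuck instance never leaves the queue. The function name and parameters are hypothetical.

```python
def simulate_fifo(instances, batch_size, cycles):
    """Toy FIFO poller.

    instances: list of (instance_id, is_stuck), ordered oldest first.
    Returns the ids of the instances that finished, in order.
    """
    pending = list(instances)
    finished = []
    for _ in range(cycles):
        # Only the oldest `batch_size` unfinished instances get polled.
        for instance_id, is_stuck in pending[:batch_size]:
            if not is_stuck:
                finished.append(instance_id)
        # Stuck instances stay at the front of the queue forever.
        pending = [(i, s) for (i, s) in pending if i not in finished]
    return finished


queue = [("s1", True), ("s2", True), ("s3", True), ("h1", False), ("h2", False)]
print(simulate_fifo(queue, batch_size=3, cycles=10))  # [] : h1 and h2 are never polled
print(simulate_fifo(queue, batch_size=4, cycles=2))   # ['h1', 'h2']
```

In this model, once the number of stuck instances reaches the batch size, no newer instance is ever polled, no matter how many cycles run.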

Suggested solutions

  • CUBE should discriminate between "temporary" failures and "permanent" failures. When a "permanent" failure is encountered, the plugin instance status should be set to "cancelled" (or some other indication of system error).
  • CUBE should use a priority queue instead of a FIFO queue. Plugin instances whose polls repeatedly return a "temporary" failure should be de-prioritized.
  • Document how to configure the polling queue size.
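
A rough sketch of how the first two suggestions could fit together, using only the Python standard library. Everything here is hypothetical: `PollResult`, `PollQueue`, and the `check_status` callback are illustrative names, not existing CUBE or pfcon APIs, and how a real poll would be classified as temporary or permanent is left to the caller.

```python
import heapq
import itertools
from enum import Enum


class PollResult(Enum):
    OK = "ok"                      # status was retrieved
    TEMPORARY_FAILURE = "temporary"  # e.g. pfcon unreachable
    PERMANENT_FAILURE = "permanent"  # e.g. job deleted from Kubernetes


class PollQueue:
    """Priority queue keyed on failure count; ties are broken FIFO."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # insertion order tie-breaker
        self.cancelled = []                # instances given up on

    def add(self, instance_id, failures=0):
        heapq.heappush(self._heap, (failures, next(self._counter), instance_id))

    def poll_next(self, check_status):
        """Poll the instance with the fewest failures so far."""
        failures, _, instance_id = heapq.heappop(self._heap)
        result = check_status(instance_id)
        if result is PollResult.TEMPORARY_FAILURE:
            # Re-queue behind instances that have failed less often.
            self.add(instance_id, failures + 1)
        elif result is PollResult.PERMANENT_FAILURE:
            # Stop polling; the caller would mark the instance "cancelled".
            self.cancelled.append(instance_id)
        # On OK the caller decides whether to re-add the instance.
        return instance_id, result


statuses = {
    "flaky": PollResult.TEMPORARY_FAILURE,
    "deleted": PollResult.PERMANENT_FAILURE,
    "new": PollResult.OK,
}
q = PollQueue()
q.add("flaky")
q.add("deleted")
print(q.poll_next(statuses.get)[0])  # flaky (re-queued with one failure)
q.add("new")
print(q.poll_next(statuses.get)[0])  # deleted (moved to q.cancelled)
print(q.poll_next(statuses.get)[0])  # new (polled before the flaky instance)
```

One caveat with this design: an instance that keeps failing temporarily can itself be starved while healthy instances keep arriving, so a real implementation would probably also want a retry cap or a time-based backoff.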