feat: AEP 7: Queue system interface design #51
Conversation
wkirschenmann left a comment:
I feel that the tone of the text is more "justification-oriented" than explaining the rationale of the choices. Even though event-driven architectures are widely favored these days, they are not a target for all use cases.
> ## Functional Requirements
>
> To ensure optimal message handling within ArmoniK, strict control over the number of messages being processed simultaneously is required.
> Each message represents a task that can be executed, necessitating fine-grained lifecycle management.
Why?
The explanation is below. The paragraph structure doesn't show that.
> Each message represents a task that can be executed, necessitating fine-grained lifecycle management.
>
> When a message is received, the associated task begins processing immediately.
> The message is acknowledged only upon the successful or unsuccessful completion of the task.
"acknowledged" should be explained first.
You can also replace with "removed" if you don't want to do that.
> The main motivation for the current architecture is to maintain a historical interface that provides comprehensive traceability and monitoring of tasks and message processing. This interface is central to operational observability and is a key requirement for the system.
>
> # Rationale
In ArmoniK, the queue system contains messages representing the tasks that are ready to be executed (i.e. all their input data are available). These messages are processed by several Scheduling Agents.

Simplified message-processing algorithm (sketched in code below):
- Acquire the message
- Process the task
  - Prepare the task
  - Execute the task
  - Complete the task
- Acknowledge the message
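
A minimal sketch of that loop in C#; every type and member name below is hypothetical and does not necessarily match the actual ArmoniK.Core interfaces:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstractions, for illustration only; not the real ArmoniK.Core types.
public interface IQueueMessage
{
  string TaskId { get; }
}

public interface IQueue
{
  Task<IQueueMessage> AcquireAsync(CancellationToken ct); // waits until a message is available
  Task AcknowledgeAsync(IQueueMessage message);           // removes the message from the queue
  Task RequeueAsync(IQueueMessage message);               // makes the message available to another agent
}

public sealed class SchedulingAgentSketch
{
  private readonly IQueue             queue_;
  private readonly Func<string, Task> processTask_; // prepare + execute + complete the task

  public SchedulingAgentSketch(IQueue queue, Func<string, Task> processTask)
  {
    queue_       = queue;
    processTask_ = processTask;
  }

  // Simplified message-processing loop: acquire, process, acknowledge.
  public async Task RunAsync(CancellationToken ct)
  {
    while (!ct.IsCancellationRequested)
    {
      var message = await queue_.AcquireAsync(ct);   // acquire the message

      try
      {
        await processTask_(message.TaskId);          // prepare, execute and complete the task
        await queue_.AcknowledgeAsync(message);      // acknowledge only once the task has completed
      }
      catch (Exception)
      {
        await queue_.RequeueAsync(message);          // failure: let the queue redeliver the message
      }
    }
  }
}
```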
> To ensure optimal message handling within ArmoniK, strict control over the number of messages being processed simultaneously is required.
> Each message represents a task that can be executed, necessitating fine-grained lifecycle management.
>
> When a message is received, the associated task begins processing immediately.
> The message is acknowledged only upon the successful or unsuccessful completion of the task.
> If the task completes successfully, the message is acknowledged after the task's results are uploaded to the storage and the tasks created by the task in processing are submitted.
> If the queue service loses connection with the processing agent, the message is redelivered, ensuring task execution even in the presence of errors.
>
> Additionally, tasks that are pending can be released when a long-running task is in progress.
> ArmoniK initiates the processing of a new message only when the current task has been dispatched to a worker, starting the acquisition of a new task during the processing of the previous one.
> This design allows a task to be ready for execution as soon as the previous task ends.
> Tasks undergoing retry are treated as entirely new tasks, which simplifies tracking and execution, generating new messages accordingly.
>
> Message uniqueness is not required as it is managed elsewhere.
> Furthermore, message scheduling must be handled within the queue service to accommodate prioritization mechanisms.
> The system should also allow for the seamless integration of new plugins to enhance flexibility and adaptability.
Suggested change (replacing the paragraphs quoted above); a possible shape of the resulting interface is sketched after this list:

> 1. The Queue System should allow **multiple Scheduling Agents** to connect and get messages. While it should provide each Scheduling Agent with a different message, it need not guarantee uniqueness of distribution.
> 2. The Queue System should **not lose any message**.
> 3. As long as there are messages in the Queue System, the Scheduling Agents should receive messages, i.e. they cannot starve if there are messages in the queue.
> 4. The Queue System should allow the Scheduling Agents to have **strict control over the number of messages acquired**. This allows a Scheduling Agent to parallelize message processing, either through pipelining (e.g. preparing a task while executing another task) or by executing several tasks simultaneously.
> 5. The Queue System should make it possible to **detect when a Scheduling Agent is not responsive** and provide its messages to another Scheduling Agent. This feature is at the core of the resilience mechanisms of ArmoniK and ensures that all tasks will be processed even if there are errors on the executing nodes.
> 6. The Queue System should **allow a Scheduling Agent to "liberate" a message**, i.e. the message is signaled for the Queue System to provide it to another Scheduling Agent. When using pipelined parallelization, this feature can be used to avoid a message waiting too long while a long task is being executed.
>
> While properties 1, 2, and 3 are inherent to the chosen solution, the other features should be exposed through the interface.
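
As a rough illustration of how requirements 4, 5 and 6 could surface in the interface, here is a hedged C# sketch; every name in it is hypothetical and not necessarily the actual ArmoniK interface:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch of the interface; names are illustrative only.
public enum MessageOutcome
{
  Acknowledged, // the task finished (successfully or not): remove the message from the queue
  Released,     // "liberate" the message so another Scheduling Agent can pick it up (requirement 6)
}

public interface IMessageHandle : IAsyncDisposable
{
  string MessageId { get; }
  string TaskId    { get; }

  // Outcome applied when the handle is disposed. Keeping the handle alive (e.g. by renewing a
  // lease or visibility timeout under the hood) is what lets the Queue System detect an
  // unresponsive Scheduling Agent and redeliver the message (requirement 5).
  MessageOutcome Outcome { get; set; }
}

public interface IPullQueue
{
  // Pull at most nbMessages messages: the caller keeps strict control over how many messages
  // are acquired at any given time (requirement 4), e.g. to feed a processing pipeline.
  IAsyncEnumerable<IMessageHandle> PullMessagesAsync(int               nbMessages,
                                                     CancellationToken cancellationToken);
}
```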
> The current design prioritizes simplicity and robustness over a pure event-driven model. Considering the long duration of tasks and the cost of orchestration, some inefficiencies during message reception are acceptable. These inefficiencies are preferable to the development overhead and technical debt involved in transforming the existing pull-based mechanism. Rewriting the codebase to adopt a fully event-driven, push-based pattern is not feasible given the complexity and scale of such a transformation.
>
> The technical limitations of the underlying queueing service also play an important role in the design decisions. SQS, for instance, does not provide an API for push-based message reception, which necessitates the use of a pull-based approach. Additionally, past attempts to use RabbitMQ for push-based message handling revealed significant stability issues, such as connection losses and inconsistent message processing. These challenges have further reinforced the decision to rely on the pull-based model, which offers greater reliability and predictability.
>
> We can consider converting a push-based reception of messages into our pull-based interfaces. Naturally, such an implementation would need to be proven as stable and as scalable as the current one. Pull requests are welcome to improve the implementation of our queue system adaptors.
Suggested change (replacing the three paragraphs quoted above); the pipelining point is sketched in code after the suggestion:

> ### Event-driven approach (push)
>
> Using an event-driven approach allows the Queue System to push the messages to the Scheduling Agent. The control of "when the messages arrive in the Scheduling Agent" is moved to the Queue System. This reduces the latency between the submission of a message in the Queue System and its dispatch in the Scheduling Agent. On the other hand, managing the number of messages received by the Scheduling Agent is more complex, or impossible if the number of messages allowed in each Scheduling Agent can change frequently over time.
>
> ### Pull-based approach
>
> Using a pull-based approach, the Scheduling Agent has to call the Queue System to try and get a new message. This allows the Scheduling Agent to control when a new message is obtained. However, this increases the latency to get new messages.
>
> Note that in the Scheduling Agent's pipelined implementation, polling is used on internal queues to regulate the number of messages in each step of processing.
>
> ### Choice
>
> The technical limitations of the underlying queueing service also play an important role in the design decisions. SQS, for instance, does not provide an API for push-based message reception, which necessitates the use of a pull-based approach. A library can expose a push-based interface while using a pull-based Queue System (or vice versa).
>
> In the end, we believe that the choice made should be the result of two main constraints:
>
> * a pipelined implementation
> * a need to minimize the number of messages stuck in the steps prior to task execution
>
> These constraints are more easily taken into account with a pull-based approach, as the information needed to decide whether or not a new message should enter the pipeline is available in the Scheduling Agent rather than in the Queue System.
>
> Additionally, past experiments using RabbitMQ for push-based message handling revealed significant stability issues, such as connection losses and inconsistent message processing. These challenges have further reinforced the decision to rely on the pull-based model, which offers greater reliability and predictability.
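
To make the pipelining point concrete, here is a small sketch (an illustration only, not the ArmoniK.Core implementation) using a bounded .NET channel as the internal queue that regulates how many messages may sit between acquisition and execution:

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative pipeline: a bounded internal queue between the "pull/prepare" step and the
// "execute" step limits how many messages can be in flight inside the Scheduling Agent.
public static class PipelineSketch
{
  public static async Task RunAsync(Func<CancellationToken, Task<string>> pullMessageAsync, // pull from the Queue System
                                    Func<string, Task>                    executeTaskAsync,
                                    CancellationToken                     ct)
  {
    // At most one prepared message waits here, so only one extra message is acquired
    // in advance while the previous task is still running.
    var prepared = Channel.CreateBounded<string>(new BoundedChannelOptions(1)
                                                 {
                                                   FullMode = BoundedChannelFullMode.Wait,
                                                 });

    var producer = Task.Run(async () =>
                            {
                              while (!ct.IsCancellationRequested)
                              {
                                var message = await pullMessageAsync(ct);      // pull-based acquisition
                                await prepared.Writer.WriteAsync(message, ct); // blocks while the step is full
                              }
                            },
                            ct);

    // A prepared message is available as soon as the previous task ends.
    await foreach (var message in prepared.Reader.ReadAllAsync(ct))
    {
      await executeTaskAsync(message);
    }

    await producer;
  }
}
```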
I think it is really crucial to explain long polling as it is a good tradeoff between the two, and it is what is actually used.
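
For illustration, long polling sits between the two approaches: the receive call is still pull-based, but the server holds it open until a message arrives or a timeout expires, avoiding busy polling while keeping control on the consumer side. A minimal sketch with the AWS SDK for .NET (SQS is used here only because the text already mentions it; the queue URL and timeouts are placeholders):

```csharp
using Amazon.SQS;
using Amazon.SQS.Model;

var sqs      = new AmazonSQSClient();
var queueUrl = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"; // placeholder

// Long polling: the call returns as soon as a message is available,
// or after at most WaitTimeSeconds if the queue stays empty.
var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
                                             {
                                               QueueUrl            = queueUrl,
                                               MaxNumberOfMessages = 1,  // strict control over how many messages are acquired
                                               WaitTimeSeconds     = 20, // long-poll duration
                                               VisibilityTimeout   = 30, // redelivered if not deleted in time
                                             });

foreach (var message in response.Messages)
{
  // ... process the task associated with the message ...

  // Acknowledging a message in SQS means deleting it.
  await sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle);
}
```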
> # Conclusion
>
> The architecture reflects a deliberate balance between design complexity, operational requirements, and system constraints. By combining a historical interface with strict control over message processing and lifecycle management, ArmoniK achieves a reliable and scalable system. Although the system does not adopt a fully event-driven approach, it remains robust and resilient, meeting the demands of long-running tasks and orchestration challenges. The trade-offs in this design are justified by the system’s operational stability and its ability to handle large-scale workloads efficiently.
Suggested change (replacing the paragraph quoted above):

> The architecture reflects a deliberate balance between design complexity, operational requirements, and system constraints. The interface reflects an architecture that allows strict and precise control over message processing and lifecycle management. This allows ArmoniK to achieve a reliable and scalable system. The trade-offs in this design are justified by the system’s operational stability and its ability to handle large-scale workloads efficiently.
The sentence "Although the system does not adopt a fully event-driven approach, it remains robust and resilient, meeting the demands of long-running tasks and orchestration challenges." has been removed, as it could be interpreted to mean that a fully event-driven approach should be a target. This is not the case.
lemaitre-aneo left a comment:
I think more explanation should be given about the queue models, with the following concepts explained in more detail:
Pulling, Pushing, Polling, Long Polling, Event-based, Acquire, Acknowledge
> Furthermore, message scheduling must be handled within the queue service to accommodate prioritization mechanisms.
> The system should also allow for the seamless integration of new plugins to enhance flexibility and adaptability.
>
> ## Possible Approaches for the Interface
You need to talk about long polling
> Waiting,
>
> /// <summary>
> /// Message processing has failed. The message should be put back at the begin of the queue.
I am not sure about this one.
That's how it is currently written in our code.