From 4e17046405ae0c1146292281917aa38b1112f10e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Gurhem?= Date: Wed, 15 Jan 2025 09:03:50 +0100 Subject: [PATCH 1/4] feat: AEP 7: Queue system interface design --- AEP/aep-00007.md | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 AEP/aep-00007.md diff --git a/AEP/aep-00007.md b/AEP/aep-00007.md new file mode 100644 index 0000000..66e1c90 --- /dev/null +++ b/AEP/aep-00007.md @@ -0,0 +1,42 @@ +# AEP 7: Queue system interface design + +| |ArmoniK Enhancement Proposal| +|---: |:--- | +| **AEP** | 7 | +| **Title** | Queue system interface design | +| **Author** | Jérôme Gurhem <> | +| **Status** | Active | +| **Type** | Standard | +| **Creation Date** | 2025-01-15 | + +# Abstract + +This proposal outlines the rationale and design decisions behind the current message processing and lifecycle management system in ArmoniK. It highlights the primary reasons for adopting a pull-based model, discusses design constraints, and evaluates the trade-offs involved in ensuring reliability and robustness for long-running tasks. + +# Motivation + +The main motivation for the current architecture is to maintain a historical interface that provides comprehensive traceability and monitoring of tasks and message processing. This interface is central to operational observability and is a key requirement for the system. + +# Rationale + +## Strict Control of Message Processing + +The system is designed to enforce strict control over the number of messages being processed and their lifecycle. Each message in the system represents an executable task with a well-defined processing flow. An executable task is a task whose dependencies were completed, thus being queued for execution. When a message is received, the corresponding task begins execution. The message is acknowledged only after the task completes, regardless of whether it succeeds or fails. 
In cases where the queueing service loses its connection with the agent before the message is acknowledged, the system ensures reliability by redelivering the message. This guarantees that tasks are not lost and are executed even when errors or interruptions occur. + +Retries for tasks are handled by treating them as new tasks. Each retry is introduced as a new message, which simplifies tracking and execution. Messages are acknowledged when the task completes successfully and subtasks are submitted. ArmoniK processes new messages only after the current task has been dispatched to a worker, starting the acquisition of a new task during the processing of the previous one. This design allows to have a task ready for execution as soon as the previous task ends. + +## Trade-offs and Constraints + +Rewriting the codebase to adopt a fully event-driven, push-based pattern is not feasible given the complexity and scale of such a transformation. The current design prioritizes simplicity and robustness over a pure event-driven model. Considering the long duration of tasks and the cost of orchestration, some inefficiencies during message reception are acceptable. These inefficiencies are preferable to the development overhead and technical debt involved in transforming the existing pull-based mechanism. + +The technical limitations of the underlying queueing service also play a role in the design decisions. SQS, for instance, does not provide an API for push-based message reception, which necessitates the use of a pull-based approach. Additionally, past attempts to use RabbitMQ for push-based message handling revealed significant stability issues, such as connection losses and inconsistent message processing. These challenges have further reinforced the decision to rely on the pull-based model, which offers greater reliability and predictability. + +We can consider converting a push-based reception to messages into our pull-based interfaces. 
However, this implementation requires to be as stable as the current one. Pull requests are welcome to improve the implementation of our queue system adaptors. + +# Conclusion + +The architecture reflects a deliberate balance between design complexity, operational requirements, and system constraints. By combining a historical interface with strict control over message processing and lifecycle management, ArmoniK achieves a reliable and scalable system. Although the system does not adopt a fully event-driven approach, it remains robust and resilient, meeting the demands of long-running tasks and orchestration challenges. The trade-offs in this design are justified by the system’s operational stability and its ability to handle large-scale workloads efficiently. + +# Copyright + +This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive From e0e31f42662d283cd0f4263f434c3cbfe5e75677 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Gurhem?= Date: Wed, 29 Jan 2025 13:10:32 +0100 Subject: [PATCH 2/4] refactor: rewrite AEP to explain our design choices and the current implementation --- AEP/aep-00007.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 152 insertions(+), 3 deletions(-) diff --git a/AEP/aep-00007.md b/AEP/aep-00007.md index 66e1c90..e9afef3 100644 --- a/AEP/aep-00007.md +++ b/AEP/aep-00007.md @@ -19,11 +19,38 @@ The main motivation for the current architecture is to maintain a historical int # Rationale -## Strict Control of Message Processing +## Functional Requirements -The system is designed to enforce strict control over the number of messages being processed and their lifecycle. Each message in the system represents an executable task with a well-defined processing flow. An executable task is a task whose dependencies were completed, thus being queued for execution. When a message is received, the corresponding task begins execution. 
The message is acknowledged only after the task completes, regardless of whether it succeeds or fails. In cases where the queueing service loses its connection with the agent before the message is acknowledged, the system ensures reliability by redelivering the message. This guarantees that tasks are not lost and are executed even when errors or interruptions occur. +To ensure optimal message handling within ArmoniK, strict control over the number of messages being processed simultaneously is required. +Each message represents a task that can be executed, necessitating fine-grained lifecycle management. -Retries for tasks are handled by treating them as new tasks. Each retry is introduced as a new message, which simplifies tracking and execution. Messages are acknowledged when the task completes successfully and subtasks are submitted. ArmoniK processes new messages only after the current task has been dispatched to a worker, starting the acquisition of a new task during the processing of the previous one. This design allows to have a task ready for execution as soon as the previous task ends. +When a message is received, the associated task begins processing immediately. +The message is acknowledged only upon the successful or unsuccessful completion of the task. +If the task completes successfully, the message is acknowledged after the task's results are uploaded to the storage and the tasks it created are submitted. +If the queue service loses connection with the processing agent, the message is redelivered, ensuring task execution even in the presence of errors. + +Additionally, tasks that are pending can be released when a long-running task is in progress. +ArmoniK initiates the processing of a new message only when the current task has been dispatched to a worker, starting the acquisition of a new task during the processing of the previous one. +This design keeps a task ready for execution as soon as the previous task ends. 
+Tasks undergoing retry are treated as entirely new tasks, which simplifies tracking and execution, generating new messages accordingly. + +Message uniqueness is not required as it is managed elsewhere. +Furthermore, message scheduling must be handled within the queue service to accommodate prioritization mechanisms. +The system should also allow for the seamless integration of new plugins to enhance flexibility and adaptability. + +## Possible Approaches for the Interface + +Several approaches can be considered for implementing the message processing interface, including event-driven and pull-based mechanisms. Given the need to control pipelining—managing the number of concurrently processed messages and determining when processing begins—both approaches are functionally equivalent. In practice, polling would be used on an internal queue to regulate message processing. + +Below is an overview of how major market solutions implement message retrieval mechanisms: + +| Solution | Approche | +| ------------- | -------- | +| AWS SQS | pull | +| ActiveMQ | both | +| RabbitMQ | both | +| Google PubSub | both | +| Pulsar | both | ## Trade-offs and Constraints @@ -33,6 +60,128 @@ The technical limitations of the underlying queueing service also play a role in We can consider converting a push-based reception to messages into our pull-based interfaces. However, this implementation requires to be as stable as the current one. Pull requests are welcome to improve the implementation of our queue system adaptors. +## Current Interface + +Queue interfaces can be found in our dotnet package [ArmoniK.Core](https://www.nuget.org/packages/ArmoniK.Core.Base). +It contains everything needed to implement queue plugins that can be dynamically loaded by ArmoniK.Core allowing users to implement plugins that match their requirements. 
+ +```csharp +/// +/// Interface to retrieve messages from the queue +/// +public interface IPullQueueStorage : IQueueStorage +{ + /// + /// Gets messages from the queue + /// + /// Number of messages to retrieve + /// Token used to cancel the execution of the method + /// + /// Enumerator allowing async iteration over the message queue + /// + IAsyncEnumerable PullMessagesAsync(int nbMessages, + CancellationToken cancellationToken = default); +} +``` + +```csharp +/// +/// Interface to handle queue messages lifecycle. +/// +public interface IQueueMessageHandler : IAsyncDisposable +{ + /// + /// Used to signal that the message ownership has been lost + /// + [Obsolete("ArmoniK now manages loss of link with the queue")] + CancellationToken CancellationToken { get; set; } + + /// + /// Id of the message + /// + string MessageId { get; } + + /// + /// Task Id contained in the message + /// + string TaskId { get; } + + /// + /// Status of the message. Used when the handler is disposed to notify the queue. + /// + QueueMessageStatus Status { get; set; } + + /// + /// Date of reception of the message + /// + DateTime ReceptionDateTime { get; init; } +} +``` + +```csharp +/// +/// Represents the status of a queue message +/// +public enum QueueMessageStatus +{ + /// + /// Message is waiting for being processed. + /// + Waiting, + + /// + /// Message processing has failed. The message should be put back at the begin of the queue. + /// + Failed, + + /// + /// The message is being processed. + /// + Running, + + /// + /// Task is not ready to be processed. The message should be put at the end of the queue. + /// + Postponed, + + /// + /// The message has been processed. It can safely be removed from the queue. + /// + Processed, + + /// + /// The message processing has been cancelled. the message can safely be removed from the queue. 
+ /// + Cancelled, + + /// + /// Message has been retried too many times and is considered as poisonous for the queue + /// + Poisonous, +} +``` + +```csharp +/// +/// Interface to insert messages into the queue +/// +public interface IPushQueueStorage : IQueueStorage +{ + /// + /// Puts messages into the queue, handles priorities of messages + /// + /// Collection of messages + /// Id of the partition + /// Token used to cancel the execution of the method + /// + /// Task representing the asynchronous execution of the method + /// + public Task PushMessagesAsync(IEnumerable messages, + string partitionId, + CancellationToken cancellationToken = default); +} +``` + # Conclusion The architecture reflects a deliberate balance between design complexity, operational requirements, and system constraints. By combining a historical interface with strict control over message processing and lifecycle management, ArmoniK achieves a reliable and scalable system. Although the system does not adopt a fully event-driven approach, it remains robust and resilient, meeting the demands of long-running tasks and orchestration challenges. The trade-offs in this design are justified by the system’s operational stability and its ability to handle large-scale workloads efficiently. From 55fd4592092bea086d7b50d89f9fd297c8a947b5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Gurhem?= Date: Mon, 3 Feb 2025 11:45:09 +0100 Subject: [PATCH 3/4] docs: add more informations about queue interfaces working --- AEP/aep-00007.md | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/AEP/aep-00007.md b/AEP/aep-00007.md index e9afef3..253abe6 100644 --- a/AEP/aep-00007.md +++ b/AEP/aep-00007.md @@ -65,6 +65,11 @@ We can consider converting a push-based reception to messages into our pull-base Queue interfaces can be found in our dotnet package [ArmoniK.Core](https://www.nuget.org/packages/ArmoniK.Core.Base). 
It contains everything needed to implement queue plugins that can be dynamically loaded by ArmoniK.Core allowing users to implement plugins that match their requirements. +An agent from a partition calls the `PullMessagesAsync` to get messages representing tasks from the associated partition. +The partition is given outside the `IPullQueueStorage` and should be passed to the implementation through other means. +It is usually done with .Net options system and environment variables. +This also means there is no uniformization to set up the partition and it will depend on the implementation of the interface. + ```csharp /// /// Interface to retrieve messages from the queue @@ -84,6 +89,11 @@ public interface IPullQueueStorage : IQueueStorage } ``` +The `PullMessagesAsync` method returns an `IAsyncEnumerable` where the `IQueueMessageHandler` is the following interface. +It represents the lifecycle of a message. +The `QueueMessageStatus` is set by ArmoniK during the execution of the task. +Then, the `DisposeAsync` method inherited from `IAsyncDisposable` is used to process the message from the queue by acknowledging it, not acknowledging it, or requeuing it depending on the status. + ```csharp /// /// Interface to handle queue messages lifecycle. @@ -161,6 +171,12 @@ public enum QueueMessageStatus } ``` +Push interface is a lot simpler. +For a given partition, ArmoniK gives an `IEnumerable` where each `MessageData` represents a task. +It contains the task identifier, the session identifier and the options of the task. +The options has a field for the priority of the task. +The queue system and the implementation of the interfaces are then responsible to distribute tasks. 
+ ```csharp /// /// Interface to insert messages into the queue @@ -182,6 +198,24 @@ public interface IPushQueueStorage : IQueueStorage } ``` +```csharp +/// +/// Data structure to hold message data +/// +/// Unique identifier of the task +/// Unique name of the session to which this message belongs +/// Task options +public record MessageData(string TaskId, + string SessionId, + TaskOptions Options); +``` + +## Issues with the current interface + +There are a few issues from the current interfaces: +- The split of the interface into two due to the configuration. The partition should be given in the pull method directly and uniformize partition selection. +- Clarify message processing instead of relying on the `DisposeAsync` method from the `IQueueMessageHandler`. + # Conclusion The architecture reflects a deliberate balance between design complexity, operational requirements, and system constraints. By combining a historical interface with strict control over message processing and lifecycle management, ArmoniK achieves a reliable and scalable system. Although the system does not adopt a fully event-driven approach, it remains robust and resilient, meeting the demands of long-running tasks and orchestration challenges. The trade-offs in this design are justified by the system’s operational stability and its ability to handle large-scale workloads efficiently. 
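The push-side description added in this patch (each `MessageData` carries a task whose options hold a priority, and the queue implementation is responsible for distributing tasks) can be illustrated with a minimal in-memory sketch. Everything below is hypothetical: `InMemoryPartitionedQueue`, `Push` and `PullTaskIds` are illustration-only names, the `TaskOptions` and `MessageData` records are local stand-ins that model only the fields used here, the methods are synchronous for brevity, and the convention that a larger `Priority` value is more urgent is an assumption rather than documented ArmoniK behaviour.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for the types quoted in the patch; only the fields used here are modelled.
public record TaskOptions(int Priority);
public record MessageData(string TaskId, string SessionId, TaskOptions Options);

// Hypothetical in-memory store: one priority queue per partition.
public sealed class InMemoryPartitionedQueue
{
    private readonly Dictionary<string, PriorityQueue<MessageData, int>> queues_ = new();

    // Inserts the messages into the queue of the given partition.
    public void Push(IEnumerable<MessageData> messages, string partitionId)
    {
        if (!queues_.TryGetValue(partitionId, out var queue))
        {
            queue = new PriorityQueue<MessageData, int>();
            queues_[partitionId] = queue;
        }

        foreach (var message in messages)
        {
            // PriorityQueue dequeues the smallest value first, so the priority is negated.
            // Assumption: a larger Priority means a more urgent task.
            queue.Enqueue(message, -message.Options.Priority);
        }
    }

    // Returns up to nbMessages task ids from the partition, most urgent first.
    public List<string> PullTaskIds(string partitionId, int nbMessages)
    {
        var pulled = new List<string>();
        if (!queues_.TryGetValue(partitionId, out var queue))
        {
            return pulled;
        }

        while (pulled.Count < nbMessages && queue.TryDequeue(out var message, out _))
        {
            pulled.Add(message.TaskId);
        }

        return pulled;
    }
}
```

In this sketch the partition is passed explicitly to the pull side, which is the direction suggested under "Issues with the current interface"; the current `IPullQueueStorage` instead receives it through configuration.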
From ed8dd339e4aa87be96c2ed42127352fe18f9a6fa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Gurhem?= Date: Tue, 4 Feb 2025 10:35:02 +0100 Subject: [PATCH 4/4] Take comments into account --- AEP/aep-00007.md | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/AEP/aep-00007.md b/AEP/aep-00007.md index 253abe6..73ef3b6 100644 --- a/AEP/aep-00007.md +++ b/AEP/aep-00007.md @@ -40,11 +40,11 @@ The system should also allow for the seamless integration of new plugins to enha ## Possible Approaches for the Interface -Several approaches can be considered for implementing the message processing interface, including event-driven and pull-based mechanisms. Given the need to control pipelining—managing the number of concurrently processed messages and determining when processing begins—both approaches are functionally equivalent. In practice, polling would be used on an internal queue to regulate message processing. +Several approaches can be considered to implement the message processing interface, including event-driven and pull-based mechanisms. Given the need to control pipelining in order to manage the number of concurrently processed messages and determining when processing begins, both approaches are functionally equivalent. In practice, polling would be used on an internal queue to regulate message processing. Below is an overview of how major market solutions implement message retrieval mechanisms: -| Solution | Approche | +| Solution | Approach | | ------------- | -------- | | AWS SQS | pull | | ActiveMQ | both | @@ -54,20 +54,20 @@ Below is an overview of how major market solutions implement message retrieval m ## Trade-offs and Constraints -Rewriting the codebase to adopt a fully event-driven, push-based pattern is not feasible given the complexity and scale of such a transformation. The current design prioritizes simplicity and robustness over a pure event-driven model. 
Considering the long duration of tasks and the cost of orchestration, some inefficiencies during message reception are acceptable. These inefficiencies are preferable to the development overhead and technical debt involved in transforming the existing pull-based mechanism. +The current design prioritizes simplicity and robustness over a pure event-driven model. Considering the long duration of tasks and the cost of orchestration, some inefficiencies during message reception are acceptable. These inefficiencies are preferable to the development overhead and technical debt involved in transforming the existing pull-based mechanism. Rewriting the codebase to adopt a fully event-driven, push-based pattern is not feasible given the complexity and scale of such a transformation. -The technical limitations of the underlying queueing service also play a role in the design decisions. SQS, for instance, does not provide an API for push-based message reception, which necessitates the use of a pull-based approach. Additionally, past attempts to use RabbitMQ for push-based message handling revealed significant stability issues, such as connection losses and inconsistent message processing. These challenges have further reinforced the decision to rely on the pull-based model, which offers greater reliability and predictability. +The technical limitations of the underlying queueing service also play an important role in the design decisions. SQS, for instance, does not provide an API for push-based message reception, which necessitates the use of a pull-based approach. Additionally, past attempts to use RabbitMQ for push-based message handling revealed significant stability issues, such as connection losses and inconsistent message processing. These challenges have further reinforced the decision to rely on the pull-based model, which offers greater reliability and predictability. -We can consider converting a push-based reception to messages into our pull-based interfaces. 
However, this implementation requires to be as stable as the current one. Pull requests are welcome to improve the implementation of our queue system adaptors. +We can consider adapting push-based message reception to our pull-based interfaces. Naturally, such an implementation must be proven as stable and as scalable as the current one. Pull requests are welcome to improve the implementation of our queue system adaptors. ## Current Interface Queue interfaces can be found in our dotnet package [ArmoniK.Core](https://www.nuget.org/packages/ArmoniK.Core.Base). -It contains everything needed to implement queue plugins that can be dynamically loaded by ArmoniK.Core allowing users to implement plugins that match their requirements. +The package provides the elements necessary to implement queue plugins that can be dynamically loaded by ArmoniK.Core, allowing users to implement plugins that match their requirements. -An agent from a partition calls the `PullMessagesAsync` to get messages representing tasks from the associated partition. -The partition is given outside the `IPullQueueStorage` and should be passed to the implementation through other means. -It is usually done with .Net options system and environment variables. +An agent from a partition calls the `PullMessagesAsync` method to get messages representing tasks from the associated partition. +The partition is given outside of the `IPullQueueStorage` interface and should be passed to the implementation through other means. +It is usually done with the .NET options system and environment variables. This also means there is no uniformization to set up the partition and it will depend on the implementation of the interface. ```csharp @@ -89,10 +89,7 @@ public interface IPullQueueStorage : IQueueStorage } ``` -The `PullMessagesAsync` method returns an `IAsyncEnumerable` where the `IQueueMessageHandler` is the following interface. -It represents the lifecycle of a message. 
-The `QueueMessageStatus` is set by ArmoniK during the execution of the task. -Then, the `DisposeAsync` method inherited from `IAsyncDisposable` is used to process the message from the queue by acknowledging it, not acknowledging it, or requeuing it depending on the status. +The `PullMessagesAsync` method returns an `IAsyncEnumerable<IQueueMessageHandler>`, where `IQueueMessageHandler` is an interface representing the lifecycle of a message. ```csharp /// @@ -128,6 +125,10 @@ public interface IQueueMessageHandler : IAsyncDisposable } ``` +The `QueueMessageStatus` is set by ArmoniK during the execution of the task. +Then, the `DisposeAsync` method inherited from `IAsyncDisposable` is used to process the message from the queue by acknowledging it, not acknowledging it, or requeuing it depending on the status. +The statuses are defined as follows: + ```csharp /// /// Represents the status of a queue message @@ -171,7 +172,7 @@ public enum QueueMessageStatus } ``` -Push interface is a lot simpler. +The interface to insert tasks into the queue is simpler. For a given partition, ArmoniK gives an `IEnumerable` where each `MessageData` represents a task. It contains the task identifier, the session identifier and the options of the task. The options has a field for the priority of the task. @@ -216,6 +217,8 @@ There are a few issues from the current interfaces: - The split of the interface into two due to the configuration. The partition should be given in the pull method directly and uniformize partition selection. - Clarify message processing instead of relying on the `DisposeAsync` method from the `IQueueMessageHandler`. +These issues will be addressed in a new version of these interfaces. + # Conclusion The architecture reflects a deliberate balance between design complexity, operational requirements, and system constraints. By combining a historical interface with strict control over message processing and lifecycle management, ArmoniK achieves a reliable and scalable system. 
Although the system does not adopt a fully event-driven approach, it remains robust and resilient, meeting the demands of long-running tasks and orchestration challenges. The trade-offs in this design are justified by the system’s operational stability and its ability to handle large-scale workloads efficiently.
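The message lifecycle described in this series (the status is set while the task runs, then `DisposeAsync` acknowledges, does not acknowledge, or requeues the message depending on that status) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ArmoniK implementation: `InMemoryMessageHandler` and `ActionFor` are hypothetical names, the enum is a local copy of the one quoted in the patch, the handler only records the action it would take, and the status-to-action mapping is inferred from the enum comments (in particular, treating `Poisonous` as removable is an assumption).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Local copy of the enum quoted in the patch; not the ArmoniK.Core.Base type.
public enum QueueMessageStatus
{
    Waiting,
    Failed,
    Running,
    Postponed,
    Processed,
    Cancelled,
    Poisonous,
}

// Hypothetical handler that records which queue action its disposal would trigger.
public sealed class InMemoryMessageHandler : IAsyncDisposable
{
    private readonly List<string> log_;

    public InMemoryMessageHandler(string messageId, string taskId, List<string> log)
    {
        MessageId = messageId;
        TaskId = taskId;
        log_ = log;
    }

    public string MessageId { get; }
    public string TaskId { get; }

    // Set by the caller while the task is being processed.
    public QueueMessageStatus Status { get; set; } = QueueMessageStatus.Waiting;

    public DateTime ReceptionDateTime { get; init; } = DateTime.UtcNow;

    // Assumed mapping from status to queue action, inferred from the enum comments.
    public static string ActionFor(QueueMessageStatus status) => status switch
    {
        QueueMessageStatus.Processed
            or QueueMessageStatus.Cancelled
            or QueueMessageStatus.Poisonous => "ack",          // safe to remove from the queue
        QueueMessageStatus.Postponed => "requeue-at-end",      // task not ready yet
        _ => "nack",                                           // Waiting, Failed, Running: redeliver
    };

    // The queue is notified here, when the handler is disposed.
    public ValueTask DisposeAsync()
    {
        log_.Add($"{MessageId}:{ActionFor(Status)}");
        return ValueTask.CompletedTask;
    }
}
```

A caller would typically wrap the handler in `await using`, set `Status` once the task outcome is known, and let disposal notify the queue. Because the notification is tied to disposal, a handler that is dropped early still reports its last status, which is the behaviour the "Issues with the current interface" section proposes to make more explicit.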