-
Notifications
You must be signed in to change notification settings - Fork 0
feat: add extension spec for task processing #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
specs/TaskProcessingExt.tla
Outdated
| TASK_CANCELED, | ||
| TASK_PAUSED | ||
| }] | ||
| /\ nextAttemptOf \in [TaskId -> TaskId \union {NULL}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does TLA+ support partial functions? If so, wouldn't it be better to make nextAttemptOf a partial function? This would avoid having to define a NULL constant.
Definitions
- Partial Function: A function ( f: X \to Y ) is partial if for some elements ( x \in X ), ( f(x) ) is undefined. In other words, ( f ) does not necessarily have an output for every input in its domain ( X ).
- Total Function: A function ( f: X \to Y ) is total if for every ( x \in X ), ( f(x) ) is defined. That is, ( f ) assigns exactly one output in ( Y ) to every input in ( X ).
specs/TaskProcessingExt.tla
Outdated
| (** | ||
| * RETRYABLE TASK REGISTRATION | ||
| * A set of tasks 'T' that have not yet been retried are cloned by a set of | ||
| * tasks 'U' (each task in 'T' being associated with a single task in 'U' by | ||
| * the bijection 'f') which are registered to allow the re-execution of the same | ||
| * computations as those attempted by the tasks of 'T'. | ||
| *) | ||
| \* Remove atomic double update | ||
| RetryTasks(T, U) == | ||
| LET | ||
| f == CHOOSE x \in Bijection(T, U) : TRUE | ||
| IN | ||
| /\ T /= {} /\ T \subseteq UnretriedTask /\ U \subseteq UnknownTask | ||
| /\ Cardinality(T) = Cardinality(U) | ||
| /\ taskState' = | ||
| [u \in TaskId |-> IF u \in U THEN TASK_REGISTERED ELSE taskState[u]] | ||
| /\ nextAttemptOf' = | ||
| [t \in TaskId |-> IF t \in T THEN f[t] ELSE nextAttemptOf[t]] | ||
| /\ UNCHANGED agentTaskAlloc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This action describes an atomic transition of the system in which:
- a new task is registered in the system
- this newly created task is recorded as a new attempt of a task that previously failed.
In the current implementation, these two operations are implemented by two separate requests on two separate documents (in MongoDB). Performing these two operations in a single atomic operation would require a transaction that is not necessarily supported by database systems. It therefore seems more reasonable to split this action into two actions: one for each step.
Proposed corrections:
The action must be “split” into two. There are two possible options depending on the order in which the two operations are performed. Since the RegisterTasks action already allows new tasks to be registered, this operator simply needs to be modified to implement step (2). A fairness condition must also be added to ensure that both actions always take place (if the first occurs, the second must also occur). The order determines which operator the fairness condition will apply to (RegisterTasks or RetryTasks).
A priori, there does not seem to be one order that is more relevant than another.
Whichever option is chosen, it will lead to a modification of the action constraint (ActionConstraint) used during model checking to avoid improper violations of liveness properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, I have the feeling that RegisterTasks should be done before RetryTasks as you would need to know about the task to mark nextAttempt ?
I have the feeling that this would be required to ensure that the data can be produced.
Shouldn't the status of t be changed after this transition ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the order you're right. It seems more natural to first register the task before updating nextAttemptOf.
For the status update it is done by another action (a FinalizeTasks action) and the fairness condition on that action guarantees that it will eventually happen.
| * valid. | ||
| *) | ||
| \* registered task can also be canceled | ||
| CancelTasks(T) == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This action presents two problems:
- It prohibits the cancellation of a task in the REGISTERED state. However, this may well be requested by the user. It is therefore necessary to allow such a transition in the system.
- In the current implementation, canceling a task consists of a database query to record the cancellation request; then, if the task is being processed, the agent responsible for the task is notified to interrupt its processing. The proposed modeling confuses these two steps by describing a different process depending on whether the task is assigned to an agent or not. This must therefore be changed.
Proposed correction:
- Modify the enabling condition of the action.
- Modify this action so that it describes the first step (record the cancellation request), then modify the ProcessTasks action so that it describes the second step (interrupting task processing when the agent receives the notification).
Note: For correction (2), two options are considered:
- In the first step, all tasks are marked as CANCELING, then they continue the normal processing chain and change to CANCELED when processed by an agent.
- In the first step, tasks that are not currently being processed are marked as CANCELED, while those currently being processed are marked as CANCELING and then become CANCELED by the ProcessTasks action later.
This choice is important because it will determine how cascading task cancellation in graphs will be described in future refinements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further consideration, it seems inappropriate to use a ‘CANCELING’ status to mark tasks for which cancellation has been requested, as this status would overwrite information about where the task is in the processing workflow. A more appropriate modeling choice would therefore be to introduce a new state variable that would indicate for each task whether cancellation has been requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by commit: 9ea740d9730ab64ba2ac5dfbc1c81a59843e55f9
According to the above, the cancelRequested state variable indicates for each task whether cancellation has been requested. The CancelTasks action updates this variable and changes the status of registered, paused, and some staged tasks to CANCELED. It is reasonable to assume that some staged tasks cannot be canceled immediately, as per the current implementation: a staged task is associated with a message in the queue, and cancellation is only realized when the task is processed by an agent. Hence this design choice.
Next, a new action, AcknowledgeTasksCancellation, models the effective cancellation of one or more tasks by an agent. A subtle case then arises: an assigned task whose cancellation is requested may be released, in which case it will be reacquired later and must be canceled at that time. Indeed, upon reacquisition, the agent will see the cancellation request and act accordingly. To model this behavior, the processing of a task by an agent was detailed by introducing the STARTED state, designating a task whose execution has begun. For the previously described case, the cancellation operation is therefore performed when the task is assigned but not yet started.
To be complete, the AcknowledgeTasksCancellation action can occur either:
- Before a task begins, if it was assigned after the cancellation request, or
During task execution, if the cancellation was requested after the task started execution.
specs/TaskProcessingExt.tla
Outdated
| (** | ||
| * TASK PAUSING | ||
| * A set 'T' of tasks is paused, meaning that the execution of the tasks is | ||
| * postponed (until they are resumed). This action only applies to staged tasks. | ||
| *) | ||
| \* assigned tasks can be paused | ||
| PauseTasks(T) == | ||
| /\ T /= {} /\ T \subseteq StagedTask | ||
| /\ taskState' = | ||
| [t \in TaskId |-> IF t \in T THEN TASK_PAUSED ELSE taskState[t]] | ||
| /\ UNCHANGED << agentTaskAlloc, nextAttemptOf >> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far the ASSIGNED status remains somewhat vague. Assigning a task to an agent does not necessarily mean that it is currently being executed. In fact, the task may be in pre/post-processing, for example. If it is not possible to pause a task once it has started running, it is possible to pause a task in pre-processing by an agent. An ASSIGNED task can therefore be paused.
Proposed correction: Given that the ASSIGNED status is vague (abstract) in this specification, it is sufficient to modify the enabling condition fo the action to allow pausing. Future refinements will need to detail the conditions for pausing an assigned task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It ultimately appears that simply modifying the enabling condition is not sufficient. Indeed, as in the case of cancellation, pausing is done in two steps: request and then acknowledgment. The modeling must therefore be based on what has been done for task cancellations. In addition, introducing the STARTED status removes the ambiguity mentioned above. A task can only be paused if its execution has not started, which is indicated by the STARTED status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by commit: 98d5a32de696cf53c7a79516cd1866fbf4f7364c
As for cancellation, the pausingRequested state variable indicates for each task whether pausing has been requested. The PauseTasks action updates this variable and changes the status of some staged tasks to PAUSED (for the same Reason s for cancellation, not all staged tasks are paused immediatly). The reamining tasks are then paused by an agent through the AcknowledgeTasksPausing action. This action can also pause tasks for which the pausing request arrived when the agent wa already assigned to the agent but not yet started.
The combination of pause and resume actions can lead to undesirable behavior. To overcome this problem, two constraints have been added:
- A canceled task cannot be paused (this seems fairly natural).
- A user cannot pause/resume a task an infinite number of times (this constraint/modeling assumption is described by the
Livenessoperator).
specs/TaskProcessingExt.tla
Outdated
| \union AbortedTask | ||
| \union CanceledTask | ||
| \* Is this condition really needed | ||
| /\ UnretriedTask = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this conjunct really needed as the fairness conditions should prohibit behaviors where the system stutters on a state where certain tasks have not been retried.
| *) | ||
| TypeInv == | ||
| /\ agentTaskAlloc \in [AgentId -> SUBSET TaskId] | ||
| /\ taskState \in [TaskId -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment explaining the meaning of all these statuses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These states are defined and documented in TaskStates.tla.
| /\ nextAttemptOf = [t \in TaskId |-> NULL] | ||
|
|
||
| (** | ||
| * TASK STAGING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * TASK STAGING | |
| * TASK REGISTRATION |
specs/TaskProcessingExt.tla
Outdated
| (** | ||
| * RETRYABLE TASK REGISTRATION | ||
| * A set of tasks 'T' that have not yet been retried are cloned by a set of | ||
| * tasks 'U' (each task in 'T' being associated with a single task in 'U' by | ||
| * the bijection 'f') which are registered to allow the re-execution of the same | ||
| * computations as those attempted by the tasks of 'T'. | ||
| *) | ||
| \* Remove atomic double update | ||
| RetryTasks(T, U) == | ||
| LET | ||
| f == CHOOSE x \in Bijection(T, U) : TRUE | ||
| IN | ||
| /\ T /= {} /\ T \subseteq UnretriedTask /\ U \subseteq UnknownTask | ||
| /\ Cardinality(T) = Cardinality(U) | ||
| /\ taskState' = | ||
| [u \in TaskId |-> IF u \in U THEN TASK_REGISTERED ELSE taskState[u]] | ||
| /\ nextAttemptOf' = | ||
| [t \in TaskId |-> IF t \in T THEN f[t] ELSE nextAttemptOf[t]] | ||
| /\ UNCHANGED agentTaskAlloc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, I have the feeling that RegisterTasks should be done before RetryTasks as you would need to know about the task to mark nextAttempt ?
I have the feeling that this would be required to ensure that the data can be produced.
Shouldn't the status of t be changed after this transition ?
specs/TaskProcessingExt.tla
Outdated
| /\ T /= {} /\ T \subseteq UnretriedTask /\ U \subseteq UnknownTask | ||
| /\ Cardinality(T) = Cardinality(U) | ||
| /\ taskState' = | ||
| [u \in TaskId |-> IF u \in U THEN TASK_REGISTERED ELSE taskState[u]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is'nt the following missing ?
There is no v in U such as nextAttemptOf[v]=t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link between T and U is defined by bijection f define above.
specs/TaskProcessingExt.tla
Outdated
| (** | ||
| * LIVENESS | ||
| * A failed task is eventually retried. | ||
| *) | ||
| FailedTaskEventualRetry == | ||
| \A t \in TaskId: | ||
| t \in UnretriedTask ~> | ||
| nextAttemptOf[t] \in StagedTask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A task represents an execution instance of a computation. The information contained in a task makes it possible to identify the computation as well as the state of its execution. In the TaskProcessing specification, a task is abstracted by a simple ID.
The notion of retry for a task means that a computation that has failed must be able to be relaunched if necessary (for example, if the data produced by the computation has not been produced elsewhere). The variable nextAttemptOf materializes a link between two tasks. However, it is not this link itself that is important, but rather the fact that the second task is a new instance of the same computation, which can be relaunched following the failure of the first task.
In practice, this link does exist in the current implementation (a field is dedicated to it in task documents in the database), but it is used for observability purposes (monitoring and debugging).
In summary, the underlying concept illustrated by nextAttemptOf is the idea that a second instance of a failed computation must be created and become ready to be executed (that is, be in the STAGED state).
Proposed Correction:
To clarify the above, we can replace the variable nextAttemptOf with taskComputation, which indicates, for each task, the computation of which it is an instance (modeled by an abstract ID). The previous property would then become:
FailedTaskEventualRetry ==
\A t \in TaskId:
t \in FailedTask ~>
\E u \in TaskId :
u \in StagedTask /\ taskComputation[u] = taskComputation[t]This new version of the property better expresses what is expected from the system, without making assumptions about how it is implemented.
| * valid. | ||
| *) | ||
| \* registered task can also be canceled | ||
| CancelTasks(T) == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further consideration, it seems inappropriate to use a ‘CANCELING’ status to mark tasks for which cancellation has been requested, as this status would overwrite information about where the task is in the processing workflow. A more appropriate modeling choice would therefore be to introduce a new state variable that would indicate for each task whether cancellation has been requested.
485bbac to
9c684d6
Compare
9c684d6 to
9ea740d
Compare
specs/TaskProcessingExt.tla
Outdated
| \/ AssignTasks(a, T) | ||
| \/ StartTasks(a, T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those transitions (Assign and Start) should be disabled if cancellation or pause is requested.
Instead, cancellation acknowledgment should be a transition from Assigned or Staged when the cancellation is requested.
specs/TaskProcessingExt.tla
Outdated
| \/ AcknowledgeTasksCancellation(a, T) | ||
| \/ AcknowledgeTasksPausing(a, T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of CancelTask -> AcknowledgeCancellation, it seems to be more consistent with your state name to have RequestTaskCancellation -> CancelTask
specs/TaskProcessingExt.tla
Outdated
| /\ \E ExemptStagedTasks \in SUBSET (T \intersect StagedTask): | ||
| taskState' = | ||
| [t \in TaskId |-> IF t \in T /\ t \notin (AssignedTask \union ExemptStagedTasks) | ||
| THEN TASK_CANCELED | ||
| ELSE taskState[t]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed. We do not need to force atomicity between the request, and the state change.
| /\ cancelRequested \in [TaskId -> BOOLEAN] | ||
| /\ pausingRequested \in [TaskId -> BOOLEAN] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those variables could be sets instead of functions.
specs/TaskProcessingExt.tla
Outdated
| *) | ||
| CancelTasks(T) == | ||
| /\ T /= {} | ||
| /\ T \subseteq UNION {RegisteredTask, StagedTask, PausedTask, AssignedTask} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this constraint and deal with other status when the cancellation is actually performed.
64f428d to
0409700
Compare
0409700 to
ea9b0fd
Compare
This PR introduces TaskProcessingExt, a refined TLA+ specification that extends the existing TaskProcessing spec. The new specification provides a more detailed task lifecycle, explicitly modeling: