
@lemaitre-aneo
Contributor

Motivation

[Include the reason behind these changes and any relevant context.]

Description

[Provide a detailed explanation of the modifications you have made. Link any related issues.]

Testing

[When applicable, detail the testing you have performed to ensure that these changes function as intended. Include information about any added tests.]

Impact

[Discuss the impact of your modifications on ArmoniK. This might include effects on performance, configuration, documentation, new dependencies, or changes in behaviour.]

Additional Information

[Any additional information that reviewers should be aware of.]

Checklist

  • My code adheres to the coding and style guidelines of the project.
  • I have performed a self-review of my code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • I have thoroughly tested my modifications and added tests when necessary.
  • Tests pass locally and in the CI.
  • I have assessed the performance impact of my modifications.

lemaitre-aneo force-pushed the fl/intent-log branch 6 times, most recently from 55df31b to 111dd88 (February 23, 2025 17:31)
lemaitre-aneo force-pushed the fl/intent-log branch 3 times, most recently from a8c54f3 to 901e8f4 (March 4, 2025 21:34)
aneojgurhem added a commit that referenced this pull request Sep 19, 2025
# Motivation

We observed cases of tasks that were not properly submitted due to
interruptions or errors during submission. This PR aims to make
submission-related RPCs atomic so that, when an issue occurs, no tasks
or results are left in an inconsistent state.

# Description

A try/catch block was added around the critical parts of the submission
code: upon exception, the tasks/results created so far are deleted from
the database/object storage.
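The rollback described above follows a compensating-action pattern: remember what has been created, and delete it if anything fails before the submission completes. The following is a minimal Python sketch of that idea under stated assumptions, not ArmoniK's actual implementation. `InMemoryStore`, `submit` and the `fail_on` test hook are hypothetical names standing in for the real database, object storage and submission RPC, and the record contents are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the database or object storage."""

    items: dict = field(default_factory=dict)
    fail_on: Optional[str] = None  # test hook: raise when this key is created

    def create(self, key: str, value: dict) -> None:
        if key == self.fail_on:
            raise RuntimeError(f"simulated failure while creating {key}")
        self.items[key] = value

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op, so the rollback is idempotent.
        self.items.pop(key, None)


def submit(tasks_db: InMemoryStore, results_db: InMemoryStore,
           result_ids: list, task_ids: list) -> None:
    """Create results, then tasks; on any exception, delete what was created."""
    created_results: list = []
    created_tasks: list = []
    try:
        for result_id in result_ids:
            results_db.create(result_id, {"status": "Created"})  # placeholder record
            created_results.append(result_id)
        for task_id in task_ids:
            tasks_db.create(task_id, {"status": "Pending"})  # placeholder record
            created_tasks.append(task_id)
    except Exception:
        # Compensating rollback, in reverse order of creation. This only helps
        # if the process is still alive to run this block.
        for task_id in reversed(created_tasks):
            tasks_db.delete(task_id)
        for result_id in reversed(created_results):
            results_db.delete(result_id)
        raise  # surface the original error to the caller


# A failure on the second task rolls back everything created so far.
tasks_db, results_db = InMemoryStore(fail_on="task-2"), InMemoryStore()
try:
    submit(tasks_db, results_db, ["res-1", "res-2"], ["task-1", "task-2"])
except RuntimeError:
    pass
print(tasks_db.items, results_db.items)  # {} {}
```

The sketch shares the limitation the PR acknowledges under Additional Information: if the process dies before the `except` block runs, nothing is cleaned up.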

# Testing

Unit tests were implemented to check that tasks and results are properly
deleted.

# Impact

- Tasks left in the Pending state because of errors during submission
should now be deleted and no longer appear in monitoring.
- Errors during result creation should no longer leave partially created
results.

# Additional Information

- Rolling back tasks submitted by another task is not properly
implemented, because the owner ID of the result is overwritten without
keeping the old value. Fixing this requires a change in the data schema
to keep the old value and restore it when needed. This case should be
addressed at some point, as it is needed to properly roll back subtasks
when a task is reacquired after a crash during post-processing.
- This correction is not completely foolproof, as it assumes that the
service is able to execute the catch clause. The catch clause may not
run if the service crashes suddenly. Moreover, a loss of connection to
the data plane may prevent the code in the catch clause from running
properly. More complete solutions depend on #850.
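The schema change mentioned in the first point (keeping the old owner ID so it can be put back) might look like the following minimal Python sketch. It assumes a hypothetical `Result` record with an added `previous_owner_task_id` field and hypothetical `transfer_ownership`/`rollback_ownership` helpers. ArmoniK's actual data schema and field names may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical result record with an extra field for the previous owner."""

    result_id: str
    owner_task_id: str
    previous_owner_task_id: Optional[str] = None


def transfer_ownership(result: Result, new_owner_task_id: str) -> None:
    # Save the current owner before overwriting it, so the change can be undone.
    result.previous_owner_task_id = result.owner_task_id
    result.owner_task_id = new_owner_task_id


def rollback_ownership(result: Result) -> None:
    # Put the saved owner back; no-op if no transfer was recorded.
    if result.previous_owner_task_id is not None:
        result.owner_task_id = result.previous_owner_task_id
        result.previous_owner_task_id = None


# A parent task hands its result over to a subtask, then the submission is rolled back.
res = Result("res-1", owner_task_id="parent-task")
transfer_ownership(res, "subtask-1")
rollback_ownership(res)
print(res.owner_task_id)  # parent-task
```

This keeps only one level of history: a second transfer before a rollback would overwrite the saved owner, so a real schema might need to record the owner per submission instead.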
