Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: de-duplicate operations received in quick succession from c8y/devicecontrol/notifications #3454

Merged
merged 3 commits into from
Mar 25, 2025

Conversation

jarhodes314
Copy link
Contributor

@jarhodes314 jarhodes314 commented Mar 7, 2025

Proposed changes

Fixes the handling of in-progress operations by the Cumulocity converter so that the de-duplication mechanism is applied immediately. Since it takes a small amount of time between triggering the operation and the converter receiving the te/... topic message (which was when the operation was marked as "active" in the converter), there was a race that could lead to duplicate messages from Cumulocity both being processed by the converter.

Specifically, this PR changes the converter to mark the operation as "active" as it is initially handled by the converter. This has two advantages: firstly, this fixes the aforementioned race condition, and it additionally means that legacy custom operations based on smartrest are also de-duplicated. Since the operation doesn't have an associated te/... topic, these "active" operations expire after 12 hours. For workflow-based/built-in operations, the "active" operation will be discarded by the mapper only when it is marked as complete on the relevant te/... topic. The de-duplication will work across mapper restarts (assuming MQTT broker persistence is approriately configured), as it did before.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s. You can activate automatic signing by running just prepare-dev once)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Copy link

codecov bot commented Mar 7, 2025

Codecov Report

Attention: Patch coverage is 96.50350% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/extensions/c8y_mapper_ext/src/converter.rs 96.35% 1 Missing and 9 partials ⚠️

📢 Thoughts on this report? Let us know!

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

github-actions bot commented Mar 7, 2025

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
600 0 3 600 100 1h41m21.677650999s

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue that exists with this solution is a (admittedly likely very slow) memory leak caused by the mapper storing the operation IDs indefinitely. Ideally, we should delete these at some point where we know the operation is no longer marked as pending, though I'm not sure how to do this.

Although clearing this entry when the operation transitions to successful/failed states is good enough for majority of the cases, I understand that it is not a fool-proof solution as there is still the possibility of getting a duplicate message while these state transition messages are still in transit to the cloud (or buffered for processing either locally or on the cloud). But since the terminal state transitions usually happen after the executing transition has already happened(which changes the PENDING status of the op in the cloud), the risk is reduced even further, although not fully eliminated.

I'm in favour of risking a duplicate operation execution in such rare cases(where the duplicate is delivered even after the terminal state transition is published), compared to the risk of that slow memory leak.

@@ -190,6 +190,8 @@ pub struct CumulocityConverter {

supported_operations: SupportedOperations,
pub operation_handler: OperationHandler,

processed_ids: HashSet<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not re-use the active_commands set? Or you avoided that because the entries in that set are cleared on the terminal state transition of those operations and you didn't want these entries cleared so soon?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the doc comment for active_commands this sounds indeed as the correct place for that fix.

However, this raises an other point: why do we have this issue with duplicated commands while there is already a mechanism supposed to handle that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think active_commands is a broken solution to the problem. I think there we are de-duplicating c8y/devicecontrol/notifications, but we are only tracking an active command once we receive the relevant tedge-topic command message, which will happen a short while later. I think this is what leaves open the window for an operation to be duplicated.

Assuming I've understood correctly, that would indicate that the problem could simply be solved by moving the active_commands insertion to where I'm currently inserting to processed_ids, and deleting the processed_ids stuff?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming I've understood correctly, that would indicate that the problem could simply be solved by moving the active_commands insertion to where I'm currently inserting to processed_ids, and deleting the processed_ids stuff?

One point is sure: one should keep only a single de-duplication mechanism. What you propose makes sense: it's better to remove duplicates before any processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this issue with duplicated commands while there is already a mechanism supposed to handle that?

For the cases where the duplicate messages are delivered after a restart, it is the lack of persistence of this set. But for duplicate messages delivered while the mapper is still live, this should have been sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now modified active_commands to insert immediately post-conversion, rather than when we receive our outgoing message. As a result, I've deleted processed_ids.

@albinsuresh
Copy link
Contributor

we receive an unrecognised custom operation that's later registered

I don't see how this can happen as a device is not expected to receive an operation that it hasn't declared as a supported operation (this is my expectation from C8Y) via the operation capability registration.

@reubenmiller
Copy link
Contributor

reubenmiller commented Mar 10, 2025

we receive an unrecognised custom operation that's later registered

I don't see how this can happen as a device is not expected to receive an operation that it hasn't declared as a supported operation (this is my expectation from C8Y) via the operation capability registration.

Operation can be created via API which don't first look at which operations are supported by the device or not...so in cases of automation, it is not so uncommon for operations to be sent to devices as it is automation will assume that the an unsupported operation will be rejected by the agent (and this is deemed cheaper rather than backend service first checking each device if it supports the intended operation)...this is fairly common in large device fleets (> 200K).

@jarhodes314 jarhodes314 added the theme:c8y Theme: Cumulocity related topics label Mar 10, 2025
@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request March 12, 2025 11:24 — with GitHub Actions Inactive
Comment on lines 2531 to 2562
let original = MqttMessage::new(&Topic::new_unchecked("c8y/devicecontrol/notifications"), json!(
{"delivery":{"log":[],"time":"2025-03-05T08:49:24.986Z","status":"PENDING"},"agentId":"1916574062","creationTime":"2025-03-05T08:49:24.967Z","deviceId":"1916574062","id":"16574089","status":"PENDING","c8y_Restart":{},"description":"do something","externalSource":{"externalId":"test-device","type":"c8y_Serial"}}
).to_string());
let redelivery = MqttMessage::new(&Topic::new_unchecked("c8y/devicecontrol/notifications"), json!(
{"delivery":{"log":[{"time":"2025-03-05T08:49:24.986Z","status":"PENDING"},{"time":"2025-03-05T08:49:25.000Z","status":"SEND"},{"time":"2025-03-05T08:49:25.162Z","status":"DELIVERED"}],"time":"2025-03-05T08:49:25.707Z","status":"PENDING"},"agentId":"1916574062","creationTime":"2025-03-05T08:49:24.967Z","deviceId":"1916574062","id":"16574089","status":"PENDING","c8y_Restart":{},"description":"do something","externalSource":{"externalId":"test-device","type":"c8y_Serial"}}
).to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to make obvious that original and redelivery only differ on the delivery field`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now removed some of the extraneous fields and changed things so we send the original message twice, since the converter doesn't care about the delivery field.

}

#[tokio::test]
async fn custom_operations_are_not_deduplicated_before_registration() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggle to understand what is checked by this test and how.

CumulocityConverter.supported_operations field is patched before the first request then restored before the second. Okay, but why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests is checking what happens if we receive a custom operation that is unrecognised and later get redelivered the custom operation after it is registered with the mapper. If the converter naïvely assumes that the operation is active after we first receive it, the de-duplication mechanism will ignore the redelivery. But since we haven't yet processed this message, we should process such a redelivery. This is obviously dependent on something sending a 500 message once the operation is registered, but that could be the case for some custom operation handling service.

The patching of the supported_operations was intended as an easy way of ensuring the registered operations are made clear, since I'm not trying to test how we update supported_operations in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clearer. Might be good to add this response as a comment to the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this is fine. But seeing the detailed discussion above, it felt like this case could have been better represented as an integration test in tests.rs, where we can better simulate the dynamic custom operation registration during the execution of the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I want the opposite of that, @albinsuresh. The point of this being a unit test is I don't also want to test the operation registration logic at the same time. I've added a comment to explain what it is I'm trying to test.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine. Some minor suggestions on the tests.

Comment on lines 2576 to 2596
assert_ne!(
converter
.parse_json_custom_operation_topic(&original)
.await
.unwrap(),
vec![],
"Initial operation delivery produces outgoing message"
);
Copy link
Contributor

@albinsuresh albinsuresh Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slightly stricter check that validates at least the cmd topic would have been better than this non-empty output check. To avoid false positives like a converted error message sent to te/errors instead of the expected cmd beating this assertion.


converter.supported_operations = after_registration;

assert_ne!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, regarding a stricter check.

}

#[tokio::test]
async fn custom_operations_are_not_deduplicated_before_registration() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this is fine. But seeing the detailed discussion above, it felt like this case could have been better represented as an integration test in tests.rs, where we can better simulate the dynamic custom operation registration during the execution of the test.

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two forgotten dbg! to be removed and some questions.

@@ -1399,6 +1410,21 @@ impl CumulocityConverter {
&mut self,
message: &MqttMessage,
) -> Result<Vec<MqttMessage>, ConversionError> {
if dbg!(self.active_commands_last_cleared.elapsed(&*self.clock)) > Duration::from_secs(3600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a matter of taste: I would prefer an elapsed or elapsed_since method on the clock rather than the instant:

Suggested change
if dbg!(self.active_commands_last_cleared.elapsed(&*self.clock)) > Duration::from_secs(3600)
if self.clock.elapsed_since(&self.active_commands_last_cleared) > Duration::from_secs(3600)

PS: dbg! to be removed

didier-wenzek
didier-wenzek previously approved these changes Mar 20, 2025
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Thank you.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look much simpler now. I've gone one concern though, regarding the premature eviction from the cache, before the operation really completes.

@@ -1403,6 +1406,20 @@ impl CumulocityConverter {
&mut self,
message: &MqttMessage,
) -> Result<Vec<MqttMessage>, ConversionError> {
if self.active_commands_last_cleared.elapsed() > Duration::from_secs(3600) {
Copy link
Contributor

@albinsuresh albinsuresh Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking that you'd use the timer actor from the mapper to get timeout notification message at desired intervals. But, it was smart to place the eviction logic here.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jarhodes314 jarhodes314 force-pushed the feat/dedup-operations branch from e50c720 to 288ab5e Compare March 24, 2025 14:57
@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request March 24, 2025 14:57 — with GitHub Actions Inactive
@jarhodes314 jarhodes314 changed the title feat: de-duplicate operations from c8y/devicecontrol/notifications fix: de-duplicate operations received in quick succession from c8y/devicecontrol/notifications Mar 24, 2025
Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-confirm my approval.

@jarhodes314 jarhodes314 dismissed didier-wenzek’s stale review March 25, 2025 10:10

Redid time-based expiry logic

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

@jarhodes314 jarhodes314 added this pull request to the merge queue Mar 25, 2025
Merged via the queue into thin-edge:main with commit 3f1b13d Mar 25, 2025
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme:c8y Theme: Cumulocity related topics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants