
@joeyzhao2018 joeyzhao2018 commented Sep 23, 2025

What does this PR do?

The customer is seeing multiple Lambda invocations grouped into a single trace even though each invocation carries a different trace context in its payload. This indicates that rootTraceContext was being reused/cached unintentionally.
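As a rough illustration of the suspected mechanism (all names here, e.g. `TraceContextService`, are hypothetical sketches, not the actual datadog-lambda-js code), a module-level trace context that is never cleared survives warm starts, so a failed extraction inherits the previous invocation's trace:

```typescript
interface TraceContext {
  traceId: string;
  parentId: string;
}

// Hypothetical sketch: module/class-level state persists across warm
// Lambda invocations, which is what makes the stale-context bug possible.
class TraceContextService {
  rootTraceContext: TraceContext | null = null;

  extract(event: Record<string, unknown>): void {
    const headers = event.headers as Record<string, string> | undefined;
    const traceId = headers?.["x-datadog-trace-id"];
    const parentId = headers?.["x-datadog-parent-id"];
    if (traceId && parentId) {
      this.rootTraceContext = { traceId, parentId };
    }
    // Bug: when extraction fails, rootTraceContext silently keeps the
    // value from the previous invocation instead of being cleared.
  }

  // The fix this PR describes: reset the cached context at the start of
  // every invocation so a failed extraction cannot reuse the old trace.
  reset(): void {
    this.rootTraceContext = null;
  }
}
```

With the `reset()` call in place, an invocation whose payload has no trace headers correctly ends up with no root trace context rather than the previous invocation's.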

Motivation

https://datadoghq.atlassian.net/browse/APMS-17080

Testing Guidelines

Additional Notes

Types of Changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)

@joeyzhao2018 joeyzhao2018 requested review from a team as code owners September 23, 2025 16:49
@joeyzhao2018 joeyzhao2018 changed the title reset tracecontext to avoid unintentional caching fix: reset tracecontext to avoid unintentional caching Sep 23, 2025
@lucaspimentel
Member

@codex review

Codex Review: Didn't find any major issues. More of your lovely PRs please.


joeyzhao2018 commented Sep 25, 2025

[HOLD] Not merging yet: if the fix is correct, this points to a broader issue, so we should reproduce it first.
I asked AI what kinds of payloads can cause this; the answer is as follows:

The question is about when rootTraceContext gets reused after extraction failure. Looking at the recent commit, there was a bug where rootTraceContext wasn't being reset between invocations, causing unintentional caching.

Based on the extraction flow, rootTraceContext would be reused (before the recent fix) for event payloads that reach the X-Ray fallback extraction:

1. Plain object events without recognized structure: events that don't match any of the supported event types (HTTP, SNS, SQS, Kinesis, EventBridge, etc.) and have no custom extractor configured.
2. Events where all extraction methods fail, for example:
   - HTTP events with malformed or missing trace headers
   - SNS/SQS events without proper trace context in message attributes
   - Kinesis events with corrupted or missing trace data
   - Events where the Lambda context itself has no X-Ray trace information
3. Malformed events: events that are not objects, or have unexpected structure that causes extractors to return null.

Example event payloads that would reach X-Ray extraction:

```js
// Plain custom event with no trace headers
{ "customData": "value", "timestamp": 123456789 }

// HTTP event with missing/invalid trace headers
{ "headers": {}, "body": "..." }

// SNS event with no trace context in message attributes
{ "Records": [{ "Sns": { "Message": "...", "MessageAttributes": {} } }] }
```

Trying these now.
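The leak described in the scenarios above can be sketched as a simplified extraction flow (function and variable names are hypothetical, not the actual library internals; the real fallback chain covers many more event types):

```typescript
type TraceContext = { traceId: string; parentId: string };

// Hypothetical module-level cache that survives warm starts.
let rootTraceContext: TraceContext | null = null;

function extractFromHeaders(event: any): TraceContext | null {
  const h = event?.headers;
  return h?.["x-datadog-trace-id"] && h?.["x-datadog-parent-id"]
    ? { traceId: h["x-datadog-trace-id"], parentId: h["x-datadog-parent-id"] }
    : null;
}

function extractFromXRay(): TraceContext | null {
  // Stand-in for the X-Ray fallback; assume no active segment here.
  return null;
}

function extractTraceContext(event: any): TraceContext | null {
  // Payloads from the list above fall through each step in turn.
  const fromEvent = extractFromHeaders(event);
  if (fromEvent) {
    rootTraceContext = fromEvent;
    return rootTraceContext;
  }
  const fromXRay = extractFromXRay();
  if (fromXRay) {
    rootTraceContext = fromXRay;
    return rootTraceContext;
  }
  // Without a per-invocation reset, callers see the stale context from
  // the previous invocation here, linking unrelated invocations.
  return rootTraceContext;
}
```

Calling this twice (first with trace headers, then with a plain custom event) shows the second invocation inheriting the first invocation's trace ID, which matches the one-trace-for-many-invocations symptom.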
