Skip to content

Txm traces#521

Merged
GabrielMartinezRodriguez merged 6 commits intomasterfrom
gabriel/txm-traces
Apr 13, 2025
Merged

Txm traces#521
GabrielMartinezRodriguez merged 6 commits intomasterfrom
gabriel/txm-traces

Conversation

@GabrielMartinezRodriguez
Copy link
Contributor

@GabrielMartinezRodriguez GabrielMartinezRodriguez commented Mar 13, 2025

Description

This PR includes a proof of concept on how to implement traces in the transaction manager. Traces are useful because they allow us to answer questions such as:

What happened in block X that caused the absence of a drand value in that block?
Why is a transaction in the "interrupted" status?
Why are we not including transactions fast enough to reveal randomness?
Metrics, combined with alerts, provide a generic overview that helps us understand if everything is working correctly, while traces offer more concrete data to understand why something is happening. This allows us to debug and fix production issues much faster.

To simplify the implementation, I created a special decorator that allows us to propagate traces without having to use the startActiveSpan method. This method is somewhat clunky because it requires implementing the span code inside the span itself to propagate the async local storage, which is how OpenTelemetry establishes the hierarchical relationship between multiple spans.

Since we are using OpenTelemetry, we can leverage this feature:
Grafana Exemplars

This feature allows us to correlate traces with metrics. For example, if we notice in Grafana that some transactions are taking longer than expected, we can directly jump from Grafana to the specific trace where the delay occurred and quickly understand the root cause.

To visualize the traces, I deployed Tempo locally. It works well and is natively integrated with Grafana, making it the best option for hosting traces.

  • Include all relevant context (but no need to repeat the issue's content).
  • Draw attention to new, noteworthy & unintuitive elements.
Toggle Checklist

Checklist

Basics

  • B1. I have applied the proper label & proper branch name (e.g. norswap/build-system-caching).
  • B2. This PR is not so big that it should be split & addresses only one concern.
  • B3. The PR targets the lowest branch it can (ideally master).

Reminder: PR review guidelines

Correctness

  • C1. Builds and passes tests.
  • C2. The code is properly parameterized & compatible with different environments (e.g. local,
    testnet, mainnet, standalone wallet, ...).
  • C3. I have manually tested my changes & connected features.

< INDICATE BROWSER, DEMO APP & OTHER ENV DETAILS USED FOR TESTING HERE >

< INDICATE TESTED SCENARIOS (USER INTERFACE INTERACTION, CODE FLOWS) HERE >

  • C4. I have performed a thorough self-review of my code after submitting the PR,
    and have updated the code & comments accordingly.

Architecture & Documentation

  • D1. I made it easy to reason locally about the code, by (1) using proper abstraction boundaries,
    (2) commenting these boundaries correctly, (3) adding inline comments for context when needed.
  • D2. All public-facing APIs & meaningful (non-local) internal APIs are properly documented in code
    comments.
  • D3. If appropriate, the general architecture of the code is documented in a code comment or
    in a Markdown document.
  • D4. An appropriate Changeset has been generated (and committed) for changes that touch npm published packages (currently pacakges/core and packages/react), see here for more info.

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Mar 13, 2025

Deploying happychain with  Cloudflare Pages  Cloudflare Pages

Latest commit: b9091a6
Status: ✅  Deploy successful!
Preview URL: https://6b2f72e5.happychain.pages.dev
Branch Preview URL: https://gabriel-txm-traces.happychain.pages.dev

View logs

@GabrielMartinezRodriguez GabrielMartinezRodriguez marked this pull request as ready for review March 13, 2025 13:53
@GabrielMartinezRodriguez GabrielMartinezRodriguez changed the title feat(txm): traces PoC: Txm traces Mar 13, 2025
@GabrielMartinezRodriguez GabrielMartinezRodriguez added the no-merge For showcase, not to be merged label Mar 13, 2025
Copy link
Collaborator

@norswap norswap Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with opentelemetry at all, but what's the purposes of spans vs events. Is it useful to group events emitted in a method in spans (or maybe we have to?) — vs just having the events in a single top-level span?

I think I remember that spans could be used for stuff that happens on different services (different proceses or servers), where it would make sense to have one span per service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A span inside a trace represents a process within the trace. Each span can have attributes and events. I think the right approach is to have a span for every method, because it's the easiest way, and it allows us to clearly see the stack trace followed by a transaction.

Every horizontal line is a stack, and you can click on it to view its events

image

Copy link
Collaborator

@norswap norswap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems great, left a question!

@norswap norswap added the question Something has to be cleared up after review label Mar 14, 2025
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/txm-telemetry branch 2 times, most recently from ddfe09e to cf5bf43 Compare March 17, 2025 16:03
This was referenced Mar 26, 2025
@GabrielMartinezRodriguez GabrielMartinezRodriguez changed the base branch from gabriel/txm-telemetry to gabriel/fix-block-undefined March 31, 2025 11:55
@GabrielMartinezRodriguez GabrielMartinezRodriguez changed the base branch from gabriel/fix-block-undefined to gabriel/fix-returned-nonce-order April 1, 2025 10:01
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/fix-returned-nonce-order branch from fc2dc4f to 161217c Compare April 10, 2025 08:38
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/fix-returned-nonce-order branch from 161217c to be24f08 Compare April 10, 2025 08:51
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/fix-returned-nonce-order branch from be24f08 to 81ed24a Compare April 10, 2025 09:13
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/fix-returned-nonce-order branch from 81ed24a to 694721d Compare April 10, 2025 09:56
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/fix-returned-nonce-order branch from 694721d to 85f1522 Compare April 10, 2025 11:51
@GabrielMartinezRodriguez GabrielMartinezRodriguez force-pushed the gabriel/fix-returned-nonce-order branch from 85f1522 to fe19633 Compare April 10, 2025 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-blocked Ready to merge, waiting for downstack

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants