
docs: Add telemetry documentation page #731


Open · wants to merge 14 commits into base: main


Conversation

Techassi (Member)

@Techassi self-assigned this Apr 17, 2025
@Techassi moved this to Development: In Progress in Stackable Engineering Apr 17, 2025

netlify bot commented Apr 17, 2025

Deploy Preview for stackable-docs ready!

Name Link
🔨 Latest commit 5b57be3
🔍 Latest deploy log https://app.netlify.com/sites/stackable-docs/deploys/68122e447ba79400081e01a1
😎 Deploy Preview https://deploy-preview-731--stackable-docs.netlify.app

@Techassi marked this pull request as ready for review April 24, 2025 11:39
@Techassi moved this from Development: In Progress to Development: In Review in Stackable Engineering Apr 24, 2025
@Techassi requested a review from NickLarsenNZ April 24, 2025 12:43
Member
@NickLarsenNZ left a comment

Clarified what I meant with the signals vs subscribers

Techassi and others added 2 commits April 25, 2025 10:04
Co-authored-by: Nick <[email protected]>
@Techassi requested a review from NickLarsenNZ April 25, 2025 08:15
NickLarsenNZ previously approved these changes Apr 25, 2025

Member
@NickLarsenNZ left a comment

LGTM

Comment on lines +25 to +27
*** xref:labels.adoc[Labels]
*** xref:logging.adoc[Logging]
*** xref:containerdebug.adoc[Container environment]
Member

These labels are redundant, no? They should be pulled in automatically since they're the top headings of their respective pages.

Member Author

Currently, they are.

I added them because I moved the files into the observability subfolder, but that broke links in other repos (operators for example). So I reverted that change, but kept the labels for the future when we eventually move them again (which I think we should).

@@ -0,0 +1,86 @@
= Telemetry signals in operators
Member

I'd put operator logging as a subheader in logging.adoc instead, I think? Yes, "telemetry" is technically slightly more general, but "logging" is the term people know and love already (and the current split of "logging as in products, telemetry as in operators" makes even less sense IMO).

Member Author

"logging" is the term people know and love already

I would like to avoid bundling all these signals into logging. It is more than that already, and will differ even more once we add support for metrics (or even profiling).

logging as in products, telemetry as in operators

There will be a transition period. Eventually, we also want to support / enable telemetry data for products. This will then form a complete picture, which can also be reflected in the documentation accordingly.

But for now, there are differences between operators and products which are worth pointing out.

Member

I would like to avoid bundling all these signal into logging.

Tracing is (an extended form of) logging.

Metrics and profiling are not. They (would) deserve their own sections.

Member Author

Tracing can be considered an extended form of logging, but the OpenTelemetry project specifies them as two different signals. As such, I will follow the common terminology and will not mix both together to avoid ambiguity.

@@ -0,0 +1,86 @@
= Telemetry signals in operators

Since SDP 25.7.0, all Stackable operators emit telemetry data in the following ways:
Member

Same here, I think it's clearer to reword this focusing on "logging" rather than the pretty opaque "telemetry data". Once we do offer non-log telemetry (metrics), we can have an "oh, and we also export metrics via OTLP and Prom" note at the end (or just treat metrics as entirely separate).

Member

(And yes, I'd argue that "traces" are just logs with a higher standard of correlation information, not a distinct concept.)

Member

Same here, I think it's clearer to reword this focusing on "logging" rather than the pretty opaque "telemetry data"

I disagree. I think we want to promote "observability" of the platform and move on from "Logging" being the central term. It is merely one of the telemetry signals.

And yes, I'd argue that "traces" are just logs with a higher standard of correlation information

No, they are another telemetry signal. Otherwise they'd just be called logs.

Once we do offer non-log telemetry (metrics), we can have an "oh, and we also export metrics via OTLP and Prom" note at the end (or just treat metrics as entirely separate).

Even though we don't yet have operators emitting metrics, I think we can still begin advertising "observability" capabilities. Once metrics are available, they can be added into this documentation.

Member

No, they are another telemetry signal. Otherwise they'd just be called logs.

Okay, what's the conceptual difference? When, in your view, would traditional flat logs still be preferable?

Member Author

Okay, what's the conceptual difference?

A trace is a collection of spans, which contain temporal data to record (and eventually visualize) the distributed path of a specific interaction within one or more applications.

A log is just a piece of text with no duration, no ability to be propagated across applications, and, most of the time, no structured data.

When, in your view, would traditional flat logs still be preferable?

For most things, I would consider traditional logs rather cumbersome to work with, and as such they are not preferable over more structured data like traces (and metrics). Traditional logs are mostly only useful for a quick debugging session.

Member

Logs can still be structured (https://github.com/slog-rs/slog, https://datalust.co/seq, etc.). Traces are "just" a convention for including more context. But it's not a fundamental change.

For most things, I would consider traditional logs rather cumbersome to work with and as such are not preferable over more structured data like traces (and metrics). Traditional logs are mostly only useful for a quick debugging session.

Exactly. Because it's just "Logs 2.0", not a distinct new concept.

Member
@NickLarsenNZ Apr 28, 2025

I'm not sure what your point is here, but it seems like "I can make one data structure look like something else to imply they are the same in one case", followed by coining distracting terms like "Logs 2.0", which isn't a thing.

Tracing is a conceptually different way of observing a system, no matter how it is serialized or transported. We should use the common terminology which isn't new anymore. Calling everything logging is not helpful.

IMO, structured logs are largely eliminated by tracing events.
I personally don't see value in logs that don't correlate to a span, especially when troubleshooting. But I'm also aware that people still cling to logs, hence why they are kept as an independent signal (the same content appearing in trace events, with the "structured" fields becoming span/resource attributes).

Member Author

Logs can still be structured

Sure, I'm not arguing against that, but that still doesn't mean that structured logs are the same thing as traces. Calling them "Logs 2.0" undercuts the value they provide.

That's why I won't bundle (traditional) logs and traces together, as already stated in #731 (comment).

* OpenTelemetry logs exported via OTLP
* OpenTelemetry traces exported via OTLP

Every subscriber can be toggled and customized using Helm values.
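As a rough sketch of what that might look like (the otlpLogs/otlpTraces key names below are assumptions mirroring the consoleLog block quoted later in this review; the actual names are defined in the PR diff), toggling the OTLP exporters via Helm values could be:

[source,yaml]
----
# Hypothetical keys for illustration only
telemetry:
  otlpLogs:
    enabled: true   # export OpenTelemetry logs via OTLP
    level: null     # keep the default level filter
  otlpTraces:
    enabled: true   # export OpenTelemetry traces via OTLP
    level: null
----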
Member

s/subscriber/logging endpoint/g? There's still probably a better word hiding somewhere in there though.

Member

I'm in two minds whether to call them Subscribers (as they are called in tracing_subscriber) or Exporters (OpenTelemetry parlance).

I do feel like Exporter is more intuitive without knowing the crates.

Member Author

Yes, I'd be in favour of calling them exporters.

Member

Also, "log to file" isn't really a useful option in its own right IMO (we're only writing to a container file, after all); that's more of an implementation detail for shipping to something like Vector.

Member Author

These log files can also be picked up by other log scraping solutions users might want to implement. For us, yes, they are mostly only useful to feed Vector.

Member

No, this is a vital distinction: the documentation is defining our supported interfaces. Do we support scraping the raw JSON files? Is that a stable format that we're committing to? How do we suggest that people hook in to scrape it?

Member Author
@Techassi Apr 28, 2025

It should be noted that it is currently (25.3.0) not possible to use any of the (rolling) file log functionality via Helm. The required env vars are just not set by the Deployment template. We corrected this by introducing this control knob.

Is that a stable format that we're committing to?

In my opinion, we don't have to make any promises regarding the log format. This can also be pointed out in the documentation. If we want to make a commitment, we can discuss the details of this in a follow-up issue/PR (needs research).

How do we suggest that people hook in to scrape it?

I would say this is outside of the current PR. Also, I don't see the need to go into this too deeply, as we want to promote the OpenTelemetry approach anyway. But this is also something which can be tackled in a follow-up PR (needs refinement and planning).
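For what the control knob mentioned above could mean in practice, here is a minimal sketch of a Deployment template wiring a Helm value into an env var; the env var and value names are hypothetical and not taken from the actual chart:

[source,yaml]
----
# Hypothetical Deployment template excerpt; actual env var and value names
# are defined by the operator chart, not by this sketch.
{{- if .Values.telemetry.fileLog.enabled }}
env:
  - name: FILE_LOG_DIRECTORY
    value: {{ .Values.telemetry.fileLog.directory | quote }}
  - name: FILE_LOG_LEVEL
    value: {{ .Values.telemetry.fileLog.level | quote }}
{{- end }}
----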

Member

Speaking of stability in general, I think we gain some stability/predictability by using field names that align with [OpenTelemetry Semantic Conventions]. This will help them correlate with logs/traces/metrics emitted by external tools that also align with [OpenTelemetry Semantic Conventions].
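For illustration, a log record with semconv-aligned fields might carry attributes such as `service.name` and `k8s.pod.name`. This is a hypothetical record, not output captured from an operator:

[source,yaml]
----
# Hypothetical log record; the exact field set emitted by the operators
# is not specified in this PR.
body: "Reconciling object"
severity_text: INFO
attributes:
  service.name: hdfs-operator
  service.version: 25.7.0
  k8s.namespace.name: default
  k8s.pod.name: hdfs-operator-deployment-abc123
----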

Comment on lines +33 to +44
[source,yaml]
----
telemetry:
  consoleLog:
    enabled: true # <1>
    level: null # <2>
    format: null # <3>
----

<1> Boolean: `true, false`
<2> Enum: `error, warning, info, debug, trace`
<3> Enum: `plain, json`
Member

It feels very disjointed to have to jump back and forth between these, maybe just add the options inline? (enabled: true # true/false)

Member Author

I actually had that at one point. I've converted them to callouts to be a little more consistent with the rest of our documentation.

But I don't have strong feelings for either option.
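For comparison, the inline variant discussed above would look roughly like this (same options as the quoted snippet, just with the callouts folded into comments):

[source,yaml]
----
telemetry:
  consoleLog:
    enabled: true  # Boolean: true, false
    level: null    # Enum: error, warning, info, debug, trace
    format: null   # Enum: plain, json
----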

Comment on lines +55 to +56
enabled: false # <1>
level: null # <2>
Member

M'yeah, I get why these are treated as separate ("just jump to the platform you use, and go!"), but it also feels a bit strange that we treat these as entirely separate. I don't know, I don't have a good answer here.

Member Author
@Techassi Apr 28, 2025

but it also feels a bit strange that we treat these as entirely separate.

These being? I assume you mean the different signals (4 currently). Our Tracing instance allows these four signals to be customized individually. As such, it makes sense to give users the freedom to choose what to use/export. Users who might want to go all in with OpenTelemetry can then entirely disable console and file logging, as sketched below.

If this was not what you were getting at, please clarify what these refer to.
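As a sketch of that "all in on OpenTelemetry" setup (the OTLP exporter key names are assumptions mirroring the consoleLog/fileLog structure quoted elsewhere in this review):

[source,yaml]
----
# Hypothetical: console and file logging disabled, OTLP exporters enabled
telemetry:
  consoleLog:
    enabled: false
  fileLog:
    enabled: false
  otlpLogs:
    enabled: true
  otlpTraces:
    enabled: true
----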

Member

The different logging appenders, from a documentation perspective.

Currently they're documented as "We provide these four appenders. Each appender supports [these options]." I'm wondering if it might be clearer/more maintainable to write it as "We support these appenders, which can all be enabled/filtered [like this]. Each appender must be connected to a target, [like this]." (That is, focus on the commonalities, then branch off into the particulars of each appender.)

Member Author

The different logging appenders

We are not only talking about logging appenders. There is a console log subscriber, a file log appender, as well as log and trace exporters. That's why @NickLarsenNZ suggested calling these consumers "exporters" to have a unified name without exposing users to underlying crate details, see #731 (comment).

I'm wondering if it might be clearer/more maintainable to write it as "We support these appenders, which can all be enabled/filtered [like this] [...] That is, focus on the commonalities

Okay, I can see your point. We can do that, but this will result in "jumping back and forth" for the reader, which you pointed out as a negative point in #731 (comment).

So let me know how you weigh the pros and cons here.

Co-authored-by: Natalie Klestrup Röijezon <[email protected]>
@Techassi requested a review from NickLarsenNZ April 30, 2025 06:58
Projects
Status: Development: In Review
Successfully merging this pull request may close these issues: None yet

3 participants