docs: Add telemetry documentation page #731
Conversation
✅ Deploy Preview for stackable-docs ready!
Force-pushed from 9f996a9 to 75cf3d7
Clarified what I meant with the signals vs subscribers
Co-authored-by: Nick <[email protected]>
LGTM
*** xref:labels.adoc[Labels]
*** xref:logging.adoc[Logging]
*** xref:containerdebug.adoc[Container environment]
These labels are redundant, no? They should be pulled in automatically since they're the top headings of their respective pages.
Currently, they are.
I added them because I moved the files into the observability subfolder, but that broke links in other repos (operators, for example). So I reverted that change but kept the labels for the future, when we eventually move them again (which I think we should).
@@ -0,0 +1,86 @@
= Telemetry signals in operators |
I'd put operator logging as a subheader in logging.adoc instead, I think? Yes, "telemetry" is technically slightly more general, but "logging" is the term people know and love already (and the current split of "logging as in products, telemetry as in operators" makes even less sense IMO).
"logging" is the term people know and love already
I would like to avoid bundling all these signals into logging. It is more than that already, and will differ even more once we add support for metrics (or even profiling).
logging as in products, telemetry as in operators
There will be a transition period. Eventually, we also want to support / enable telemetry data for products. This will then form a complete picture, which can also be reflected in the documentation accordingly.
But for now, there are differences between operators and products which are worth pointing out.
I would like to avoid bundling all these signals into logging.
Tracing is (an extended form of) logging.
Metrics and profiling are not. They (would) deserve their own sections.
Tracing can be considered an extended form of logging, but the OpenTelemetry project specifies them as two different signals. As such, I will follow the common terminology and will not mix both together to avoid ambiguity.
@@ -0,0 +1,86 @@
= Telemetry signals in operators

Since SDP 25.7.0, all Stackable operators emit telemetry data in the following ways:
Same here, I think it's clearer to reword this focusing on "logging" rather than the pretty opaque "telemetry data". Once we do offer non-log telemetry (metrics), we can have an "oh, and we also export metrics via OTLP and Prom" note at the end (or just treat metrics as entirely separate).
(And yes, I'd argue that "traces" are just logs with a higher standard of correlation information, not a distinct concept.)
Same here, I think it's clearer to reword this focusing on "logging" rather than the pretty opaque "telemetry data"
I disagree. I think we want to be promoting "observability" of the platform and move on from "Logging" being the central term. It is merely one of the telemetry signals.
And yes, I'd argue that "traces" are just logs with a higher standard of correlation information
No, they are another telemetry signal. Otherwise they'd just be called logs.
Once we do offer non-log telemetry (metrics), we can have an "oh, and we also export metrics via OTLP and Prom" note at the end (or just treat metrics as entirely separate).
Even though we don't yet have operators emitting metrics, I think we can still begin advertising "observability" capabilities. Once metrics are available, they can be added into this documentation.
No, they are another telemetry signal. Otherwise they'd just be called logs.
Okay, what's the conceptual difference? When, in your view, would traditional flat logs still be preferable?
Okay, what's the conceptual difference?
A trace is a collection of spans, which contain temporal data to record (and eventually visualize) the distributed path of a specific interaction within one or more applications.
A log is just a piece of text with no duration, with no ability to be propagated across applications and most of the time without structured data.
When, in your view, would traditional flat logs still be preferable?
For most things, I would consider traditional logs rather cumbersome to work with, and as such not preferable to more structured data like traces (and metrics). Traditional logs are mostly only useful for a quick debugging session.
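(For illustration, a minimal sketch, loosely following the OpenTelemetry data model rather than the exact OTLP schema, of the difference described above: a flat log record versus a span belonging to a trace. The reconcile span, message and attribute values are made up.)

[source,yaml]
----
# A flat log record: a timestamped piece of text with no duration and no
# built-in way to correlate it with work happening elsewhere.
logRecord:
  timestamp: "2025-01-01T12:00:00Z"
  severity: INFO
  body: "Reconciled AirflowCluster default/airflow"  # made-up example message

# A span: part of a trace, carrying correlation identifiers and explicit
# start/end times (i.e. a duration), so it can be visualized along the
# distributed path of one interaction.
span:
  traceId: 4bf92f3577b34da6a3ce929d0e0e4736  # shared by all spans of one interaction
  spanId: 00f067aa0ba902b7
  parentSpanId: 53995c3f42cd8ad8             # links the span into the call hierarchy
  name: reconcile                            # made-up span name
  startTimeUnixNano: 1735732800000000000
  endTimeUnixNano: 1735732800250000000
  attributes:
    k8s.namespace.name: default
----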
Logs can still be structured (https://github.com/slog-rs/slog, https://datalust.co/seq, etc). Traces are "just" a convention for including more context. But it's not a fundamental change.
For most things, I would consider traditional logs rather cumbersome to work with, and as such not preferable to more structured data like traces (and metrics). Traditional logs are mostly only useful for a quick debugging session.
Exactly. Because it's just "Logs 2.0", not a distinct new concept.
I'm not sure what your point is here, but it seems to be "I can make one data structure look like something else to imply they are the same in one case", combined with coining distracting terms like "Logs 2.0", which isn't a thing.
Tracing is a conceptually different way of observing a system, no matter how it is serialized or transported. We should use the common terminology which isn't new anymore. Calling everything logging is not helpful.
IMO, structured logs are largely eliminated by tracing events.
I personally don't see value in logs that don't correlate to a span, especially when troubleshooting. But I'm also aware that people still cling to logs, hence why they are kept as an independent signal (the same content appearing in trace events, with the "structured" fields becoming span/resource attributes).
Logs can still be structured
Sure, I'm not arguing against that, but that still doesn't mean that structured logs are the same thing as traces. Calling them "Logs 2.0" undercuts the value they provide.
That's why I won't bundle (traditional) logs and traces together as already stated in #731 (comment).
* OpenTelemetry logs exported via OTLP
* OpenTelemetry traces exported via OTLP

Every subscriber can be toggled and customized using Helm values.
s/subscriber/logging endpoint/g? There's still probably a better word hiding somewhere in there though.
I'm in two minds whether to call them Subscribers (as they are called in tracing_subscriber) or Exporters (OpenTelemetry parlance).
I do feel like Exporter is more intuitive without knowing the crates.
Yes, I'd be in favour of calling them exporters.
Also, "log to file" isn't really a useful option in its own right IMO (we're only writing to a container file, after all), that's more of an implementation detail for shipping to something like Vector.
These log files can also be picked up by other log scraping solutions users might want to implement. For us, yes, they are mostly only useful to feed Vector.
No, this is a vital distinction: the documentation is defining our supported interfaces. Do we support scraping the raw JSON files? Is that a stable format that we're committing to? How do we suggest that people hook in to scrape it?
It should be noted that it is currently (25.3.0) not possible to use any of the (rolling) file log functionality via Helm. The required env vars are simply not set by the Deployment template. We corrected this by introducing this control knob.
Is that a stable format that we're committing to?
We don't have to make any promises regarding the log format in my opinion. This can also be pointed out in the documentation. If we want to make a commitment, we can discuss the details of this in a follow-up issue/PR (needs research).
How do we suggest that people hook in to scrape it?
I would say this is outside of the current PR. Also, I don't see the need to go into this too deeply, as we want to promote the OpenTelemetry approach anyway. But this is also something which can be tackled in a follow-up PR (needs refinement and planning).
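(For context, a hypothetical sketch of what such a control knob could look like: a Helm value for the file log exporter that the Deployment template turns into environment variables on the operator container. The value keys, env var names and directory below are illustrative assumptions, not the actual interface.)

[source,yaml]
----
# Hypothetical Deployment template excerpt: only set the file-log related
# env vars when the corresponding Helm value is enabled.
env:
  {{- if .Values.telemetry.fileLog.enabled }}
  - name: FILE_LOG_DIRECTORY  # assumed env var name
    value: /stackable/log     # assumed log directory
  - name: FILE_LOG_LEVEL      # assumed env var name
    value: {{ .Values.telemetry.fileLog.level | default "info" | quote }}
  {{- end }}
----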
Speaking of stability in general, I think we gain some stability/predictability by using field names that align with [OpenTelemetry Semantic Conventions]. This will help them correlate with logs/traces/metrics emitted by external tools that also align with [OpenTelemetry Semantic Conventions].
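(As an illustration of what that alignment looks like in practice, resource attributes named per the OpenTelemetry Semantic Conventions might read as follows; the values are made up.)

[source,yaml]
----
# Attribute names follow OpenTelemetry Semantic Conventions (service.*, k8s.*),
# so they line up with telemetry emitted by other tools using the same conventions.
resource:
  service.name: airflow-operator
  service.version: "25.7.0"
  k8s.namespace.name: default
  k8s.pod.name: airflow-operator-deployment-7d4f9c-abcde
----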
[source,yaml]
----
telemetry:
  consoleLog:
    enabled: true # <1>
    level: null # <2>
    format: null # <3>
----

<1> Boolean: `true, false`
<2> Enum: `error, warning, info, debug, trace`
<3> Enum: `plain, json`
It feels very disjointed to have to jump back and forth between these, maybe just add the options inline? (enabled: true # true/false)
I actually had that at one point. I've converted them to callouts to be a little more consistent with the rest of our documentation.
But I don't have strong feelings for either option.
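(For comparison, the inline variant discussed above would look roughly like this, with the same values as the callout version in the diff.)

[source,yaml]
----
telemetry:
  consoleLog:
    enabled: true  # true / false
    level: null    # error / warning / info / debug / trace
    format: null   # plain / json
----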
enabled: false # <1>
level: null # <2>
M'yeah, I get why these are treated as separate ("just jump to the platform you use, and go!"), but it also feels a bit strange that we treat these as entirely separate. I don't know, I don't have a good answer here.
but it also feels a bit strange that we treat these as entirely separate.
These being? I assume you mean the different signals (4 currently). Our Tracing instance allows customizing these four signals as individual parts. As such, it makes sense to give users the freedom to choose what to use / export. Users who might want to go all in with OpenTelemetry can then entirely disable console and file logging.
If this was not what you were getting at, please clarify what these refer to.
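(A hypothetical sketch of the "all in with OpenTelemetry" case mentioned above: console and file logging disabled, only the OTLP exporters enabled. All keys except consoleLog are assumed names, used here only for illustration.)

[source,yaml]
----
telemetry:
  consoleLog:
    enabled: false       # turn off console logging
  fileLog:               # assumed key name
    enabled: false       # turn off file logging
  otlpLogExporter:       # assumed key name
    enabled: true        # export OpenTelemetry logs via OTLP
  otlpTraceExporter:     # assumed key name
    enabled: true        # export OpenTelemetry traces via OTLP
----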
The different logging appenders, from a documentation perspective.
Currently they're documented as "We provide these four appenders. Each appender supports [these options]." I'm wondering if it might be clearer/more maintainable to write it as "We support these appenders, which can all be enabled/filtered [like this]. Each appender must be connected to a target, [like this]." (That is, focus on the commonalities, then branch off into the particulars of each appender.)
The different logging appenders
We are not only talking about logging appenders. There is a console log subscriber, a file log appender, as well as log and trace exporters. That's why @NickLarsenNZ suggested calling these consumers exporters, to have a unified name without exposing users to underlying crate details, see #731 (comment).
I'm wondering if it might be clearer/more maintainable to write it as "We support these appenders, which can all be enabled/filtered [like this] [...] That is, focus on the commonalities
Okay, I can see your point. We can do that, but it will result in "jumping back and forth" for the reader, which you pointed out as a negative in #731 (comment).
So let me know how you weigh the pros and cons here.
Co-authored-by: Natalie Klestrup Röijezon <[email protected]>
Part of stackabletech/issues#639