conallob

Add guidance on labelling to the Naming and Rules best practices in docs/practices/naming.md and docs/practices/rules.md, specifically:

  • The primary purposes of job and instance
  • Include WARNINGS about accidentally stripping the job label, especially in multi-tenant systems

This Fixes #2690

@conallob
Author

Friendly ping @SuperQ @beorn7 ?

@conallob
Author

conallob commented Jul 28, 2025

Obligatory post-it note reminder: https://photos.app.goo.gl/Bkfir4wRiLtNVG4W8

@beorn7 beorn7 requested review from SuperQ and juliusv July 29, 2025 10:48
@beorn7
Member

beorn7 commented Jul 29, 2025

With my current patchy availability, there is little chance I get to this anytime soon. Maybe @juliusv has a qualified opinion here?

@conallob
Author

conallob commented Aug 3, 2025

Friendly ping @SuperQ @juliusv

@andrechalella

Hey @conallob, congrats for the nice PR!

Just one thing: maybe you could fix the eaach typo in

  • The job label is a primary key to differentiate metrics from eaach other.

There is a small confusion I would love to see fixed, in a paragraph just below one of your edits. It is:

To keep the operations clean, _sum is omitted if there are other operations,
as sum().

I don't understand the as sum() part. Like, "x is omitted if there are other operations such as x"? It doesn't make sense to me, in a very basic way. I know it's out of the scope of this PR, but maybe you could touch it to clarify.

Signed-off-by: Conall O'Brien <[email protected]>
@conallob
Author

> Hey @conallob, congrats for the nice PR!
>
> Just one thing: maybe you could fix the eaach typo in
>
>   • The job label is a primary key to differentiate metrics from eaach other.
>
> There is a small confusion I would love to see fixed, in a paragraph just below one of your edits. It is:

Fixed the typo.

> To keep the operations clean, _sum is omitted if there are other operations,
> as sum().
>
> I don't understand the as sum() part. Like, "x is omitted if there are other operations such as x"? It doesn't make sense to me, in a very basic way. I know it's out of the scope of this PR, but maybe you could touch it to clarify.

I'm afraid that best practice is unrelated.

It also makes sense as written, once you've written enough rules. It's weighing up the trade-off between tracking the chain of operations across a pipeline of rules vs the rule name growing unwieldy. Many of these best practices trace back to specific philosophies from Prometheus' predecessor.

If you still think it needs a polish, please file a separate doc bug.

@conallob
Author

Ping @juliusv, since @SuperQ is currently unavailable for life reasons.

Co-authored-by: Ben Kochie <[email protected]>
Signed-off-by: Conall O'Brien <[email protected]>
Co-authored-by: Ben Kochie <[email protected]>
Signed-off-by: Conall O'Brien <[email protected]>
@conallob conallob requested a review from SuperQ August 15, 2025 15:54
Keeping the metric name unchanged makes it easy to know what a metric is and
easy to find in the codebase.

IMPORTANT: `job` label acts as a primary key. It is **strongly** recommended that you use it to scope your PromQL expressions to the system you are monitoring.
Member

This is misleading. Prometheus doesn't have the concept of "primary key". Not even metric names are a "primary key".

Author

Fair, especially since folks used to SQL DBs will jump to the conclusion that it's a SQL DB, which it isn't.

Iterated on the language to avoid creating ambiguity

Iterate on the description of the job label, removing "primary key", given its association with SQL

Signed-off-by: Conall O'Brien <[email protected]>
@conallob
Author

PTAL

@conallob
Author

Friendly ping?

@conallob conallob requested a review from SuperQ August 25, 2025 08:36
@conallob
Author

conallob commented Oct 6, 2025

Friendly, you're not at SRECon EMEA this week, ping?

WARNING: When using `without`, be careful not to strip out the `job` label accidentally.

* `instance`
* The `instance` label will include the `ip:port` what was scraped, providing a crucial breadcrumb for debugging scrape time issues
Member

Suggested change
- * The `instance` label will include the `ip:port` what was scraped, providing a crucial breadcrumb for debugging scrape time issues
+ * The `instance` label by default will include the `ip:port` what was scraped.

## Labels

* `job`
* The `job` label is one of the few ubiquitious labels, set at scrape time, and is used to identify metrics scraped from the same target/exporter.
Member

Suggested change
- * The `job` label is one of the few ubiquitious labels, set at scrape time, and is used to identify metrics scraped from the same target/exporter.
+ * The `job` is a default target label set by the scrape configs and is used to identify metrics scraped from the same target/exporter.


* `job`
* The `job` label is one of the few ubiquitious labels, set at scrape time, and is used to identify metrics scraped from the same target/exporter.
* If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation
Member

I'm not sure this is really a useful note here, as this applies to all label matching.

Suggested change
- * If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation

Author

It applies to all labels. But job and instance are two uniform labels found on every metric, including ubiquitous synthetic metrics such as up

Member

Yes, but it's not related to job, but related to "target labels" and discovery. That is a different thing and related to querying, not creating labels.

Comment on lines +89 to +90
WARNING: When using `without`, be careful not to strip out the `job` label accidentally.

Member

This warning doesn't make a lot of sense to me. It has a high probability of being quoted as copy-pasta without being understood. Let's just drop it.

Suggested change
- WARNING: When using `without`, be careful not to strip out the `job` label accidentally.

Author

I don't want to encourage copy-pasta, but this is an important point.

If you use alerting expressions like up{job=bla} > 0 for 3m, you need to be careful not to accidentally strip the job label. If you do, your alert no longer works as intended.

I'll polish the wording here
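
To make the concern above concrete, here is a minimal Python sketch of what aggregating with `without` does to label sets. It is not Prometheus code: `sum_without` is a hypothetical helper that only approximates `sum without (...)`, and the job names, addresses, and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical `up` samples from two unrelated jobs (all values are made up).
SERIES = [
    ({"job": "api", "instance": "10.0.0.1:8080"}, 1.0),
    ({"job": "api", "instance": "10.0.0.2:8080"}, 1.0),
    ({"job": "node", "instance": "10.0.0.1:9100"}, 0.0),
]


def sum_without(series, dropped):
    """Roughly mimic PromQL `sum without (<dropped>) (...)`: drop the named
    labels, then add up every sample that shares the remaining label set."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        totals[key] += value
    return dict(totals)


# Dropping only `instance` keeps one result per job.
per_job = sum_without(SERIES, {"instance"})

# Dropping `job` as well merges both jobs into a single series with no labels.
merged = sum_without(SERIES, {"job", "instance"})
```

In this toy data, `per_job` still distinguishes the `node` job (which is down) from the `api` job, while `merged` has a single unlabeled result, so a later `job` matcher has nothing left to match on.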

Member

This doesn't seem related to naming practices, which is what this guide is about.

Co-authored-by: Ben Kochie <[email protected]>
Signed-off-by: Conall O'Brien <[email protected]>
@conallob
Author

conallob commented Oct 7, 2025

For perspective, one of the motivations behind this PR is the anti-pattern of writing alert expressions for a single-tenant system that later evolves into a multi-tenant system.

e.g. up{} for 5m without a job matcher works as long as there is only one job.

Once you start adding additional jobs that match on the same labels (e.g. DaemonSets, fleet-wide node_exporter, etc.), teams start getting paged for systems they don't own or care about.
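
The scenario described above can be sketched in a few lines of Python. This is only an illustration of equality label matching, not how Prometheus evaluates alerts: `down_targets` is a hypothetical helper, and the `checkout` and `node_exporter` targets are invented.

```python
# Hypothetical scrape results; job names and addresses are invented.
UP = [
    {"job": "checkout", "instance": "10.0.0.1:8080", "value": 1},
    {"job": "node_exporter", "instance": "10.0.0.1:9100", "value": 0},
    {"job": "node_exporter", "instance": "10.0.0.2:9100", "value": 0},
]


def down_targets(series, **matchers):
    """Return the series that are down (value 0) and whose labels equal every
    given matcher, loosely like `up{...} == 0`."""
    return [
        s for s in series
        if s["value"] == 0 and all(s.get(k) == v for k, v in matchers.items())
    ]


# No job matcher: every down target fires, whoever owns it.
unscoped = down_targets(UP)

# Scoped to one job: only that team's targets can fire.
scoped = down_targets(UP, job="checkout")
```

With this data, the unscoped query returns the two `node_exporter` targets, which the hypothetical checkout team does not own, while the scoped query returns nothing because the checkout target is up.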

Development

Successfully merging this pull request may close these issues.

Best Practice Docs Don't Call Out the Importance of Job Label

4 participants