
Changes in time series of metrics caused by container name changes #37268

@mike9421

Description


Component(s)

receiver/prometheus

What happened?

Description

When collecting metrics through kubernetes_sd_configs, the labels of the resulting metrics occasionally change even though the underlying pod does not.

Steps to Reproduce

  1. Use kubernetes_sd_configs to configure a Prometheus scrape job, e.g.:
kubernetes_sd_configs:
- role: pod
  namespaces:
    names:
    - default
  selectors:
  - role: pod
    label: label=value-1

At this point, occasional label changes may already occur. The following step makes the issue easier to detect when it happens.
2. Use the cumulativetodelta processor in the OTel configuration to convert the metrics to delta.

Expected Result

If the pod corresponding to the scraped instance does not change, the labels of the metrics should always remain the same.

Actual Result

Metric labels occasionally change even though the pod corresponding to the scraped instance does not.

Collector version

v0.95.0 (based on our troubleshooting, the latest version is likely to exhibit the same issue)

Environment information

Environment

OS: Kubernetes
Compiler: go 1.22.0

OpenTelemetry Collector configuration

exporters:
  file:
    path: ./otel.log
extensions:
  health_check:
  memory_ballast:
    size_mib: "256"
processors:
  batch:
    send_batch_size: 100
    send_batch_max_size: 100
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 2056
  cumulativetodelta:
receivers:
  prometheus:
    report_extra_scrape_metrics: true
    trim_metric_suffixes: false
    config:
      scrape_configs:
        - job_name: 'scrape-prometheus-open-metrics'
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - default
              selectors:
                - role: pod
                  field: spec.nodeName=$KUBE_NODE_NAME
          scrape_interval: 10s
          scrape_timeout: 10s
          metrics_path: /openMetrics
          scheme: http
          relabel_configs:
          - source_labels: [__address__]
            target_label: __address__
            regex: ([^:]+)(?::\d+)?
            replacement: $$1:26666
            action: replace
service:
  extensions:
    - health_check
    - memory_ballast
  pipelines:
    metrics/prometheus:
      receivers:
        - prometheus
      processors:
        - memory_limiter
        - cumulativetodelta
        - batch
      exporters:
        - file

Log output

Additional context

We discovered this issue because we observed that the value of the delta metric was the same as that of the cumulative metric, i.e. the delta series showed a sharp spike at a certain point.
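The spike is consistent with delta conversion being keyed on series identity (metric name plus label set). The following is a simplified, hypothetical Go sketch of that mechanism, not the actual cumulativetodelta code; it assumes the first point of a new series identity is emitted as-is, so a brief label change makes the converter treat the point as a brand-new series and emit the full cumulative value as a "delta":

```go
package main

import "fmt"

// seriesKey identifies a time series by metric name plus a flattened label set.
type seriesKey struct {
	metric string
	labels string // e.g. "container=app-1"
}

// deltaConverter tracks the previous cumulative value per series identity.
type deltaConverter struct {
	prev map[seriesKey]float64
}

// convert returns the delta since the last point of the same identity.
// A previously unseen identity has no baseline, so the full cumulative
// value is emitted (the spike observed in this issue).
func (d *deltaConverter) convert(k seriesKey, cumulative float64) float64 {
	p, seen := d.prev[k]
	d.prev[k] = cumulative
	if !seen {
		return cumulative
	}
	return cumulative - p
}

func main() {
	c := &deltaConverter{prev: map[seriesKey]float64{}}
	k := seriesKey{"http_requests_total", "container=app-1"}
	fmt.Println(c.convert(k, 100)) // first point: no baseline
	fmt.Println(c.convert(k, 110)) // normal delta of 10
	// Labels briefly flip to another target's labels: new identity,
	// so the emitted value equals the whole cumulative counter.
	kBad := seriesKey{"http_requests_total", "container=app-2"}
	fmt.Println(c.convert(kBad, 120))
}
```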

After troubleshooting, we believe that while OTel updates the target information, some metrics concurrently read that target information, causing occasional, brief label errors. We believe the causes are as follows.

  1. For targets with the same URL, Prometheus updates DiscoveredLabels with the information of the most recently processed target.
  2. Meanwhile, the prometheus receiver reads the target's DiscoveredLabels.
  3. Steps one and two run concurrently; if step two executes before step one has fully completed, step two reads the information of an unexpected target.
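The steps above can be sketched as a read/write race on a target's label set. This is an illustrative Go model with hypothetical names, not the actual Prometheus code; it shows the kind of synchronization that would prevent a reader from observing another target's half-applied update (copying the labels under a lock yields a stable snapshot):

```go
package main

import (
	"fmt"
	"sync"
)

// target models a scrape target whose discovered labels may be rewritten
// when service discovery processes another target with the same URL.
type target struct {
	mu     sync.Mutex
	labels map[string]string
}

// setDiscoveredLabels replaces the label set, as a discovery refresh would.
func (t *target) setDiscoveredLabels(l map[string]string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.labels = l
}

// discoveredLabels returns a snapshot copied under the lock, so a
// concurrent update can never be observed half-applied by a reader.
func (t *target) discoveredLabels() map[string]string {
	t.mu.Lock()
	defer t.mu.Unlock()
	cp := make(map[string]string, len(t.labels))
	for k, v := range t.labels {
		cp[k] = v
	}
	return cp
}

func main() {
	tgt := &target{labels: map[string]string{"container": "app-1"}}
	snap := tgt.discoveredLabels() // receiver takes a snapshot
	// Discovery concurrently rewrites the labels for the same URL.
	tgt.setDiscoveredLabels(map[string]string{"container": "app-2"})
	fmt.Println(snap["container"]) // snapshot still reflects app-1
}
```

Without the lock-and-copy, the receiver could read the map while discovery is rewriting it, which matches the brief label errors described above.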

An example is shown in the attached image.

Also, a question: why does the prometheus receiver need to add nodeResources?
