
Changes in time series of metrics caused by container name changes #37268

@mike9421

Description


Component(s)

receiver/prometheus

What happened?

Description

When collecting metrics through kubernetes_sd_configs, the labels of the resulting metrics occasionally change even though the underlying pod does not.

Steps to Reproduce

  1. Use kubernetes_sd_configs to configure a Prometheus scrape job, e.g.:
kubernetes_sd_configs:
- role: pod
  namespaces:
    names:
    - default
  selectors:
  - role: pod
    label: label=value-1

At this point, occasional label changes may already occur. The following step makes the issue easier to detect when it happens.
2. Use the cumulativetodelta processor in the OTel configuration to convert the metrics to delta.

Expected Result

If the pod corresponding to the scraped instance does not change, the labels of the metrics should always remain the same.

Actual Result

Metric labels occasionally change even though the pod corresponding to the scraped instance does not.

Collector version

v0.95.0 (based on our troubleshooting, the latest version is likely to exhibit the same issue)

Environment information

Environment

OS: Kubernetes
Compiler: go 1.22.0

OpenTelemetry Collector configuration

exporters:
  file:
    path: ./otel.log
extensions:
  health_check:
  memory_ballast:
    size_mib: "256"
processors:
  batch:
    send_batch_size: 100
    send_batch_max_size: 100
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 2056
  cumulativetodelta:
receivers:
  prometheus:
    report_extra_scrape_metrics: true
    trim_metric_suffixes: false
    config:
      scrape_configs:
        - job_name: 'scrape-prometheus-open-metrics'
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - default
              selectors:
                - role: pod
                  field: spec.nodeName=$KUBE_NODE_NAME
          scrape_interval: 10s
          scrape_timeout: 10s
          metrics_path: /openMetrics
          scheme: http
          relabel_configs:
          - source_labels: [__address__]
            target_label: __address__
            regex: ([^:]+)(?::\d+)?
            replacement: $$1:26666
            action: replace
service:
  extensions:
    - health_check
    - memory_ballast
  pipelines:
    metrics/prometheus:
      receivers:
        - prometheus
      processors:
        - memory_limiter
        - cumulativetodelta
        - batch
      exporters:
        - file

Log output

Additional context

We discovered this issue because we observed that the value of the delta metric was the same as that of the cumulative metric, i.e. the delta series showed a sharp spike at a certain point.
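The spike is consistent with delta conversion being keyed on series identity (metric name plus label set). The following is a simplified, hypothetical Go sketch of that mechanism, not the actual cumulativetodelta code; it assumes the first point of a new series identity is emitted as-is, so a brief label change makes the converter treat the point as a brand-new series and emit the full cumulative value as a "delta":

```go
package main

import "fmt"

// seriesKey identifies a time series by metric name plus a flattened label set.
type seriesKey struct {
	metric string
	labels string // e.g. "container=app-1"
}

// deltaConverter tracks the previous cumulative value per series identity.
type deltaConverter struct {
	prev map[seriesKey]float64
}

// convert returns the delta since the last point of the same identity.
// A previously unseen identity has no baseline, so the full cumulative
// value is emitted (the spike observed in this issue).
func (d *deltaConverter) convert(k seriesKey, cumulative float64) float64 {
	p, seen := d.prev[k]
	d.prev[k] = cumulative
	if !seen {
		return cumulative
	}
	return cumulative - p
}

func main() {
	c := &deltaConverter{prev: map[seriesKey]float64{}}
	k := seriesKey{"http_requests_total", "container=app-1"}
	fmt.Println(c.convert(k, 100)) // first point: no baseline
	fmt.Println(c.convert(k, 110)) // normal delta of 10
	// Labels briefly flip to another target's labels: new identity,
	// so the emitted value equals the whole cumulative counter.
	kBad := seriesKey{"http_requests_total", "container=app-2"}
	fmt.Println(c.convert(kBad, 120))
}
```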

After troubleshooting, we believe that while OTel updates the target information, some metrics concurrently read that target information, causing occasional, brief label errors. We believe the causes are as follows.

  1. For targets with the same URL, Prometheus updates DiscoveredLabels with the information of the most recently processed target.
  2. Meanwhile, the prometheus receiver reads the target's DiscoveredLabels.
  3. Steps one and two run concurrently; if step two executes before step one has fully completed, step two reads the information of an unexpected target.
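The steps above can be sketched as a read/write race on a target's label set. This is an illustrative Go model with hypothetical names, not the actual Prometheus code; it shows the kind of synchronization that would prevent a reader from observing another target's half-applied update (copying the labels under a lock yields a stable snapshot):

```go
package main

import (
	"fmt"
	"sync"
)

// target models a scrape target whose discovered labels may be rewritten
// when service discovery processes another target with the same URL.
type target struct {
	mu     sync.Mutex
	labels map[string]string
}

// setDiscoveredLabels replaces the label set, as a discovery refresh would.
func (t *target) setDiscoveredLabels(l map[string]string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.labels = l
}

// discoveredLabels returns a snapshot copied under the lock, so a
// concurrent update can never be observed half-applied by a reader.
func (t *target) discoveredLabels() map[string]string {
	t.mu.Lock()
	defer t.mu.Unlock()
	cp := make(map[string]string, len(t.labels))
	for k, v := range t.labels {
		cp[k] = v
	}
	return cp
}

func main() {
	tgt := &target{labels: map[string]string{"container": "app-1"}}
	snap := tgt.discoveredLabels() // receiver takes a snapshot
	// Discovery concurrently rewrites the labels for the same URL.
	tgt.setDiscoveredLabels(map[string]string{"container": "app-2"})
	fmt.Println(snap["container"]) // snapshot still reflects app-1
}
```

Without the lock-and-copy, the receiver could read the map while discovery is rewriting it, which matches the brief label errors described above.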

An example is shown in the attached image.

Also, a question: why does the prometheus receiver need to add nodeResources?
