Changes in time series of metrics caused by container name changes #37268

Open
mike9421 opened this issue Jan 16, 2025 · 2 comments
Assignees
Labels
bug Something isn't working receiver/prometheus Prometheus receiver waiting for author

Comments

@mike9421

mike9421 commented Jan 16, 2025

Component(s)

receiver/prometheus

What happened?

Description

When collecting metrics through kubernetes_sd_configs, occasional label changes may occur.

Steps to Reproduce

  1. Use kubernetes_sd_configs to configure a Prometheus scrape job, e.g.:
kubernetes_sd_configs:
- role: pod
  namespaces:
    names:
    - default
  selectors:
  - role: pod
    label: label=value-1

At this point, occasional label changes may already occur. You can also perform the following step to make it easier to detect when the issue occurs (a hypothetical detection helper is sketched after this list).
2. Add the cumulativetodelta processor to the Collector configuration to convert the metrics to delta temporality.
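
As a way to surface the flips, one can compare the label set seen for each series across consecutive exports. Below is a minimal, hypothetical Go helper, not part of the collector; the Sample type, readSamples stub, and field names are assumptions to be adapted to whatever the file exporter actually writes.

```go
// Hypothetical helper: track the label set seen for each (metric, instance)
// pair and report when it changes between scrapes.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a simplified view of one exported data point (assumed shape).
type Sample struct {
	Metric   string            // metric name
	Instance string            // scrape instance (host:port)
	Labels   map[string]string // resource/data-point attributes
}

// labelKey renders labels in a stable order so label sets can be compared.
func labelKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

// readSamples is a stub; in practice it would parse the file exporter output.
func readSamples() []Sample { return nil }

func main() {
	seen := map[string]string{} // (metric|instance) -> last observed label set
	for _, s := range readSamples() {
		id := s.Metric + "|" + s.Instance
		key := labelKey(s.Labels)
		if prev, ok := seen[id]; ok && prev != key {
			fmt.Printf("label change for %s:\n  before: %s\n  after:  %s\n", id, prev, key)
		}
		seen[id] = key
	}
}
```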

Expected Result

If the pod corresponding to the scraped instance does not change, the labels of the metrics should always remain the same.

Actual Result

The metric labels occasionally change even though the scraped pod stays the same.

Collector version

v0.95.0 (based on our troubleshooting, the latest version is likely affected as well)

Environment information

Environment

OS: kubernetes
Compiler: go 1.22.0

OpenTelemetry Collector configuration

exporters:
  file:
    path: ./otel.log
extensions:
  health_check:
  memory_ballast:
    size_mib: "256"
processors:
  batch:
    send_batch_size: 100
    send_batch_max_size: 100
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 2056
  cumulativetodelta:
receivers:
  prometheus:
    report_extra_scrape_metrics: true
    trim_metric_suffixes: false
    config:
      scrape_configs:
        - job_name: 'scrape-prometheus-open-metrics'
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - default
              selectors:
                - role: pod
                  field: spec.nodeName=$KUBE_NODE_NAME
          scrape_interval: 10s
          scrape_timeout: 10s
          metrics_path: /openMetrics
          scheme: http
          relabel_configs:
          - source_labels: [__address__]
            target_label: __address__
            regex: ([^:]+)(?::\d+)?
            replacement: $$1:26666
            action: replace
service:
  extensions:
    - health_check
    - memory_ballast
  pipelines:
    metrics/prometheus:
      receivers:
        - prometheus
      processors:
        - memory_limiter
        - cumulativetodelta
        - batch
      exporters:
        - file

Log output

Additional context

We discovered this issue because we observed that the value of a delta metric was equal to that of the cumulative metric, i.e. the series showed a sharp increase at that point.
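
To illustrate why a brief label change can produce such a spike: delta conversion is keyed on the series identity, so a point whose labels changed is treated as the first point of a brand-new series. The sketch below is not the cumulativetodelta processor's actual code; it assumes the first point of a new series is passed through unchanged, which matches the behavior we observed (delta value equal to the cumulative value).

```go
// Minimal sketch of delta conversion keyed on series identity. When a label
// flips, the identity changes, the previous baseline is lost, and the first
// "delta" of the new series equals the full cumulative value.
package main

import "fmt"

var last = map[string]float64{} // series identity -> last cumulative value

func toDelta(identity string, cumulative float64) float64 {
	prev, ok := last[identity]
	last[identity] = cumulative
	if !ok {
		return cumulative // assumption: a new series passes its first point through
	}
	return cumulative - prev
}

func main() {
	fmt.Println(toDelta(`requests{container="app-1"}`, 100)) // 100 (first point)
	fmt.Println(toDelta(`requests{container="app-1"}`, 110)) // 10
	// The container label flips briefly, creating a "new" series:
	fmt.Println(toDelta(`requests{container="app-2"}`, 120)) // 120, a sharp increase
}
```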

After troubleshooting, we believe that while OTel updates the target information, other code concurrently reads it, causing occasional, brief label errors. We believe the causes are the following; an illustrative sketch follows the example below.

  1. For targets with the same URL, Prometheus updates DiscoveredLabels with the information of the most recently processed target.
  2. Meanwhile, the prometheus receiver reads the target's DiscoveredLabels.
  3. Steps one and two run concurrently. If step two executes before step one has fully completed, step two reads the information of an unexpected target.

The following are examples.

[image attachment]
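
To make the suspected interleaving concrete, the following sketch (illustrative only, not the receiver's or Prometheus's actual code; the target type and the setDiscoveredLabels/getDiscoveredLabels names are assumptions) shows how a discovered-labels record shared by targets with the same URL, overwritten with the most recently processed target, can be read at the wrong moment and surface another target's labels.

```go
// Illustrative sketch of the suspected interleaving: a writer refreshes the
// shared "discovered labels" while a reader concurrently resolves labels for
// a specific target, so the reader can briefly observe another target's labels.
package main

import (
	"fmt"
	"sync"
	"time"
)

type target struct {
	mu               sync.Mutex
	discoveredLabels map[string]string
}

func (t *target) setDiscoveredLabels(l map[string]string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.discoveredLabels = l
}

func (t *target) getDiscoveredLabels() map[string]string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.discoveredLabels
}

func main() {
	shared := &target{discoveredLabels: map[string]string{"container": "app-1"}}

	// Writer: service discovery overwrites the shared labels with those of the
	// most recently processed target that has the same URL.
	go func() {
		for i := 0; ; i++ {
			shared.setDiscoveredLabels(map[string]string{"container": fmt.Sprintf("app-%d", i%2+1)})
			time.Sleep(time.Millisecond)
		}
	}()

	// Reader: the receiver builds attributes from whatever labels happen to be
	// stored at scrape time, so it may print "app-2" even while scraping the
	// pod behind "app-1".
	for i := 0; i < 5; i++ {
		fmt.Println(shared.getDiscoveredLabels()["container"])
		time.Sleep(time.Millisecond)
	}
}
```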

I also have a question: why does the prometheus receiver need to add nodeResources?

@mike9421 mike9421 added bug Something isn't working needs triage New item requiring triage labels Jan 16, 2025
@github-actions github-actions bot added the receiver/prometheus Prometheus receiver label Jan 16, 2025
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole
Contributor

I'm not quite following.

occasional label changes may occur.

Can you give an example of what you mean? Which label changed?

@dashpole dashpole added waiting for author and removed needs triage New item requiring triage labels Jan 16, 2025
@dashpole dashpole self-assigned this Jan 16, 2025