[BUG] YACE returning last non-null value for S3 replication metric, instead of null/NaN #1626

Open
1 task done
notaninterestingusername opened this issue Jan 16, 2025 · 0 comments
Labels
bug Something isn't working

Is there an existing issue for this?

  • I have searched the existing issues

YACE version

0.58.0

Config file

apiVersion: v1alpha1
sts-region: us-east-1
discovery:
  jobs: 
  - type: s3
    regions:
      - us-east-1
    searchTags:
      - key: customKey
        value: customValue
    roles:
      - roleArn: "aws:role_arn"
    delay: 600
    metrics:
      - name: ReplicationLatency
        statistics:
        - Maximum
        - Minimum
        - Average
        - Sum
        period: 600
        length: 600
      - name: BytesPendingReplication
        statistics:
        - Maximum
        - Minimum
        - Average
        - Sum
        period: 600
        length: 600
      - name: OperationsPendingReplication
        statistics:
        - Maximum
        - Minimum
        - Average
        - Sum
        period: 600
        length: 600
      - name: OperationsFailedReplication
        statistics:
        - Maximum
        - Minimum
        - Average
        - Sum
        - SampleCount
        period: 600
        length: 600

Current Behavior

I'm attempting to scrape S3 replication metrics, and have observed some odd behaviour with one of the metrics, OperationsFailedReplication.

On CloudWatch, this metric produces a data point only when there is a replication action (that is, when objects are uploaded to the source bucket, a replication rule is in place, and replication metrics are enabled for the rule); otherwise it is null. If replication succeeds, the value is zero. If replication fails, the value equals the number of objects that could not be replicated.

YACE appears to emit the last non-null value that it encountered. This value changes only at the next replication, when the metric takes the new non-null value and holds it until the next change. This is not desired; we would expect YACE to emit nulls when there is no value to report.

Calling the CloudWatch get-metric-statistics API with an equivalent range (start and end times set to match YACE's current scraping interval of 5 minutes, with a delay of 10 minutes to match the YACE config shown above), the same period, and all other necessary parameters (metric name, dimension names and values, region, etc.) returns an empty list, [].
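For anyone wanting to repeat this cross-check, here is a minimal sketch of how the query window could be derived from the config values. The helper name `build_query`, the dimension values (`B1`, `B2`, `example-rule-id`), and the way the window is aligned are assumptions for illustration, not YACE's actual logic. The defaults follow the 600-second delay, length, and period from the config; pass `length_s=300` to reproduce the 5-minute range mentioned above.

```python
from datetime import datetime, timedelta, timezone


def build_query(now, delay_s=600, length_s=600, period_s=600):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    The window ends `delay_s` seconds before `now` and spans `length_s`
    seconds. This alignment is an assumption, not taken from YACE's source.
    """
    end = now - timedelta(seconds=delay_s)
    start = end - timedelta(seconds=length_s)
    return {
        "Namespace": "AWS/S3",
        "MetricName": "OperationsFailedReplication",
        # Placeholder dimension values; substitute your own buckets and rule.
        "Dimensions": [
            {"Name": "SourceBucket", "Value": "B1"},
            {"Name": "DestinationBucket", "Value": "B2"},
            {"Name": "RuleId", "Value": "example-rule-id"},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period_s,
        "Statistics": ["Sum", "SampleCount"],
    }


# Usage with boto3 (requires AWS credentials, so left commented out):
# import boto3
# client = boto3.client("cloudwatch", region_name="us-east-1")
# params = build_query(datetime.now(timezone.utc))
# print(client.get_metric_statistics(**params)["Datapoints"])
```

An empty `Datapoints` list from this call corresponds to the [] described above.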

I verified that YACE was in fact returning the last non-null value by hitting the /metrics endpoint on the container that YACE runs in; sure enough, it returns the last non-null value for the metric.
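The /metrics check can be scripted roughly as follows. This is a sketch that filters Prometheus exposition text for one metric. The prefix `aws_s3_operations_failed_replication` is my assumption about how the exported series is named, and the parser assumes sample lines carry no trailing timestamp.

```python
def find_samples(exposition_text, metric_prefix):
    """Return (series, value) pairs whose series name starts with metric_prefix.

    Assumes each sample line is "<series> <value>" with no trailing timestamp.
    """
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        if series.startswith(metric_prefix):
            samples.append((series, float(value)))
    return samples
```

Running this against two consecutive scrapes taken when CloudWatch has no data point shows whether the same value is still being exported.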

What could be causing this? How may this be fixed?

Attached are screenshots from the AWS console (UTC) and Grafana (which relies on metrics coming into Prometheus from YACE, UTC-8).

[Two attached screenshots: AWS console and Grafana]

Expected Behavior

YACE should return null/NaN when CloudWatch returns no data point for the queried window.
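To make the expectation concrete, here is a small sketch of the behaviour I have in mind. The function name `sample_for_scrape` is hypothetical and this is not YACE code; the datapoint shape follows the GetMetricStatistics response (a `Timestamp` plus one key per requested statistic).

```python
import math


def sample_for_scrape(datapoints, statistic):
    """Return the newest datapoint's statistic, or NaN if there are none.

    No value from an earlier scrape is carried forward.
    """
    if not datapoints:
        return math.nan
    latest = max(datapoints, key=lambda dp: dp["Timestamp"])
    return float(latest[statistic])
```

With an empty list, as returned by CloudWatch between replication events, this yields NaN rather than the previous value.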

Steps To Reproduce

  1. Run YACE with the configs detailed above
  2. Create two S3 buckets, B1 and B2
  3. Create a replication rule to copy objects from B1 to B2, but use an IAM role with insufficient permissions to do so
  4. Upload objects to B1; they will fail to replicate to B2. Metrics take about 5 minutes to appear in CloudWatch

Anything else?

No response

@notaninterestingusername notaninterestingusername added the bug Something isn't working label Jan 16, 2025