Oximeter frequently hit SQL query timeout errors on rack2 #8595

@askfongjojo

My Prometheus scrape jobs stopped getting any data from rack2 after it was updated to omicron commit ae3ca81. Most of the scrape jobs look for data points collected in the most recent 1 or 2 minutes. They returned data consistently until this recent rack update. I was able to get some data when I used "@now() - 45m" as the interval, but I've also seen no data for a whole hour, e.g.

oxide experimental system timeseries query --query 'get hardware_component:amd_cpu_tctl | filter timestamp > @now() - 1h | last 1'
{
  "tables": [
    {
      "name": "hardware_component:amd_cpu_tctl",
      "timeseries": {}
    }
  ]
}
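
To check for this condition from a script rather than by eye, a minimal sketch is shown below. It assumes the CLI output always has the shape shown above (a top-level "tables" list whose entries carry "name" and "timeseries"); the helper name `tables_with_data` is hypothetical and not part of the oxide CLI.

```python
import json


def tables_with_data(response_text):
    """Return the names of tables whose "timeseries" field is non-empty.

    Assumes the response shape shown in the query output above.
    """
    response = json.loads(response_text)
    return [
        table["name"]
        for table in response.get("tables", [])
        if table.get("timeseries")  # an empty object ({}) counts as no data
    ]


# The empty response reported above.
sample = """
{
  "tables": [
    {
      "name": "hardware_component:amd_cpu_tctl",
      "timeseries": {}
    }
  ]
}
"""

print(tables_with_data(sample))  # prints []
```

An empty list means that no table returned any timeseries for the requested window.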

In the oximeter logs, I see errors like this, which did not appear in earlier logs:

20:29:51.751Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    error = Telemetry database unavailable: SQL query timed out after 30.000955486s
    file = oximeter/collector/src/results_sink.rs:92

There are also frequent errors like the one below, but they were already present before the recent software update:

22:28:10.427Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full! This may indicate that the producer has a sampling interval that is too fast for the amount of data it generates
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/collection_task.rs:845
    interval = 1s
    producer_id = c334fc56-155a-4d7f-a2c9-e104f73603a2

The ClickHouse databases were up and running when I logged into them. I'll check whether their log files show anything useful from the moments when the database was reported unavailable.
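
To line up the oximeter timeouts with the ClickHouse logs, it may help to pull out the timestamp of each timeout first. The sketch below assumes the formatted log layout shown in the excerpts above (a header line starting with an `HH:MM:SS.mmmZ` timestamp, followed by indented `key = value` lines); the function name `timeout_events` is hypothetical.

```python
import re

# Header lines start with a timestamp such as "20:29:51.751Z".
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})Z\s")
# The timeout message as it appears in the "error = ..." field.
TIMEOUT = re.compile(r"SQL query timed out after ([\d.]+)s")


def timeout_events(log_text):
    """Return (timestamp, seconds) pairs for each SQL timeout in the log."""
    events = []
    current_ts = None
    for line in log_text.splitlines():
        header = TIMESTAMP.match(line)
        if header:
            # Remember which log entry the following indented lines belong to.
            current_ts = header.group(1)
        hit = TIMEOUT.search(line)
        if hit and current_ts is not None:
            events.append((current_ts, float(hit.group(1))))
    return events


# The two log entries quoted in this issue.
sample_log = """\
20:29:51.751Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    error = Telemetry database unavailable: SQL query timed out after 30.000955486s
    file = oximeter/collector/src/results_sink.rs:92
22:28:10.427Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full!
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    interval = 1s
"""

print(timeout_events(sample_log))
```

Only the first entry is reported, since the "queue is full" error carries no timeout message.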
