Open
My Prometheus scrape jobs stopped getting any data from rack2 after it was updated to omicron commit ae3ca81. Most of the scrape jobs query for data points collected in the most recent 1 or 2 minutes. They returned data consistently until this recent rack update. I was able to get some data when I used "@now() - 45m" as the interval, but I've also seen no data for a whole hour, e.g.
oxide experimental system timeseries query --query 'get hardware_component:amd_cpu_tctl | filter timestamp > @now() - 1h | last 1'
{
  "tables": [
    {
      "name": "hardware_component:amd_cpu_tctl",
      "timeseries": {}
    }
  ]
}
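To see how far back data is missing, the same query can be repeated over several lookback windows and each result checked for an empty "timeseries" object. The following is a minimal Python sketch of that check, not part of the oxide CLI: the helper names build_query and has_data are hypothetical, the "2m" window is assumed to follow the same form as the "45m" and "1h" values used above, and a non-empty result is assumed to be any non-empty "timeseries" object.

```python
import json


def build_query(timeseries: str, lookback: str) -> str:
    """Build a query string of the same form as the one shown above."""
    return f"get {timeseries} | filter timestamp > @now() - {lookback} | last 1"


def has_data(cli_output: str) -> bool:
    """Return True if any table in the JSON output has a non-empty timeseries."""
    result = json.loads(cli_output)
    return any(table.get("timeseries") for table in result.get("tables", []))


# The empty response from the report above.
EMPTY_OUTPUT = """
{
  "tables": [
    {"name": "hardware_component:amd_cpu_tctl", "timeseries": {}}
  ]
}
"""

if __name__ == "__main__":
    # "2m" is an assumed window; "45m" and "1h" are the ones tried above.
    for lookback in ("2m", "45m", "1h"):
        print(build_query("hardware_component:amd_cpu_tctl", lookback))
    print(has_data(EMPTY_OUTPUT))
```

Each printed query can be passed to `oxide experimental system timeseries query --query '...'` and the output fed to has_data to find the smallest window that still returns samples.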
In the oximeter logs, I see errors like the one below, which did not appear in earlier logs:
20:29:51.751Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    error = Telemetry database unavailable: SQL query timed out after 30.000955486s
    file = oximeter/collector/src/results_sink.rs:92
There are also frequent errors like the one below, but these were already present before the recent software update:
22:28:10.427Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full! This may indicate that the producer has a sampling interval that is too fast for the amount of data it generates
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/collection_task.rs:845
    interval = 1s
    producer_id = c334fc56-155a-4d7f-a2c9-e104f73603a2
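To compare how often each message appears before and after the update, the formatted log can be tallied by level and message. This is a rough Python sketch that assumes every record starts with a header line shaped like the two excerpts above (time, level, component, then the message after the first colon) and that the "key = value" lines belong to the preceding record. The tally function and the HEADER pattern are illustrative, not an oximeter tool.

```python
import re
from collections import Counter

# Header line, as seen in the excerpts: "<time>Z <LEVEL> <component>: <message>"
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}Z)\s+(\w+)\s+.*?:\s+(.*)$")


def tally(log_text: str) -> Counter:
    """Count log records by (level, message); field lines are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            _, level, message = match.groups()
            counts[(level, message)] += 1
    return counts


# Two records copied from the excerpts above (fields abbreviated).
SAMPLE = """\
20:29:51.751Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    error = Telemetry database unavailable: SQL query timed out after 30.000955486s
22:28:10.427Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full! This may indicate that the producer has a sampling interval that is too fast for the amount of data it generates
    interval = 1s
"""

if __name__ == "__main__":
    for (level, message), count in tally(SAMPLE).items():
        print(count, level, message)
```

Running tally over a log from before the update and one from after it would show whether only the "failed to insert" warning is new, as described above.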
The ClickHouse databases were up and running when I logged into them. I'll check whether their log files show anything useful from the moments when the database was reported unavailable.