Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[chrony] Chrony plugin fails to report after connection timeout and never recovers #16495

Open
pkkrusty opened this issue Feb 10, 2025 · 1 comment
Labels
bug unexpected problem or unintended behavior

Comments

@pkkrusty
Copy link

Relevant telegraf.conf

[agent]
  ## Default data collection interval for all inputs
  interval = "3s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Collection offset is used to shift the collection by the given amount.
  ## This can be be used to avoid many plugins querying constraint devices
  ## at the same time by manually scheduling them in time.
  # collection_offset = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"
--------------
[[inputs.chrony]]
  ## Server address of chronyd with address scheme
  ## If empty or not set, the plugin will mimic the behavior of chronyc and
  ## check "unixgram:///run/chrony/chronyd.sock", "udp://127.0.0.1:323"
  ## and "udp://[::1]:323".
   server = "udp://192.168.1.23:323"

  ## Timeout for establishing the connection
#   timeout = "5s"

  ## Try to resolve received addresses to host-names via DNS lookups
  ## Disabled by default to avoid DNS queries especially for slow DNS servers.
  # dns_lookup = false

  ## Metrics to query named according to chronyc commands
  ## Available settings are:
  ##   activity    -- number of peers online or offline
  ##   tracking    -- information about system's clock performance
  ##   serverstats -- chronyd server statistics
  ##   sources     -- extended information about peers
  ##   sourcestats -- statistics on peers
   metrics = ["tracking"]

Logs from Telegraf

2025-02-10T07:39:54Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:39:54Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:39:57Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:39:57Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:00Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:00Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:01Z D! [outputs.influxdb] Wrote batch of 45 metrics in 980.984  s
2025-02-10T07:40:01Z D! [outputs.influxdb] Buffer fullness: 1 / 10000 metrics
2025-02-10T07:40:03Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:03Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:06Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:06Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:09Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:09Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:11Z D! [outputs.influxdb] Wrote batch of 43 metrics in 892.577  s
2025-02-10T07:40:11Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:40:12Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:12Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:15Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:15Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:18Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:18Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:21Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:21Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:21Z D! [outputs.influxdb] Wrote batch of 50 metrics in 931.2  s
2025-02-10T07:40:21Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:40:24Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:24Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:27Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T07:40:27Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 3s
2025-02-10T07:40:30Z I! Loading config: /etc/telegraf/telegraf.conf
2025-02-10T07:40:30Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.net" deprecated since version 1.29.0 and will be removed in 1.40.0: use 'fieldinclude' instead
2025-02-10T07:40:30Z W! Agent setting "logtarget" is deprecated, please just set "logfile" and remove this setting! The setting will be removed in v1.40.0.
2025-02-10T07:40:30Z I! Starting Telegraf 1.33.0 brought to you by InfluxData the makers of InfluxDB
2025-02-10T07:40:30Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-02-10T07:40:30Z I! Loaded inputs: chrony cpu mem mqtt_consumer net netstat processes swap
2025-02-10T07:40:30Z I! Loaded aggregators:
2025-02-10T07:40:30Z I! Loaded processors:
2025-02-10T07:40:30Z I! Loaded secretstores:
2025-02-10T07:40:30Z I! Loaded outputs: influxdb
2025-02-10T07:40:30Z I! Tags enabled: host=lxc-telegraf-influxdb-grafana
2025-02-10T07:40:30Z I! [agent] Config: Interval:3s, Quiet:false, Hostname:"lxc-telegraf-influxdb-grafana", Flush Interval:10s
2025-02-10T07:40:30Z D! [agent] Initializing plugins
2025-02-10T07:40:30Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'i>
2025-02-10T07:40:30Z D! [agent] Connecting outputs
2025-02-10T07:40:30Z D! [agent] Attempting connection to [outputs.influxdb]
2025-02-10T07:40:30Z D! [agent] Successfully connected to outputs.influxdb
2025-02-10T07:40:30Z D! [agent] Starting service inputs
2025-02-10T07:40:30Z I! [inputs.mqtt_consumer] Connected [tcp://192.168.1.8:1883]
2025-02-10T07:40:30Z D! [inputs.chrony] Connected to "udp://192.168.1.23:323"...
2025-02-10T07:40:40Z D! [outputs.influxdb] Wrote batch of 57 metrics in 2.321949ms
2025-02-10T07:40:40Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:40:50Z D! [outputs.influxdb] Wrote batch of 80 metrics in 1.959275ms
2025-02-10T07:40:50Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:41:00Z D! [outputs.influxdb] Wrote batch of 97 metrics in 2.197399ms
2025-02-10T07:41:00Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:41:10Z D! [outputs.influxdb] Wrote batch of 68 metrics in 2.081932ms
2025-02-10T07:41:10Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:41:20Z D! [outputs.influxdb] Wrote batch of 91 metrics in 1.938267ms
2025-02-10T07:41:20Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2025-02-10T07:41:30Z D! [outputs.influxdb] Wrote batch of 72 metrics in 7.884534ms
2025-02-10T07:41:30Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics

System info

Telegraf 1.33.0, Debian 12 LXC in Proxmox,

Docker

On container start, Telegraf collects metrics from Chrony (on a different machine) without issue for anywhere from a couple hours to a couple days. Then at some point it stops collecting data and I get the log shown above. Only way to get collection going again is to restart the Telegraf service. Note that collection interval is 3 seconds, which is pretty fast. However, once a single metric pull times out, it never recovers. I don't have this issue with any other input plugins I'm using (mem, processes, mqtt_consumer).

Steps to reproduce

  1. Use the same interval I did
  2. Pull chrony data from a different machine
  3. Cut the network connection to the different machine and after 10 seconds, re-attach
  4. Look at the logs.
    ...

Expected behavior

Chrony data collection resumes once re-connection happens and waiting metrics time out eventually and are dropped.

Actual behavior

Chrony waits for the delayed metrics forever and skips all subsequent collections.

Additional info

No response

@pkkrusty pkkrusty added the bug unexpected problem or unintended behavior label Feb 10, 2025
@pkkrusty
Copy link
Author

Also note that a service restart doesn't actually work. Systemd eventually kills the process after it times out after 90 seconds.

Feb 10 07:38:59 lxc-telegraf-influxdb-grafana systemd[1]: Stopping telegraf.service - Telegraf...
-- Subject: A stop job for unit telegraf.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A stop job for unit telegraf.service has begun execution.
-- 
-- The job identifier is 11473.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: telegraf.service: Killing process 7198 (telegraf) with signal SIGKILL.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: telegraf.service: Killing process 7199 (telegraf) with signal SIGKILL.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- An ExecStart= process belonging to unit telegraf.service has exited.
-- 
-- The process' exit code is 'killed' and its exit status is 9.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: telegraf.service: Failed with result 'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit telegraf.service has entered the 'failed' state with result 'timeout'.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: Stopped telegraf.service - Telegraf.
-- Subject: A stop job for unit telegraf.service has finished
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A stop job for unit telegraf.service has finished.
-- 
-- The job identifier is 11473 and the job result is done.
Feb 10 07:40:29 lxc-telegraf-influxdb-grafana systemd[1]: telegraf.service: Consumed 1h 23min 58.482s CPU time.
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit telegraf.service completed and consumed the indicated resources.

@pkkrusty pkkrusty changed the title [chrony] Chrony plugin fails to report after random period and never recovers [chrony] Chrony plugin fails to report after connection timeout and never recovers Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

1 participant