Description
Environment: Mac OS, RHEL 8 (doesn't matter)
Configure traces and/or metrics GRPC exporter with invalid collector URL
# init tracing
trace_provider = TracerProvider(resource=resource)
trace_exporter = create_trace_exporter(exporter, otlp_config, jaeger_config)
trace_processor = BatchSpanProcessor(trace_exporter)
trace_provider.add_span_processor(trace_processor)
telemetry_trace.set_tracer_provider(trace_provider)
# init metrics
global _telemetry_meter # pylint: disable=global-statement
metrics_exporter = create_metrics_exporter(exporter, otlp_config)
metric_reader = PeriodicExportingMetricReader(
metrics_exporter, export_interval_millis=export_interval, export_timeout_millis=3000
)
metrics_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
telemetry_metrics.set_meter_provider(metrics_provider)
_telemetry_meter = telemetry_metrics.get_meter_provider().get_meter(service_name, str(service_version))
What is the expected behavior?
When executing:
meter_provider = telemetry_metrics.get_meter_provider()
if meter_provider and (shutdown_meter := getattr(meter_provider, 'shutdown', None)):
shutdown_meter(timeout_millis=3000)
it should shutdown in about 3 seconds
What is the actual behavior?
It takes ~60 seconds to shutdown with logs like
WARNING - opentelemetry.exporter.otlp.proto.grpc.exporter::_export:363 | Transient error StatusCode.UNAVAILABLE encountered while exporting metrics, retrying in 8s.
Additional context
From what it looks like both metrics and traces exporters use this base/mixin which has a timeout_millis
Now going level up where OTLPMetricExporter.shudown
calls it correctly
def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
OTLPExporterMixin.shutdown(self, timeout_millis=timeout_millis)
However, PeriodicExportingMetricReader.shutdown
calls OTLPMetricExporter.shutdown
with a different kwarg name which seems to be completely ignored
def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
deadline_ns = time_ns() + timeout_millis * 10**6
def _shutdown():
self._shutdown = True
did_set = self._shutdown_once.do_once(_shutdown)
if not did_set:
_logger.warning("Can't shutdown multiple times")
return
self._shutdown_event.set()
if self._daemon_thread:
self._daemon_thread.join(
timeout=(deadline_ns - time_ns()) / 10**9
)
self._exporter.shutdown(timeout=(deadline_ns - time_ns()) / 10**6) # <--- timeout vs timeout_millis
- given the use of
time_ns()
if correct kwarg name it would lead to the error of negative timeout value being supplied
As for traces, exporter calls OTLPExporterMixin.shutdown
without propagating any timeouts at all.
This leads to some bad behaviour when combined with k8s and async application since timeoutless thread lock blocks event loop and also leads to hanging containers in k8s cluster until they are killed.