Skip to content

Exporters shutdown takes longer then a minute when failing to send metrics/traces #3309

Closed
@Elli-Rid

Description

@Elli-Rid

Environment: Mac OS, RHEL 8 (doesn't matter)

Configure traces and/or metrics GRPC exporter with invalid collector URL

  # init tracing
  trace_provider = TracerProvider(resource=resource)
  trace_exporter = create_trace_exporter(exporter, otlp_config, jaeger_config)
  trace_processor = BatchSpanProcessor(trace_exporter)
  trace_provider.add_span_processor(trace_processor)
  telemetry_trace.set_tracer_provider(trace_provider)

  # init metrics
  global _telemetry_meter  # pylint: disable=global-statement
  metrics_exporter = create_metrics_exporter(exporter, otlp_config)
  metric_reader = PeriodicExportingMetricReader(
      metrics_exporter, export_interval_millis=export_interval, export_timeout_millis=3000
  )
  metrics_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
  telemetry_metrics.set_meter_provider(metrics_provider)
  _telemetry_meter = telemetry_metrics.get_meter_provider().get_meter(service_name, str(service_version))

What is the expected behavior?
When executing:

meter_provider = telemetry_metrics.get_meter_provider()
if meter_provider and (shutdown_meter := getattr(meter_provider, 'shutdown', None)):
    shutdown_meter(timeout_millis=3000)

it should shutdown in about 3 seconds

What is the actual behavior?
It takes ~60 seconds to shutdown with logs like

WARNING - opentelemetry.exporter.otlp.proto.grpc.exporter::_export:363 | Transient error StatusCode.UNAVAILABLE encountered while exporting metrics, retrying in 8s.

Additional context
From what it looks like both metrics and traces exporters use this base/mixin which has a timeout_millis

Now going level up where OTLPMetricExporter.shudown calls it correctly

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        OTLPExporterMixin.shutdown(self, timeout_millis=timeout_millis)

However, PeriodicExportingMetricReader.shutdown calls OTLPMetricExporter.shutdown with a different kwarg name which seems to be completely ignored

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        deadline_ns = time_ns() + timeout_millis * 10**6

        def _shutdown():
            self._shutdown = True

        did_set = self._shutdown_once.do_once(_shutdown)
        if not did_set:
            _logger.warning("Can't shutdown multiple times")
            return

        self._shutdown_event.set()
        if self._daemon_thread:
            self._daemon_thread.join(
                timeout=(deadline_ns - time_ns()) / 10**9
            )
        self._exporter.shutdown(timeout=(deadline_ns - time_ns()) / 10**6)  # <--- timeout vs timeout_millis
  • given the use of time_ns() if correct kwarg name it would lead to the error of negative timeout value being supplied

As for traces, exporter calls OTLPExporterMixin.shutdown without propagating any timeouts at all.

This leads to some bad behaviour when combined with k8s and async application since timeoutless thread lock blocks event loop and also leads to hanging containers in k8s cluster until they are killed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions