feat(telemetry): tokf telemetry sync — replay unsynced events to OTLP backend #89

@mpecan

Summary

Add a tokf telemetry sync subcommand that reads invocation events from the local SQLite database and replays any that were not successfully exported to the OTLP backend in real time.

Motivation

The OTel exporter introduced in #85 uses a best-effort model: it waits at most 200 ms for the OTLP flush before giving up (see ADR-0001). Under a slow or temporarily unavailable endpoint, the last invocation's metrics may not reach the backend.

However, every invocation is always written to SQLite first, and the synced_to_otel_at column (added in #85) tracks which rows have been successfully exported. A sync command can replay those rows at any time — from a cron job, a CI post-step, or manually.

Design

Schema (already in place from #85)

-- events table already has:
synced_to_otel_at TEXT  -- NULL = not yet synced
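As a rough sketch of how the selection could be expressed against this column, the function below builds the SELECT statement for a sync run. The events table and synced_to_otel_at column come from this issue; the id and timestamp column names, the SyncFilter struct, and the reading of --since as replacing the NULL filter (rather than narrowing it) are assumptions made for illustration.

```rust
/// Hypothetical filter options mirroring the proposed CLI flags.
pub struct SyncFilter {
    /// Value of --since, if given (ISO 8601 text, compared as a string).
    pub since: Option<String>,
    /// Value of --limit, if given.
    pub limit: Option<u32>,
}

/// Builds the SELECT statement and its positional parameters.
/// Assumption: --since replaces the IS NULL filter instead of narrowing it.
pub fn build_select(filter: &SyncFilter) -> (String, Vec<String>) {
    let mut sql = String::from("SELECT id, timestamp FROM events WHERE ");
    let mut params = Vec::new();
    match &filter.since {
        Some(since) => {
            // Bind the value as a parameter; never splice user input into SQL.
            sql.push_str("timestamp >= ?1");
            params.push(since.clone());
        }
        None => sql.push_str("synced_to_otel_at IS NULL"),
    }
    // Oldest first, so a --limit run drains the backlog in order.
    sql.push_str(" ORDER BY timestamp ASC");
    if let Some(n) = filter.limit {
        sql.push_str(&format!(" LIMIT {}", n));
    }
    (sql, params)
}
```

String comparison on timestamp only orders correctly if every row uses the same fixed-width UTC format, which is assumed here.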

Command

tokf telemetry sync [--since <ISO8601>] [--dry-run] [--limit N]
  • Queries WHERE synced_to_otel_at IS NULL (or --since override)
  • Builds raw OTLP HTTP export payloads with correct start_time_unix_nano / time_unix_nano from the stored timestamp column
  • POSTs directly to the configured OTLP endpoint (no SdkMeterProvider overhead, no background thread)
  • On success: updates synced_to_otel_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
  • --dry-run: prints what would be synced without sending
  • --limit N: sync at most N events (for incremental rollouts)
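Filling start_time_unix_nano / time_unix_nano requires turning the stored timestamp into Unix nanoseconds. The sketch below does this with the standard library only, assuming the timestamp column uses the same YYYY-MM-DDTHH:MM:SSZ shape as the strftime format above; the function name is hypothetical, and a real implementation might prefer a date-time crate.

```rust
/// Converts a UTC timestamp of the form YYYY-MM-DDTHH:MM:SSZ to Unix nanoseconds.
/// Returns None for any other shape, for out-of-range fields, or for dates
/// before 1970. Month lengths are not validated (e.g. Feb 30 is not rejected).
pub fn iso8601_to_unix_nanos(ts: &str) -> Option<u64> {
    let b = ts.as_bytes();
    if b.len() != 20
        || b[4] != b'-' || b[7] != b'-' || b[10] != b'T'
        || b[13] != b':' || b[16] != b':' || b[19] != b'Z'
    {
        return None;
    }
    // Parses a fixed-width, digits-only field.
    let num = |range: std::ops::Range<usize>| -> Option<i64> {
        let s = ts.get(range)?;
        if !s.bytes().all(|c| c.is_ascii_digit()) {
            return None;
        }
        s.parse().ok()
    };
    let (y, m, d) = (num(0..4)?, num(5..7)?, num(8..10)?);
    let (hh, mm, ss) = (num(11..13)?, num(14..16)?, num(17..19)?);
    if !(1..=12).contains(&m) || !(1..=31).contains(&d) || hh > 23 || mm > 59 || ss > 59 {
        return None;
    }
    // Days since 1970-01-01 in the proleptic Gregorian calendar
    // (the well-known "days from civil" computation, with March as month 0).
    let y_adj = if m <= 2 { y - 1 } else { y };
    let era = y_adj.div_euclid(400);
    let yoe = y_adj.rem_euclid(400);
    let mp = (m + 9) % 12;
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    let days = era * 146_097 + doe - 719_468;
    let secs = days * 86_400 + hh * 3_600 + mm * 60 + ss;
    u64::try_from(secs).ok()?.checked_mul(1_000_000_000)
}
```

Returning None instead of panicking lets the sync command count a malformed row as a failed event and carry on with the rest of the batch.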

Temporality

Uses Delta temporality (matching the real-time exporter). Each event is a single-invocation delta, so replaying them as historical deltas is semantically correct.
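To make the shape of a replayed delta concrete, here is a minimal sketch of a single Sum data point in the OTLP/HTTP JSON encoding. JSON is used only for readability; this issue proposes building the protobuf request directly. The metric name passed in, the function name, and the choice of a monotonic integer sum are assumptions, since the issue does not list the metrics being exported.

```rust
/// OTLP AggregationTemporality enum value for DELTA.
const AGGREGATION_TEMPORALITY_DELTA: u8 = 1;

/// Renders one delta Sum point as an OTLP/HTTP JSON export request body.
/// Returns None if the name would need JSON escaping or the window runs backwards.
pub fn delta_sum_json(name: &str, start_nanos: u64, end_nanos: u64, value: u32) -> Option<String> {
    // Keep the sketch free of a JSON library: refuse names that need escaping.
    if name.chars().any(|c| c == '"' || c == '\\' || c.is_control()) {
        return None;
    }
    // A delta point describes the interval [start, end].
    if start_nanos > end_nanos {
        return None;
    }
    // 64-bit integers are encoded as decimal strings in OTLP JSON.
    let point = format!(
        r#"{{"startTimeUnixNano":"{}","timeUnixNano":"{}","asInt":"{}"}}"#,
        start_nanos, end_nanos, value
    );
    let sum = format!(
        r#"{{"aggregationTemporality":{},"isMonotonic":true,"dataPoints":[{}]}}"#,
        AGGREGATION_TEMPORALITY_DELTA, point
    );
    let metric = format!(r#"{{"name":"{}","sum":{}}}"#, name, sum);
    Some(format!(
        r#"{{"resourceMetrics":[{{"scopeMetrics":[{{"metrics":[{}]}}]}}]}}"#,
        metric
    ))
}
```

Because each point carries its own start and end time, a backend that accepts historical deltas can place the replayed value at the moment of the original invocation rather than at sync time.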

Backend compatibility

  • Datadog, Grafana Mimir, New Relic, Honeycomb: accept historical OTLP with timestamps. ✓
  • Prometheus Pushgateway: Prometheus is pull-based and the Pushgateway rejects pushes that carry timestamps, so historical data cannot be replayed. ✗ (document limitation)

Implementation notes

  • No SdkMeterProvider — build ExportMetricsServiceRequest protobuf directly using opentelemetry-proto crate (or prost-generated types already pulled in by opentelemetry-otlp)
  • Sync is idempotent: re-running after a partial failure is safe
  • Should respect the same OTEL_EXPORTER_OTLP_* env vars as the real-time exporter
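For the environment variables, one plausible resolution order follows the OpenTelemetry exporter convention: a metrics-specific endpoint is used as given, otherwise /v1/metrics is appended to the generic endpoint, with http://localhost:4318 as the default. Whether the real-time exporter resolves the endpoint exactly this way is an assumption, and the function name is hypothetical.

```rust
/// Resolves the OTLP/HTTP metrics URL from an environment lookup.
/// The lookup is injected so the logic can be tested without touching
/// the process environment.
pub fn resolve_metrics_endpoint<F>(lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    // The signal-specific variable wins and is used exactly as given.
    if let Some(url) = lookup("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT").filter(|v| !v.is_empty()) {
        return url;
    }
    // Otherwise the generic endpoint is a base URL that gets the signal path appended.
    let base = lookup("OTEL_EXPORTER_OTLP_ENDPOINT")
        .filter(|v| !v.is_empty())
        .unwrap_or_else(|| "http://localhost:4318".to_string());
    format!("{}/v1/metrics", base.trim_end_matches('/'))
}
```

In the command itself this would be called as resolve_metrics_endpoint(|key| std::env::var(key).ok()).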

Acceptance Criteria

  • tokf telemetry sync replays all synced_to_otel_at IS NULL events
  • synced_to_otel_at updated in DB on successful export
  • --dry-run flag prints events without sending
  • --limit N limits batch size
  • Exit code 0 on success, 1 if any event failed to export
  • Works with existing otel-http feature; documents gRPC limitation
  • Unit tests for payload construction; integration test with local OTel Collector
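The criteria above could be met by a loop along the following lines. The Exporter and SyncStore traits, SyncReport, and sync_events are hypothetical names introduced so the control flow can be unit-tested without a network or a database; they are not part of the existing code.

```rust
/// Hypothetical seam over the HTTP POST to the OTLP endpoint.
pub trait Exporter {
    fn export(&mut self, event_id: i64) -> Result<(), String>;
}

/// Hypothetical seam over the SQLite UPDATE of synced_to_otel_at.
pub trait SyncStore {
    fn mark_synced(&mut self, event_id: i64) -> Result<(), String>;
}

/// Outcome of one sync run.
pub struct SyncReport {
    pub synced: usize,
    pub failed: usize,
}

impl SyncReport {
    /// 0 when every event was synced, 1 if any event failed.
    pub fn exit_code(&self) -> i32 {
        if self.failed == 0 { 0 } else { 1 }
    }
}

pub fn sync_events(
    ids: &[i64],
    dry_run: bool,
    exporter: &mut dyn Exporter,
    store: &mut dyn SyncStore,
) -> SyncReport {
    let mut report = SyncReport { synced: 0, failed: 0 };
    for &id in ids {
        if dry_run {
            // Dry run: report the event, send nothing, change nothing.
            println!("would sync event {}", id);
            continue;
        }
        // Mark a row only after its export succeeded, so a failed or
        // interrupted run leaves it eligible for the next one.
        match exporter.export(id).and_then(|()| store.mark_synced(id)) {
            Ok(()) => report.synced += 1,
            Err(err) => {
                eprintln!("event {} not synced: {}", id, err);
                report.failed += 1;
            }
        }
    }
    report
}
```

One caveat with this ordering: if the export succeeds but the database update fails, the row stays unsynced and a later run would send the same delta again, so that case may deserve separate handling.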

Relation

Depends on: #85 (schema and synced_to_otel_at column already added)
Related: #87 (tokf telemetry status), #88 (docs)
Referenced in: ADR-0001, consequence "Option C remains open"

Labels: enhancement
