agent
An example program-under-observation instrumented with the OpenTelemetry Java agent.

Overview

The tech stack in this subproject:

  • A program-under-observation
    • This is a fictional "data processing" program written in Java, instrumented with the OpenTelemetry Java agent. A sketch of the idea follows this list.
  • A metrics sink/collector (Telegraf)
    • Telegraf acts as a sink for the metrics pushed by the OpenTelemetry agent. It re-formats the metrics into a format the metrics database accepts and then writes them to the database. In OpenTelemetry terminology, Telegraf is acting as the "collector".
  • A metrics database (InfluxDB)
    • InfluxDB is an open source time series database commonly used for metrics. Prometheus is an even more popular alternative, and there are many vendor options too, like Datadog.
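
The "data processing" work is deliberately simple: it only needs to keep the JVM busy enough that memory and garbage collection metrics are interesting. Below is a minimal sketch of that idea, assuming a plain main class that churns memory on a non-daemon thread. It is illustrative, not the actual source of this subproject.

      public class DataProcessor {

          public static void main(String[] args) {
              // A non-daemon thread keeps the JVM alive indefinitely, giving the
              // OpenTelemetry agent a long-running process to observe.
              new Thread(DataProcessor::processForever, "data-processor").start();
          }

          private static void processForever() {
              var batches = new java.util.ArrayList<byte[]>();
              while (true) {
                  // Allocate a "batch" of data so heap usage and GC activity show
                  // up in the jvm.memory.* metrics reported by the agent.
                  batches.add(new byte[256 * 1024]);
                  if (batches.size() > 100) {
                      batches.clear();
                  }
                  try {
                      Thread.sleep(500);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      return;
                  }
              }
          }
      }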

OpenTelemetry defines a protocol and a set of conventions, and it ships many client libraries that implement them for metric creation and metric collection. It does not, however, replace the database or the visualization tool. Remember, OpenTelemetry is not a complete observability stack.
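
For a feel of what those client libraries look like, here is a rough sketch of creating a metric with the OpenTelemetry Java API. This subproject needs none of this code because the agent auto-instruments the JVM; the meter and counter names below are made up for illustration.

      import io.opentelemetry.api.GlobalOpenTelemetry;
      import io.opentelemetry.api.metrics.LongCounter;
      import io.opentelemetry.api.metrics.Meter;

      public class ManualMetricSketch {

          public static void main(String[] args) {
              // The agent (or an SDK you configure yourself) supplies the global
              // OpenTelemetry instance; without one, these calls are no-ops.
              Meter meter = GlobalOpenTelemetry.getMeter("data-processor");
              LongCounter processed = meter.counterBuilder("batches.processed")
                      .setUnit("{batch}")
                      .build();
              processed.add(1);
          }
      }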

While OpenTelemetry covers metrics, logs, and spans, I'm only going to implement a metrics example.

Instructions

Follow these instructions to build and run the example system.

  1. Prerequisites: Java and Docker
    • I used Java 21.
  2. Start infrastructure services
    • docker-compose up
    • This starts Telegraf and InfluxDB.
    • Pay attention to the output of these containers as they run. It's a tricky system to set up, and you'll want to know if there are any errors, like if Telegraf is unable to connect to InfluxDB.
  3. Download the OpenTelemetry Java agent
    • AGENT_URL="https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.2.0/opentelemetry-javaagent.jar"
      curl --location --output opentelemetry-javaagent.jar "$AGENT_URL"
    • It's important that you use the --location (-L) flag because the GitHub URL redirects to some CDN URL at https://objects.githubusercontent.com/....
    • Note: in a production codebase, it would be better to handle agent-related things (URL config, downloading the agent, Java options) in the Gradle build. Unfortunately, the Gradle code required (e.g. https://stackoverflow.com/a/20968466) is a bit cryptic and distracting, so it's better to do these steps manually for the sake of clarity and "learning the core concepts" instead of learning Gradle.
  4. Build the program distribution
    • ./gradlew installDist
    • The distribution is in build/install/agent/. Notice the "start script" file at bin/agent. This script is generated by Gradle's built-in application plugin, and it provides extension points for us to customize its behavior. In particular, we'll use the JAVA_OPTS environment variable to set the -javaagent JVM option and instrument our program with the OpenTelemetry Java agent.
  5. Run the program with the agent
    • JAVA_OPTS="-javaagent:$(pwd)/opentelemetry-javaagent.jar -Dotel.javaagent.configuration-file=$(pwd)/open-telemetry.properties" ./build/install/agent/bin/agent
    • The program will run indefinitely and continuously submit OTLP-based metrics data to the Telegraf server.
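    • For reference, here is a sketch of what open-telemetry.properties might contain. The property names are real agent options, but the values (especially the endpoint) are assumptions; check the file in this subproject for the actual configuration.
      # Identify the program in the exported metrics.
      otel.service.name=agent
      # Export only metrics; this example doesn't use traces or logs.
      otel.metrics.exporter=otlp
      otel.traces.exporter=none
      otel.logs.exporter=none
      # Push OTLP data to the local Telegraf listener; export every 10 seconds (milliseconds).
      otel.exporter.otlp.endpoint=http://localhost:4317
      otel.metric.export.interval=10000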
  6. Inspect the metrics in InfluxDB directly
    • Start an influx session inside the InfluxDB container with the following command.
    • docker exec -it agent-influxdb-1 influx -precision rfc3339
    • The influx session may remind you of a SQL session. In it, you can run commands like SHOW DATABASES and SHOW MEASUREMENTS to explore the data. We named our database playground. Connect to it by issuing a use playground command, then execute a show measurements command. Hopefully it shows the metrics that have flowed from our program through Telegraf and into the Influx database, something like the following.
    • $ docker exec -it agent-influxdb-1 influx
      Connected to http://localhost:8086 version 1.8.10
      InfluxDB shell version: 1.8.10
      > use playground
      Using database playground
      > show measurements
      name: measurements
      name
      ----
      jvm.class.count
      jvm.class.loaded
      jvm.class.unloaded
      jvm.cpu.count
      jvm.cpu.recent_utilization
      jvm.cpu.time
      jvm.memory.committed
      jvm.memory.limit
      jvm.memory.used
      jvm.memory.used_after_last_gc
      jvm.thread.count
      queueSize
      
    • Let's inspect the memory usage over time for our "data processing" program. This is captured in the jvm.memory.used metric. See the snippet below for an example. The output shows the heap usage in MiB over time, and it traces the typical sawtooth pattern of allocation followed by garbage collection.
    • > SELECT SUM(gauge) / 1024 / 1024 AS "MiB" FROM "jvm.memory.used" WHERE "jvm.memory.type" = 'heap' GROUP BY time(10s)
      name: jvm.memory.used
      time                 MiB
      ----                 ---
      2024-03-30T19:38:00Z 49.924827575683594
      2024-03-30T19:38:10Z 26.35131072998047
      2024-03-30T19:38:20Z 31.070823669433594
      2024-03-30T19:38:30Z 35.52025604248047
      2024-03-30T19:38:40Z 40.108734130859375
      2024-03-30T19:38:50Z 44.671913146972656
      2024-03-30T19:39:00Z 49.11351776123047
      
  7. Stop the Java program
    • Press Ctrl+C to stop the program from the same terminal window where you ran the program.
  8. Stop the infrastructure services
    • docker-compose down
    • It's important to run a proper down command so that the containers and the Compose-managed network are cleaned up. Otherwise, stale resources can cause confusing errors if you change the Docker Compose file and then try to bring the services back up.

Wish List

General clean-ups, TODOs and things I wish to implement for this project:

  • DONE Scaffold
  • DONE Do some "hello world"-style task in an indefinite loop. We must use a non-daemon thread. Our goal is to observe memory and garbage collection metrics, which are affected by this task.
  • DONE Download and wire in the OpenTelemetry Java agent
  • DONE Enable debug logs for the OpenTelemetry Java agent. I wasn't sure if it was working.
  • DONE Use a properties file. The commandline options are getting too long. With a properties file, you can lay out properties neatly on individual lines, and you can use comments.
  • DONE Set up Telegraf and InfluxDB using Docker Compose
    • DONE Step down to Influx v1. v2 is trouble because of Flux.
  • SKIP (No, auto-instrumenting the JVM is pretty good) Consider upgrading from the "hello world"-style task to a more realistic task. Use some framework/library that is instrumented by the OpenTelemetry Java agent. Maybe as simple as a cron job scheduled with Quartz? My goal is to exercise the instrumentation for a third-party library and see what the quality of the metrics is like (naming, volume, etc.). I'm not even sure what I'm looking for exactly; I'm still learning the "what" of OpenTelemetry.
    • I really want this now because the memory usage is so variable that I need to change the description of the output every time I visit this project. Whereas with something like a Quartz job, I can predict the metric pattern.
  • SKIP (For demo, I'd rather curl it from GitHub) Actually the agent is distributed in Maven Central? See this example project in open-telemetry/opentelemetry-java-examples.
  • DONE Forget Grafana? It's enough to just use the InfluxDB CLI to inspect the data. This is a more direct demo.
  • DONE Slow down the simulated data processing. I want a smoother memory usage line.
  • DONE More control and less verbosity with the memory. Do the same thing I just did in manual-instrumentation/.

Reference

  • OpenTelemetry docs: Automatic Instrumentation
    • This is what I'm using in this project.
  • OpenTelemetry: Semantic Conventions

    The benefit to using Semantic Conventions is in following a common naming scheme that can be standardized across a codebase, libraries, and platforms.

    • This is, to me, the strongest selling point of OpenTelemetry. Yet another specification can easily become "yet another abandoned specification on an ever-accumulating pile of noise", but the sheer weight of OpenTelemetry and its adoption across vendors, libraries, marketing, and mind-share means that this "specification of conventions" has staying power. Good!
  • GitHub repo: influxdata/influxdb-observability

    This repository is a reference for converting observability signals (traces, metrics, logs) to/from a common InfluxDB schema.