Telemetry Spec #352
Conversation
✅ Docs preview ready. The preview is ready to be viewed.
File Changes: 0 new, 4 changed, 0 removed
Build ID: 190b19c826d793a23fc800ec
URL: https://www.apollographql.com/docs/deploy-preview/190b19c826d793a23fc800ec
specs/telemetry.md
Outdated
| | `apollo_mcp.schema.load_success` / `apollo_mcp.schema.load_failure` | Counter | schema_source | Schema load status | Must Have |
| | `apollo_mcp.schema.size` | Gauge | — | # of types/fields in schema | Should Have |
| | `apollo_mcp.version.info` | Attribute/Event | server_version, schema_hash, manifest_version, manifest_source | Server binary version, GraphQL schema hash, manifest version, manifest type (persisted_query/operation_collection) | Must Have |
| **Usage** | `apollo.mcp.calls` | Counter | tool_name, success, error_code, client_type | Total tool invocations | Must Have |
What are your thoughts for `client_type`? AFAIK, that data is unfortunately only available to us during the initialization call.
specs/telemetry.md
Outdated
| Category | Metric / Trace / Event | Type | Attributes | Notes | Priority |
|---|---|---|---|---|---|
| **Configuration** | `apollo_mcp.config.load_success` | Counter | error_type | Successful config / startup loads | Must Have |
We have mostly been using `apollo.mcp` instead of `apollo_mcp` as the namespacing in our implementation so far.
I think you all should have one more look over a few of the OTEL docs before you officially decide how you're going to be aggregating this data. I think people will find a few things in what's currently in this PR quite surprising, both from an OTEL point of view and compared to what we have in the router.
Docs that you might find helpful to read:
- Naming conventions
- HTTP metrics
- System metrics
- Gen AI metrics, which, while not specific to MCP, still provide good inspiration
Any particular reason we are using `apollo_mcp` instead of `apollo.mcp`, like we do with `apollo.router` (docs)?
specs/telemetry.md
Outdated
| **Configuration** | `apollo_mcp.config.load_success` | Counter | error_type | Successful config / startup loads | Must Have |
| | `apollo_mcp.config.load_failure` | Counter | error_type | Failed startup (bad schema, manifest, endpoint) | Must Have |
It's a bit strange to have two separate metrics for `success` and `failure`. You'd usually have a `status` attribute on a metric, like an `apollo.mcp.config.load.count`, similar to the `system.process.count` described in the OTel docs. You can have an additional attribute of `error` for the error message or `code` for the error code.
specs/telemetry.md
Outdated
| **Usage** | `apollo.mcp.calls` | Counter | tool_name, success, error_code, client_type | Total tool invocations | Must Have |
| | `apollo.mcp.calls.latency` | Histogram | tool_name, success, error_code, client_type | End-to-end request latency | Must Have |
Are you going to record the latency for all the calls together, or is a single call going to be recorded as a single measurement in the histogram? If the latter, the semantic conventions advise using a singular noun (docs).
Instead of `apollo.mcp.calls.latency`, you should probably use `apollo.mcp.call.duration`, much like `http.server.request.duration` does (docs).
As for the `apollo.mcp.calls` counter, I am not sure you need a counter, since you would have the `apollo.mcp.call.duration` histogram, which you could just do a `count` on, for example:

> a single call is going to be recorded as a single call in the histogram

I'll update this to use the singular `call`.
Agree that the count can be derived via the histogram. Does it follow convention to not have a standalone counter in that case?
specs/telemetry.md
Outdated
| | `apollo.mcp.operation.calls` | Counter | tool_name, success, error_code, client_type, operation_name | # of backend GraphQL operations executed | Must Have |
| | `apollo.mcp.operation.latency` | Histogram | tool_name, success, error_code, client_type, operation_name | Latency of GraphQL backend call (excludes tool overhead) | Must Have |
Similar to the comment about `apollo.mcp.calls.latency`, this should also be `apollo.mcp.operation.duration`, and you don't need a separate counter.
specs/telemetry.md
Outdated
| | `apollo.mcp.calls.latency` | Histogram | tool_name, success, error_code, client_type | End-to-end request latency | Must Have |
| | `apollo.mcp.operation.calls` | Counter | tool_name, success, error_code, client_type, operation_name | # of backend GraphQL operations executed | Must Have |
| | `apollo.mcp.operation.latency` | Histogram | tool_name, success, error_code, client_type, operation_name | Latency of GraphQL backend call (excludes tool overhead) | Must Have |
| | `apollo_mcp.operation.type.mix` | Counter | query, mutation, subscription | Breakdown of operation types | Should Have |
The operation type should live as an attribute on `apollo.mcp.operation.duration`.
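As a sketch (plain Python, hypothetical names): with the operation type carried as an attribute on the duration histogram, the "type mix" is just a group-by over that attribute, so no separate `apollo_mcp.operation.type.mix` counter is needed.

```python
from collections import Counter

# Hypothetical sketch: each operation records (type, duration) into one
# histogram-like series, with type as an attribute.
operation_durations = []

def record_operation(operation_type, duration_ms):
    operation_durations.append((operation_type, duration_ms))

record_operation("query", 30.0)
record_operation("query", 45.0)
record_operation("mutation", 80.0)

# The operation-type breakdown is derived by grouping on the attribute.
type_mix = Counter(op_type for op_type, _ in operation_durations)
```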
specs/telemetry.md
Outdated
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
This could also be a gauge, so you can monitor the number of client connections at any given time. You can probably mirror `http.client.open_connections` here (docs). This can be called something like `apollo.mcp.client.open_connections`. I am not sure if you would want both a gauge and an event though.
specs/telemetry.md
Outdated
| | Span: `serialization` | Trace | size_bytes, latency | Encoding/decoding JSON-RPC overhead | Nice to Have |
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
| | `apollo_mcp.config.reload` | Event | schema_source, version_hash | Config/schema/manifest/collection reload | Nice to Have |
How is this different from the first item on the list, `apollo.mcp.config.load`? As in, if you wanted to track reloads specifically, I'd add that as an attribute to `apollo.mcp.config.load` and extract it out of that.
specs/telemetry.md
Outdated
| **Resource Usage** | `process.cpu.usage` | Gauge | — | CPU usage of MCP server process | Nice to Have |
| | `process.memory.usage` | Gauge | — | Memory usage | Nice to Have |
Are you able to get this information directly in the MCP server implementation? That's pretty cool!
This I'm not sure of. We are using many of the same SDKs as the router, and the process information might be better sourced from something like Kubernetes or wherever the system is hosted.
specs/telemetry.md
Outdated
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
| | `apollo_mcp.config.reload` | Event | schema_source, version_hash | Config/schema/manifest/collection reload | Nice to Have |
| | `apollo_mcp.auth.failed` | Event | client_type, reason | Auth failure | Must Have |
I would track this as an error type on the `http.server.request.duration` metric I mentioned a few comments ago. But if you do want a specific event, I'd count both successes and failures under a `status` attribute.
specs/telemetry.md
Outdated
| | `apollo_mcp.auth.failed` | Event | client_type, reason | Auth failure | Must Have |
| **Resource Usage** | `process.cpu.usage` | Gauge | — | CPU usage of MCP server process | Nice to Have |
| | `process.memory.usage` | Gauge | — | Memory usage | Nice to Have |
| | `network.bytes.sent` / `network.bytes.received` | Counter | — | Network traffic | Nice to Have |
This is usually tracked under `http.client|server.request|response.body.size`. Or are you proposing a different sort of aggregation of body size?
No, we should be using the semantic conventions for HTTP metrics; we are already logging `http.server.request.size`, etc.
I'll update these to reflect that.
@DaleSeo absolutely, this is not meant to be a replica of the docs, but I wanted to share thoughts on all the potential telemetry we might want to have, and to have a place for discussion and things to reason about. I'll close this PR at some point in the near future.
Thanks @lrlna!
specs/telemetry.md
Outdated
| | `apollo.mcp.responses.characters` | Histogram | tool_name, client_type | Character count of response payloads (proxy for token estimation) | Nice to Have |
| | `apollo.mcp.clients.active` | Gauge | — | # of active MCP clients | Must Have |
| | `apollo.mcp.concurrency.current_requests` | Gauge | — | # of concurrent tool executions | Should Have |
| **Errors / Reliability** | `apollo.mcp.requests.errors` | Counter | error_type, tool_name, client_type | Failed tool calls (generic catch-all) | Must Have |
I think these errors should live within the counters/durations on the tool calls rather than being a separate metric.
Add telemetry spec for documentation and requirement discussion. Don't need to merge this but wanted to start writing things down.