Conversation

@kbychu kbychu (Member) commented Sep 11, 2025

Add a telemetry spec for documentation and requirements discussion. We don't need to merge this, but I wanted to start writing things down.

@kbychu kbychu requested a review from a team as a code owner September 11, 2025 17:53
@apollo-librarian apollo-librarian bot commented Sep 11, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 4 changed, 0 removed
* (developer-tools)/apollo-mcp-server/(latest)/config-file.mdx
* (developer-tools)/apollo-mcp-server/(latest)/deploy.mdx
* (developer-tools)/apollo-mcp-server/(latest)/index.mdx
* (developer-tools)/apollo-mcp-server/(latest)/install.mdx

Build ID: 190b19c826d793a23fc800ec
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/190b19c826d793a23fc800ec

@kbychu kbychu self-assigned this Sep 11, 2025
@DaleSeo DaleSeo (Contributor) commented Sep 12, 2025

@kbychu This PR might be a duplicate of #338

| | `apollo_mcp.schema.load_success` / `apollo_mcp.schema.load_failure` | Counter | schema_source | Schema load status | Must Have |
| | `apollo_mcp.schema.size` | Gauge | — | # of types/fields in schema | Should Have |
| | `apollo_mcp.version.info` | Attribute/Event | server_version, schema_hash, manifest_version, manifest_source | Server binary version, GraphQL schema hash, manifest version, manifest type (persisted_query/operation_collection) | Must Have |
| **Usage** | `apollo.mcp.calls` | Counter | tool_name, success, error_code, client_type | Total tool invocations | Must Have |

Contributor:

What are your thoughts on client_type? AFAIK, that data is unfortunately only available to us during the initialization call.


| Category | Metric / Trace / Event | Type | Attributes | Notes | Priority |
|---------------------------|--------------------------------------------------------------------------|-----------------|--------------------------------------------------------------------|-------------------------------------------------------------------------|---------------|
| **Configuration** | `apollo_mcp.config.load_success` | Counter | error_type | Successful config / startup loads | Must Have |

Contributor:

We have mostly been using apollo.mcp rather than apollo_mcp as the namespace in our implementation so far.

@lrlna lrlna (Member) left a comment:

I think you all should have one more look over a few of the OTEL docs before you officially decide how you're going to aggregate this data. I think people will find a few things in this PR quite surprising, both from an OTEL point of view and compared with what we have in the router.

Docs that you might find helpful to read:

Member:

Any particular reason we are using apollo_mcp instead of apollo.mcp like we do with apollo.router (docs)?

Comment on lines 5 to 6
| **Configuration** | `apollo_mcp.config.load_success` | Counter | error_type | Successful config / startup loads | Must Have |
| | `apollo_mcp.config.load_failure` | Counter | error_type | Failed startup (bad schema, manifest, endpoint) | Must Have |

Member:

It's a bit strange to have two separate metrics for success and failure. You'd usually have a status attribute on a single metric, like an apollo.mcp.config.load.count, similar to system.process.count described in the otel docs. You can add an error attribute for the error message or a code attribute for the error code.
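For illustration, here is a minimal sketch of what that single counter could look like, assuming the opentelemetry Rust crate (recent releases finish instrument builders with .build(); older ones use .init()). The record_config_load helper and attribute names are illustrative, not the server's actual code:

```rust
use opentelemetry::{global, KeyValue};

// Hypothetical helper: one counter covers both outcomes; the status and
// error attributes distinguish success from failure.
fn record_config_load(success: bool, error_code: Option<&'static str>) {
    let meter = global::meter("apollo-mcp-server");
    let loads = meter.u64_counter("apollo.mcp.config.load.count").build();

    let mut attrs = vec![KeyValue::new(
        "status",
        if success { "success" } else { "failure" },
    )];
    if let Some(code) = error_code {
        attrs.push(KeyValue::new("error.code", code));
    }
    loads.add(1, &attrs);
}
```

A dashboard can then group or filter by status instead of joining two separate metrics.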

Comment on lines 17 to 18
| **Usage** | `apollo.mcp.calls` | Counter | tool_name, success, error_code, client_type | Total tool invocations | Must Have |
| | `apollo.mcp.calls.latency` | Histogram | tool_name, success, error_code, client_type | End-to-end request latency | Must Have |

Member:

Are you going to record the latency for all the calls together, or is a single call going to be recorded as a single call in the histogram? If the latter, the semantic convention advises using a singular noun (docs).

For apollo.mcp.calls.latency, you should probably use apollo.mcp.call.duration instead, much like http.server.request.duration (docs).

As for the apollo.mcp.calls counter, I am not sure you need it, since you would already have the apollo.mcp.call.duration histogram, which you could simply take a count over, for example:

[Screenshot: deriving a call count from the apollo.mcp.call.duration histogram]
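As a rough sketch of the recording side, assuming the opentelemetry Rust crate; timed_tool_call and execute_tool below are hypothetical stand-ins, not the server's actual code:

```rust
use std::time::Instant;

use opentelemetry::{global, KeyValue};

// Stand-in for the real tool execution.
fn execute_tool(_tool_name: &str) -> Result<(), String> {
    Ok(())
}

// Hypothetical wrapper: each call records exactly one measurement, so the
// histogram's own count aggregation replaces a standalone apollo.mcp.calls counter.
fn timed_tool_call(tool_name: &'static str) {
    let meter = global::meter("apollo-mcp-server");
    let duration = meter.f64_histogram("apollo.mcp.call.duration").build();

    let start = Instant::now();
    let result = execute_tool(tool_name);
    duration.record(
        start.elapsed().as_secs_f64(),
        &[
            KeyValue::new("tool_name", tool_name),
            KeyValue::new("success", result.is_ok()),
        ],
    );
}
```

The backend's count aggregation over this histogram then yields total calls per tool_name/success combination, which is what the standalone counter would have provided.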

Contributor:

> a single call is going to be recorded as a single call in the histogram

I'll update this to use the singular call

Agree that the count can be derived via the histogram. Does it follow convention to not have a standalone counter in that case?

Comment on lines 19 to 20
| | `apollo.mcp.operation.calls` | Counter | tool_name, success, error_code, client_type, operation_name | # of backend GraphQL operations executed | Must Have |
| | `apollo.mcp.operation.latency` | Histogram | tool_name, success, error_code, client_type, operation_name | Latency of GraphQL backend call (excludes tool overhead) | Must Have |

Member:

Similar to the comment about apollo.mcp.calls.latency, this should also be apollo.mcp.operation.duration, and you don't need a separate counter.

| | `apollo.mcp.calls.latency` | Histogram | tool_name, success, error_code, client_type | End-to-end request latency | Must Have |
| | `apollo.mcp.operation.calls` | Counter | tool_name, success, error_code, client_type, operation_name | # of backend GraphQL operations executed | Must Have |
| | `apollo.mcp.operation.latency` | Histogram | tool_name, success, error_code, client_type, operation_name | Latency of GraphQL backend call (excludes tool overhead) | Must Have |
| | `apollo_mcp.operation.type.mix` | Counter | query, mutation, subscription | Breakdown of operation types | Should Have |

Member:

The operation type should live as an attribute on the apollo.mcp.operation.duration metric.
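A small sketch of that shape, again assuming the opentelemetry Rust crate; record_operation and the attribute names here are illustrative only:

```rust
use opentelemetry::{global, KeyValue};

// Illustrative only: the query/mutation/subscription breakdown becomes an
// attribute on the duration histogram instead of a separate type.mix counter.
fn record_operation(duration_secs: f64, operation_name: &str, operation_type: &'static str) {
    let meter = global::meter("apollo-mcp-server");
    let durations = meter.f64_histogram("apollo.mcp.operation.duration").build();
    durations.record(
        duration_secs,
        &[
            KeyValue::new("operation_name", operation_name.to_string()),
            KeyValue::new("operation.type", operation_type), // "query" | "mutation" | "subscription"
        ],
    );
}
```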

Comment on lines 34 to 35
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |

Member:

This could also be a gauge, so you can monitor the number of client connections at any given time. You can probably mirror http.client.open_connections here (docs). This can be called something like apollo.mcp.client.open_connections. I am not sure if you would want both a gauge and an event though.
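For example, a sketch using an UpDownCounter (the instrument type behind http.client.open_connections), assuming the opentelemetry Rust crate; the track_connection hook is hypothetical:

```rust
use opentelemetry::{global, KeyValue};

// Hypothetical connection hook: +1 on connect, -1 on disconnect, so the
// current value of the UpDownCounter is the number of open connections.
fn track_connection(client_type: &'static str, connected: bool) {
    let meter = global::meter("apollo-mcp-server");
    let open_connections = meter
        .i64_up_down_counter("apollo.mcp.client.open_connections")
        .build();
    open_connections.add(
        if connected { 1 } else { -1 },
        &[KeyValue::new("client_type", client_type)],
    );
}
```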

| | Span: `serialization` | Trace | size_bytes, latency | Encoding/decoding JSON-RPC overhead | Nice to Have |
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
| | `apollo_mcp.config.reload` | Event | schema_source, version_hash | Config/schema/manifest/collection reload | Nice to Have |

Member:

How is this different from the first item on the list, apollo.mcp.config.load? As in, if you wanted to track reloads specifically, I'd add that as an attribute to apollo.mcp.config.load and extract it out of that.

Comment on lines 38 to 39
| **Resource Usage** | `process.cpu.usage` | Gauge | — | CPU usage of MCP server process | Nice to Have |
| | `process.memory.usage` | Gauge | — | Memory usage | Nice to Have |

Member:

Are you able to get this information directly in the MCP server implementation? That's pretty cool!

Contributor:

I'm not sure about this one. We are using many of the same SDKs as the router, and the process information might be better sourced from something like Kubernetes or wherever the system is hosted.

| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
| | `apollo_mcp.config.reload` | Event | schema_source, version_hash | Config/schema/manifest/collection reload | Nice to Have |
| | `apollo_mcp.auth.failed` | Event | client_type, reason | Auth failure | Must Have |

Member:

I would track this as an error type on the http.server.request.duration metric I mentioned a few comments ago. But if you do want a specific event, I'd count both successes and failures under a status attribute.
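A rough sketch of the first option, assuming the opentelemetry Rust crate and that the server records http.server.request.duration; record_request and the auth_failed flag are illustrative:

```rust
use opentelemetry::{global, KeyValue};

// Illustrative only: an auth failure shows up as error.type on the standard
// HTTP server duration metric instead of a dedicated apollo_mcp.auth.failed event.
fn record_request(duration_secs: f64, status_code: u16, auth_failed: bool) {
    let meter = global::meter("apollo-mcp-server");
    let duration = meter.f64_histogram("http.server.request.duration").build();

    let mut attrs = vec![KeyValue::new(
        "http.response.status_code",
        i64::from(status_code),
    )];
    if auth_failed {
        attrs.push(KeyValue::new("error.type", "auth_failed"));
    }
    duration.record(duration_secs, &attrs);
}
```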

| | `apollo_mcp.auth.failed` | Event | client_type, reason | Auth failure | Must Have |
| **Resource Usage** | `process.cpu.usage` | Gauge | — | CPU usage of MCP server process | Nice to Have |
| | `process.memory.usage` | Gauge | — | Memory usage | Nice to Have |
| | `network.bytes.sent` / `network.bytes.received` | Counter | — | Network traffic | Nice to Have |

Member:

This is usually tracked under http.client|server.request|response.body.size. Or are you proposing a different sort of aggregation of body size?

Contributor:

No, we should be using the semantic conventions for http metrics and are already logging http.server.request.size, etc.

I'll update these to reflect that.
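Something along these lines, assuming the opentelemetry Rust crate; the record_body_sizes helper and the POST attribute are illustrative, while the metric names come from the HTTP semantic conventions:

```rust
use opentelemetry::{global, KeyValue};

// Illustrative only: payload sizes use the HTTP semantic-convention histograms
// rather than custom network.bytes.sent / network.bytes.received counters.
fn record_body_sizes(request_bytes: u64, response_bytes: u64) {
    let meter = global::meter("apollo-mcp-server");
    let request_size = meter.u64_histogram("http.server.request.body.size").build();
    let response_size = meter.u64_histogram("http.server.response.body.size").build();

    let attrs = [KeyValue::new("http.request.method", "POST")];
    request_size.record(request_bytes, &attrs);
    response_size.record(response_bytes, &attrs);
}
```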

@kbychu kbychu (Member, Author) commented Sep 15, 2025

> @kbychu This PR might be a duplicate of #338

@DaleSeo absolutely, this is not meant to duplicate the docs, but I wanted to share thoughts on all the potential telemetry we might want and to have a place for discussion and reasoning. I'll close this PR at some point in the near future.

@kbychu kbychu (Member, Author) commented Sep 15, 2025

Thanks @lrlna!

| | `apollo.mcp.responses.characters` | Histogram | tool_name, client_type | Character count of response payloads (proxy for token estimation) | Nice to Have |
| | `apollo.mcp.clients.active` | Gauge || # of active MCP clients | Must Have |
| | `apollo.mcp.concurrency.current_requests` | Gauge || # of concurrent tool executions | Should Have |
| **Errors / Reliability** | `apollo.mcp.requests.errors` | Counter | error_type, tool_name, client_type | Failed tool calls (generic catch-all) | Must Have |

Contributor:

I think these errors should live within the counters/durations on the tool calls rather than be a separate metric.
