Telemetry Spec #352
Conversation
✅ Docs preview ready. The preview is ready to be viewed.
File Changes: 0 new, 4 changed, 0 removed
Build ID: 190b19c826d793a23fc800ec
URL: https://www.apollographql.com/docs/deploy-preview/190b19c826d793a23fc800ec
specs/telemetry.md
Outdated
| | `apollo_mcp.schema.load_success` / `apollo_mcp.schema.load_failure` | Counter | schema_source | Schema load status | Must Have |
| | `apollo_mcp.schema.size` | Gauge | — | # of types/fields in schema | Should Have |
| | `apollo_mcp.version.info` | Attribute/Event | server_version, schema_hash, manifest_version, manifest_source | Server binary version, GraphQL schema hash, manifest version, manifest type (persisted_query/operation_collection) | Must Have |
| **Usage** | `apollo.mcp.calls` | Counter | tool_name, success, error_code, client_type | Total tool invocations | Must Have |
What are your thoughts for `client_type`? AFAIK, that data is unfortunately only available to us during the initialization call.
specs/telemetry.md
Outdated
| Category | Metric / Trace / Event | Type | Attributes | Notes | Priority |
|---|---|---|---|---|---|
| **Configuration** | `apollo_mcp.config.load_success` | Counter | error_type | Successful config / startup loads | Must Have |
We have mostly been using `apollo.mcp` instead of `apollo_mcp` as the namespacing in our implementation so far.
I think you all should have one more look over a few of the OTEL docs before you officially decide how you're going to be aggregating this data. I think people will find a few things in what's currently in this PR quite surprising, both from an OTEL point of view and compared to what we have in the router.
Docs that you might find helpful to read:
- Naming conventions
- HTTP metrics
- System metrics
- Gen AI metrics, which, while not specific to MCP, still provide good inspiration
Any particular reason we are using `apollo_mcp` instead of `apollo.mcp`, like we do with `apollo.router` (docs)?
specs/telemetry.md
Outdated
| **Configuration** | `apollo_mcp.config.load_success` | Counter | error_type | Successful config / startup loads | Must Have |
| | `apollo_mcp.config.load_failure` | Counter | error_type | Failed startup (bad schema, manifest, endpoint) | Must Have |
It's a bit strange to have two separate metrics for `success` and `failure`. You'd usually have a `status` attribute on a metric, like an `apollo.mcp.config.load.count`, similar to the `system.process.count` described in the OTel docs. You can have an additional attribute of `error` for the error message or `code` for the error code.
specs/telemetry.md
Outdated
| **Usage** | `apollo.mcp.calls` | Counter | tool_name, success, error_code, client_type | Total tool invocations | Must Have |
| | `apollo.mcp.calls.latency` | Histogram | tool_name, success, error_code, client_type | End-to-end request latency | Must Have |
Are you going to record the latency for all the calls together, or is a single call going to be recorded as a single measurement in the histogram? If the latter, the semantic conventions advise using a singular noun (docs).
Instead of `apollo.mcp.calls.latency`, you should probably use `apollo.mcp.call.duration`, much like `http.server.request.duration` does (docs).
As for the `apollo.mcp.calls` counter, I am not sure you need a counter, since you would have the `apollo.mcp.call.duration` histogram, which you could just do a `count` on, for example:

> a single call is going to be recorded as a single call in the histogram

I'll update this to use the singular `call`.
Agree that the count can be derived via the histogram. Does it follow convention to not have a standalone counter in that case?
specs/telemetry.md
Outdated
| | `apollo.mcp.operation.calls` | Counter | tool_name, success, error_code, client_type, operation_name | # of backend GraphQL operations executed | Must Have |
| | `apollo.mcp.operation.latency` | Histogram | tool_name, success, error_code, client_type, operation_name | Latency of GraphQL backend call (excludes tool overhead) | Must Have |
Similar to the comment about `apollo.mcp.calls.latency`, this should also be `apollo.mcp.operation.duration`, and you don't need a separate counter.
specs/telemetry.md
Outdated
| | `apollo.mcp.calls.latency` | Histogram | tool_name, success, error_code, client_type | End-to-end request latency | Must Have |
| | `apollo.mcp.operation.calls` | Counter | tool_name, success, error_code, client_type, operation_name | # of backend GraphQL operations executed | Must Have |
| | `apollo.mcp.operation.latency` | Histogram | tool_name, success, error_code, client_type, operation_name | Latency of GraphQL backend call (excludes tool overhead) | Must Have |
| | `apollo_mcp.operation.type.mix` | Counter | query, mutation, subscription | Breakdown of operation types | Should Have |
The operation type should live as an attribute on `apollo.mcp.operation.duration`.
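As a sketch (plain Python, hypothetical names): with the operation type carried as an attribute on the duration histogram, the "type mix" is just a group-by over that attribute, so no separate `apollo_mcp.operation.type.mix` counter is needed.

```python
from collections import Counter

# Hypothetical sketch: each operation records (type, duration) into one
# histogram-like series, with type as an attribute.
operation_durations = []

def record_operation(operation_type, duration_ms):
    operation_durations.append((operation_type, duration_ms))

record_operation("query", 30.0)
record_operation("query", 45.0)
record_operation("mutation", 80.0)

# The operation-type breakdown is derived by grouping on the attribute.
type_mix = Counter(op_type for op_type, _ in operation_durations)
```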
specs/telemetry.md
Outdated
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
This could also be a gauge, so you can monitor the number of client connections at any given time. You can probably mirror `http.client.open_connections` here (docs). This can be called something like `apollo.mcp.client.open_connections`. I am not sure if you would want both a gauge and an event though.
specs/telemetry.md
Outdated
| | Span: `serialization` | Trace | size_bytes, latency | Encoding/decoding JSON-RPC overhead | Nice to Have |
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
| | `apollo_mcp.config.reload` | Event | schema_source, version_hash | Config/schema/manifest/collection reload | Nice to Have |
How is this different from the first item on the list, `apollo.mcp.config.load`? As in, if you wanted to track reloads specifically, I'd add that as an attribute to `apollo.mcp.config.load` and extract it out of that.
specs/telemetry.md
Outdated
| **Resource Usage** | `process.cpu.usage` | Gauge | — | CPU usage of MCP server process | Nice to Have |
| | `process.memory.usage` | Gauge | — | Memory usage | Nice to Have |
Are you able to get this information directly in the MCP server implementation? That's pretty cool!
This I'm not sure of. We are using many of the same SDKs as the router, and the process information might be better sourced from something like Kubernetes or wherever the system is hosted.
specs/telemetry.md
Outdated
| **Events** | `apollo_mcp.client.connected` | Event | client_type | Client connection established | Should Have |
| | `apollo_mcp.client.disconnected` | Event | client_type | Client disconnected | Should Have |
| | `apollo_mcp.config.reload` | Event | schema_source, version_hash | Config/schema/manifest/collection reload | Nice to Have |
| | `apollo_mcp.auth.failed` | Event | client_type, reason | Auth failure | Must Have |
I would track this as an error type on the `http.server.request.duration` metric I mentioned a few comments ago. But if you do want a specific event, I'd count both successes and failures under a `status` attribute.
specs/telemetry.md
Outdated
| | `apollo_mcp.auth.failed` | Event | client_type, reason | Auth failure | Must Have |
| **Resource Usage** | `process.cpu.usage` | Gauge | — | CPU usage of MCP server process | Nice to Have |
| | `process.memory.usage` | Gauge | — | Memory usage | Nice to Have |
| | `network.bytes.sent` / `network.bytes.received` | Counter | — | Network traffic | Nice to Have |
This is usually tracked under `http.client|server.request|response.body.size`. Or are you proposing a different sort of aggregation of body size?
No, we should be using the semantic conventions for HTTP metrics; we are already logging `http.server.request.size`, etc.
I'll update these to reflect that.
@DaleSeo absolutely, this is not meant to be a replica of the docs, but I wanted to share thoughts on all the potential telemetry we might want to have, and to have a place for discussion and things to reason about. I'll close this PR at some point in the near future.
Thanks @lrlna!
specs/telemetry.md
Outdated
| | `apollo.mcp.responses.characters` | Histogram | tool_name, client_type | Character count of response payloads (proxy for token estimation) | Nice to Have |
| | `apollo.mcp.clients.active` | Gauge | — | # of active MCP clients | Must Have |
| | `apollo.mcp.concurrency.current_requests` | Gauge | — | # of concurrent tool executions | Should Have |
| **Errors / Reliability** | `apollo.mcp.requests.errors` | Counter | error_type, tool_name, client_type | Failed tool calls (generic catch-all) | Must Have |
I think these errors should live within the counters/durations on the tool calls rather than being a separate metric.
Add telemetry spec for documentation and requirement discussion. Don't need to merge this but wanted to start writing things down.