A stage-controlled LangGraph workflow for investigating failures, grounding them in source code, and preparing GitHub issues or PRs.
Modern incident debugging is fragmented:
- traces live in observability systems
- code lives in GitHub
- diagnosis lives in somebody's head
- remediation gets rewritten again as an issue or PR
TracePilot compresses that loop into a single stateful graph:
- initialize the run
- inspect traces and logs
- synthesize a diagnosis
- pull repo context only when it is justified
- prepare fix actions
- optionally create a GitHub issue or PR
- produce a structured trace tree and final result
TracePilot is a Python package centered around TracePilotGraph. It accepts a typed request, builds runtime context, queries Jaeger, reasons over evidence, fetches GitHub source context when needed, and can execute one bounded GitHub action.
- Investigate incidents from a trace ID or trace-oriented prompt.
- Search and normalize Jaeger traces and span logs.
- Decide whether the problem is source-code related before touching the repo.
- Extract GitHub blob URLs from logs and fetch targeted code context.
- Prepare issue and PR payloads from the investigation output.
- Execute a single GitHub issue or PR action through a controlled subgraph.
- Return a structured state object with diagnosis, evidence, timeline, code context, and action results.
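The blob-URL extraction step in the list above can be illustrated with a small, standard-library sketch. This is not TracePilot's actual implementation: the `BLOB_URL` pattern, the `BlobRef` dataclass, and `extract_blob_refs` are hypothetical names, and the pattern assumes the ref is a single path segment (a branch such as `feature/x` would be split incorrectly).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern for https://github.com/<owner>/<repo>/blob/<ref>/<path>#L<start>-L<end>.
# Assumption: <ref> contains no slash, so "main" works but "feature/x" does not.
BLOB_URL = re.compile(
    r"https://github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)/blob/"
    r"(?P<ref>[^/\s]+)/(?P<path>[^\s#]+)"
    r"(?:#L(?P<start>\d+)(?:-L(?P<end>\d+))?)?"
)


@dataclass(frozen=True)
class BlobRef:
    """A file location parsed from a GitHub blob URL (illustrative type)."""
    owner: str
    repo: str
    ref: str
    path: str
    start_line: Optional[int]
    end_line: Optional[int]


def extract_blob_refs(log_text: str) -> List[BlobRef]:
    """Return every blob URL found in the log text, in order of appearance."""
    refs = []
    for match in BLOB_URL.finditer(log_text):
        start = int(match["start"]) if match["start"] else None
        # A single-line anchor (#L42) is treated as a one-line range.
        end = int(match["end"]) if match["end"] else start
        refs.append(
            BlobRef(match["owner"], match["repo"], match["ref"],
                    match["path"], start, end)
        )
    return refs
```

Keeping the line range alongside the path is what would let a later stage read only the code near the reported location instead of the whole file.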
START
  |
  v
initialize_run
  |
  v
observability_agent
  |
  v
diagnosis_synthesizer
  |
  +--> repo_context ----------+
  |                           |
  +--> skip_repo_context      |
                              v
                   fix_action_preparation
                              |
                 +------------+------------+
                 |                         |
                 v                         v
          github_action           skip_github_action
                 |                         |
                 +------------+------------+
                              v
                       build_trace_tree
                              |
                              v
                             END
- repo_context only runs when the diagnosis suggests code involvement or PR mode is requested.
- github_action only runs for issue/PR workflows.
- Each stage writes back into a shared typed GraphState, keeping the run inspectable and testable.
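The two routing rules can be written as pure functions over the state, which is roughly what a conditional edge in a graph framework would call. The sketch below is a plain-Python illustration, not the project's code: `RoutingInputs`, `route_repo_context`, and `route_github_action` are hypothetical names, and the mode value `"issue"` is an assumption (only `"diagnose"` and `"pr"` appear in this README).

```python
from dataclasses import dataclass


@dataclass
class RoutingInputs:
    """Minimal stand-in for the parts of the graph state that routing reads."""
    requested_mode: str   # assumed values: "diagnose", "issue", "pr"
    code_related: bool    # whether the diagnosis points at source code


def route_repo_context(inputs: RoutingInputs) -> str:
    """Fetch repo context only if the diagnosis implicates code or a PR is requested."""
    if inputs.code_related or inputs.requested_mode == "pr":
        return "repo_context"
    return "skip_repo_context"


def route_github_action(inputs: RoutingInputs) -> str:
    """Run the GitHub action stage only for issue or PR workflows."""
    if inputs.requested_mode in ("issue", "pr"):
        return "github_action"
    return "skip_github_action"
```

Because each router returns the name of the next node and has no side effects, the branching logic can be unit-tested without running Jaeger, GitHub, or a model.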
tracepilot/
  graph.py            # top-level graph orchestration
  state/models.py     # typed request, runtime, and graph state
  nodes/              # stage wrappers
  subgraphs/          # repo-context and GitHub-action subflows
  services/           # Jaeger, GitHub, LLM, limits, credentials
tests/
  test_graph.py
  test_*              # unit coverage across nodes, clients, subgraphs
  e2e/                # runnable demo scripts
scripts/
  run_graph.py
docker-compose.yml    # local observability stack support
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
Use one model provider key:
export OPENAI_API_KEY=...
# or
export ANTHROPIC_API_KEY=...
Optional service credentials:
export GITHUB_TOKEN=...
export TRACEPILOT_OBSERVABILITY_TOKEN=...
export TRACEPILOT_JAEGER_BASE_URL=http://localhost:16686
python3 tests/e2e/run_graph_with_jaeger.py
That script:
- waits for Jaeger
- emits a demo trace through OTLP
- runs TracePilotGraph
- prints the final structured state as JSON
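The "waits for Jaeger" step is a readiness poll. A minimal sketch of that idea follows, assuming a plain HTTP GET against the Jaeger base URL is an acceptable health signal; the demo script may do this differently. `http_probe` and `wait_until_ready` are hypothetical helpers, and the probe is injectable so the retry loop can be tested without a running Jaeger.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

With the local stack running, a call such as `wait_until_ready(lambda: http_probe("http://localhost:16686"))` would block until the Jaeger UI responds or the attempts run out.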
python3 tests/e2e/run_repo_context_subgraph.py \
--git-url "https://github.com/<owner>/<repo>/blob/main/app.py#L42" \
--mode pr \
--message "Investigate the code path linked from the logs."
from tracepilot import TracePilotGraph
from tracepilot.state import GraphState, RunRequest
state = GraphState(
request=RunRequest(
message="Investigate checkout latency service:checkout",
trace_id="your-trace-id",
requested_mode="diagnose",
)
)
result = TracePilotGraph().run(state)
print(result.diagnosis)
print(result.final_response)
print(result.github_result)
TracePilot normalizes provider and model settings from either the request or environment:
- default provider: openai
- default OpenAI model: gpt-4.1-mini
- default Anthropic model: claude-3-5-sonnet-20241022
Useful env vars:
- TRACEPILOT_MODEL_PROVIDER
- TRACEPILOT_MODEL
- TRACEPILOT_MODEL_TEMPERATURE
- TRACEPILOT_MODEL_MAX_OUTPUT_TOKENS
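One way to picture the normalization is a resolver that checks the request, then the environment, then the documented defaults. The sketch below assumes that precedence order (the README only says settings come from "either the request or environment"), and `ModelSettings`, `resolve_model_settings`, and the request keys are hypothetical names rather than the package's API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_PROVIDER = "openai"
DEFAULT_MODELS = {
    "openai": "gpt-4.1-mini",
    "anthropic": "claude-3-5-sonnet-20241022",
}


@dataclass(frozen=True)
class ModelSettings:
    provider: str
    model: str
    temperature: Optional[float]
    max_output_tokens: Optional[int]


def _first_set(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def resolve_model_settings(
    request: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> ModelSettings:
    """Resolve settings with assumed precedence: request, then env, then defaults."""
    provider = str(
        _first_set(request.get("provider"), env.get("TRACEPILOT_MODEL_PROVIDER"), DEFAULT_PROVIDER)
    ).strip().lower()
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unsupported model provider: {provider!r}")

    model = str(
        _first_set(request.get("model"), env.get("TRACEPILOT_MODEL"), DEFAULT_MODELS[provider])
    )
    temperature = _first_set(request.get("temperature"), env.get("TRACEPILOT_MODEL_TEMPERATURE"))
    max_tokens = _first_set(
        request.get("max_output_tokens"), env.get("TRACEPILOT_MODEL_MAX_OUTPUT_TOKENS")
    )
    return ModelSettings(
        provider=provider,
        model=model,
        temperature=float(temperature) if temperature is not None else None,
        max_output_tokens=int(max_tokens) if max_tokens is not None else None,
    )
```

Passing the environment as a mapping keeps the resolver deterministic in tests, since a plain dict can stand in for `os.environ`.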
The graph is built to stay controlled rather than open-ended.
- tool-call limits are resolved into execution limits
- GitHub creation limits default to one issue and one PR attempt
- repo-context reads are targeted around extracted file locations
- GitHub action execution is constrained to one selected payload
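The limits above amount to a per-action budget that is checked before any mutation happens. A minimal sketch of that pattern, assuming the documented defaults of one issue and one PR attempt; `ExecutionLimits`, `LimitExceeded`, and the action names `github_issue` and `github_pr` are hypothetical and not taken from the package.

```python
from dataclasses import dataclass, field
from typing import Dict


class LimitExceeded(RuntimeError):
    """Raised when an action would exceed its configured budget."""


@dataclass
class ExecutionLimits:
    # Assumed defaults mirroring the README: one issue and one PR attempt per run.
    caps: Dict[str, int] = field(
        default_factory=lambda: {"github_issue": 1, "github_pr": 1}
    )
    used: Dict[str, int] = field(default_factory=dict)

    def consume(self, action: str) -> None:
        """Record one use of `action`, or raise if its budget is already spent."""
        cap = self.caps.get(action, 0)  # unknown actions get no budget
        count = self.used.get(action, 0)
        if count >= cap:
            raise LimitExceeded(f"{action}: limit of {cap} reached")
        self.used[action] = count + 1
```

Calling `consume` before the GitHub request, rather than after it, means a retry loop cannot create a second issue or PR even if the first attempt's outcome is unclear.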
Unit tests cover the graph, clients, nodes, and subgraphs.
python3 -m unittest discover -s tests
If that fails locally, the likely cause is missing package dependencies such as langgraph or langchain_core.
TracePilot is not a chat wrapper around observability APIs. It is a staged investigation system with explicit routing:
- observability first
- source context only when earned
- GitHub mutation only when requested
- structured state all the way through
Explicit routing and bounded actions make each run easier to inspect and test than a free-form agent loop.