Description
Background
The current skills lack adequate coverage in two areas, both of which come up repeatedly in deploy and incident workflows:
- Component lifecycle — listing, inspecting, and dropping components during deploys or when troubleshooting rogue packages
- Per-node operations — targeting individual cluster nodes rather than the cluster URL, which only exposes the load balancer
Problem 1: Component lifecycle management
When a deploy fails or a stale component is causing problems, the current skills don't provide enough coverage to diagnose or resolve the issue without manual intervention.
Gaps:
- No skill for listing installed components and their package names across the cluster
- No skill for dropping a specific component by package name
- No skill to check component status (installed vs. loading vs. errored)
- Package name mapping is unclear — it's not always obvious which deploy name maps to which component as Harper sees it
Skills to add or improve
List components
GET /components
Should return component name, package name, version, status, and whether it's running on all nodes or a subset.
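As a rough sketch of the output this skill could produce, the snippet below flattens a hypothetical `GET /components` response into one row per component and flags components that are missing from some nodes. The response shape (`components`, `name`, `package`, `version`, `status`, `nodes`) is an assumption made for illustration; this issue does not specify Harper's actual response format.

```python
from dataclasses import dataclass


@dataclass
class ComponentRow:
    name: str
    package: str
    version: str
    status: str
    on_all_nodes: bool
    missing_nodes: list


def summarize_components(payload, cluster_nodes):
    """Flatten an ASSUMED GET /components payload into one row per component.

    payload is expected to look like:
        {"components": [{"name", "package", "version", "status", "nodes"}, ...]}
    cluster_nodes is the full list of node names the cluster should have.
    """
    expected = set(cluster_nodes)
    rows = []
    for item in payload.get("components", []):
        present = set(item.get("nodes", []))
        # Nodes that should run the component but did not report it.
        missing = sorted(expected - present)
        rows.append(ComponentRow(
            name=item["name"],
            package=item.get("package", ""),
            version=item.get("version", "unknown"),
            status=item.get("status", "unknown"),
            on_all_nodes=not missing,
            missing_nodes=missing,
        ))
    return rows
```

Returning the missing nodes alongside the boolean keeps the "all nodes or a subset" answer actionable: the operator can see which nodes to look at next.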
Drop a component
DELETE /components/{componentName}
Skill should confirm the component name before executing and note which nodes it was removed from.
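One way the confirmation step could work is to require the caller to echo the component name back before any request is built. The sketch below only constructs the `DELETE` request for the endpoint proposed above and does not send it. The function names and the `removed_from` response field are hypothetical, not part of any confirmed Harper API.

```python
import urllib.parse
import urllib.request


def build_drop_request(base_url, component_name, confirmed_name):
    """Build (but do not send) a DELETE for the proposed /components/{name} endpoint.

    The caller must repeat the component name as confirmed_name; a mismatch
    aborts before any request object exists.
    """
    if not component_name or component_name != confirmed_name:
        raise ValueError(
            f"confirmation mismatch: asked to drop {component_name!r}, "
            f"confirmed {confirmed_name!r}"
        )
    # Percent-encode the name so characters like '/' or spaces stay in one path segment.
    path = urllib.parse.quote(component_name, safe="")
    url = f"{base_url.rstrip('/')}/components/{path}"
    return urllib.request.Request(url, method="DELETE")


def removed_from(response_body):
    """Read the node list from an ASSUMED response shape: {"removed_from": [...]}."""
    return list(response_body.get("removed_from", []))
```

Separating request construction from sending makes the destructive step easy to review or dry-run before it reaches the cluster.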
Inspect a component
GET /components/{componentName}
Returns config, status, and any load errors. Useful for diagnosing a component that deployed but isn't serving correctly.
Troubleshooting guide for rogue components
Add a skill or skill section covering the case where a deploy succeeds but an old component version is still running: how to confirm that the right version is active, and how to force-remove the stale one.
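The verification half of that pattern might look like the following minimal sketch. It assumes the skill has already collected the running version of the component from each node (for example via per-node inspect calls) into a plain dictionary; how that data is gathered, and the function names used here, are hypothetical.

```python
def find_stale_nodes(expected_version, versions_by_node):
    """Return nodes whose running version differs from the one just deployed.

    A node reporting None (component not found) also counts as stale, since
    the expected version is not active there either.
    """
    return sorted(
        node for node, version in versions_by_node.items()
        if version != expected_version
    )


def verify_deploy(component, expected_version, versions_by_node):
    """Summarize whether the expected version is active on every node."""
    stale = find_stale_nodes(expected_version, versions_by_node)
    return {
        "component": component,
        "expected_version": expected_version,
        "active_everywhere": not stale,
        "stale_nodes": stale,
    }
```

The `stale_nodes` list is the natural input for a follow-up drop or restart scoped to just those nodes.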
Problem 2: Per-node cluster operations
Current skills target the cluster URL. That works for most operations, but incidents often require node-level visibility: a single node with high I/O, a node that didn't pick up a deploy, or a node stuck in an unexpected state.
Gaps:
- No skill for listing individual nodes in a cluster with their addresses
- No skill for health-checking a specific node directly (bypassing the load balancer)
- No skill for targeting a component drop or restart at a specific node
- When observability tools (Datadog, Grafana, logs) surface a specific node as the problem, there's no skill path that goes from "node name" to "node URL" to "targeted operation"
Skills to add or improve
List cluster nodes
GET /cluster
Should return each node's name, URL/address, role, and reachability status. Skill should surface the per-node URL so subsequent calls can target it directly.
Health check a specific node
GET {nodeUrl}/health (or equivalent)
Skill should accept a node URL (from the list above) and return its status independently of the cluster load balancer.
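A minimal sketch of such a check, using only the Python standard library, is shown below. The `/health` path is the placeholder from this issue and may differ in practice; the `fetch` parameter and the shape of the returned dictionary are illustrative choices, not an existing interface.

```python
import urllib.error
import urllib.request


def http_get(url, timeout=5.0):
    """Return (status_code, body_text); network failures are reported as status 0."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        # The node answered, but with an error status (e.g. 503).
        return exc.code, exc.read().decode("utf-8", "replace")
    except OSError as exc:
        # Covers URLError, timeouts, and refused connections: node unreachable.
        return 0, str(exc)


def check_node_health(node_url, fetch=http_get):
    """Call the node's own health path directly, bypassing the load balancer."""
    url = f"{node_url.rstrip('/')}/health"
    status, body = fetch(url)
    return {
        "node_url": node_url,
        "checked_url": url,
        "healthy": status == 200,
        "http_status": status,
        "detail": body[:200],
    }
```

Reporting an unreachable node as status 0 rather than raising lets a skill check every node in a loop and still return a complete picture when one of them is down.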
Target a component operation to a specific node
When dropping or restarting a component, the skill should support an optional nodeUrl parameter to scope the operation to a single node rather than the full cluster.
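The optional parameter could be as simple as choosing which base URL the request is sent to. The sketch below assumes that "targeting a node" means issuing the same call against that node's URL instead of the cluster URL; whether an operation sent to one node stays local to that node is not established in this issue, and the function name is hypothetical.

```python
def operation_target(cluster_url, path, node_url=None):
    """Pick the URL and scope for an operation.

    With node_url omitted the call goes to the cluster URL (load balancer);
    with node_url set it goes straight to that node.
    """
    base = node_url if node_url else cluster_url
    return {
        "url": f"{base.rstrip('/')}/{path.lstrip('/')}",
        "scope": "node" if node_url else "cluster",
    }
```

Returning the scope next to the URL lets the skill state in its output whether the operation touched one node or the whole cluster.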
Node-to-URL resolution helper
A utility skill pattern that takes a node name (e.g., web-03, as it appears in Datadog or Grafana) and resolves it to the node's Harper API URL. This closes the gap between observability tool output and operational skills.
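A resolver along these lines could be sketched as follows. It assumes the node list has already been fetched (for example from the proposed `GET /cluster` call) as dictionaries with `name` and `url` keys, and that observability tools may report either a short host name or a fully qualified one. Both assumptions, and the function names, are illustrative.

```python
def _short_name(name):
    """Normalize 'Web-03.internal.example.com' and 'web-03' to 'web-03'."""
    return name.strip().lower().split(".")[0]


def resolve_node_url(node_name, nodes):
    """Map a node name as seen in Datadog or Grafana to that node's API URL.

    nodes is an ASSUMED list of {"name": ..., "url": ...} entries.
    Raises LookupError when the name matches no node or more than one.
    """
    wanted = _short_name(node_name)
    matches = [n for n in nodes if _short_name(n["name"]) == wanted]
    if not matches:
        raise LookupError(f"no cluster node matches {node_name!r}")
    if len(matches) > 1:
        names = ", ".join(n["name"] for n in matches)
        raise LookupError(f"{node_name!r} is ambiguous: {names}")
    return matches[0]["url"]
```

Failing loudly on an ambiguous match is deliberate: during an incident, acting on the wrong node is worse than asking the operator to disambiguate.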
Context
These gaps surface most often in two scenarios:
- Deploy troubleshooting — Claude Code can execute a deploy but can't verify which component version is now running, or remove a stuck/rogue version without manual confirmation from the operator.
- Incident response — when observability tools identify a specific node as the problem, there's no skill path to take targeted action against that node; skills only expose the cluster endpoint.
Labels
skills, components, cluster, incident-response, deploy