Improve skills: component lifecycle management and per-node cluster operations #9

## Background

The current skills lack adequate coverage in two areas that come up repeatedly in deploy and incident workflows:

1. **Component lifecycle** — listing, inspecting, and dropping components during deploys or when troubleshooting rogue packages
2. **Per-node operations** — targeting individual cluster nodes rather than the cluster URL, which only exposes the load balancer

---

## Problem 1: Component lifecycle management

When a deploy fails or a stale component is causing problems, the current skills don't provide enough coverage to diagnose or resolve the issue without manual intervention.

Gaps:
- No skill for listing installed components and their package names across the cluster
- No skill for dropping a specific component by package name
- No skill to check component status (installed vs. loading vs. errored)
- Package name mapping is unclear — it's not always obvious which deploy name maps to which component as Harper sees it

### Skills to add or improve

**List components**
```
GET /components
```
Should return component name, package name, version, status, and whether it's running on all nodes or a subset.
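
To illustrate the "all nodes or a subset" part, here is a minimal sketch. It assumes the list response can be reduced to records with a name, package, version, status, and the nodes reporting the component. The record shape and every field name are hypothetical and would need to be checked against what Harper actually returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComponentRecord:
    # Hypothetical shape; these field names are assumptions, not Harper's schema.
    name: str
    package: str
    version: str
    status: str        # e.g. "installed", "loading", "errored"
    nodes: List[str]   # names of the nodes reporting this component


def coverage(record: ComponentRecord, cluster_nodes: List[str]) -> str:
    """Return "all" if every cluster node reports the component, else "subset"."""
    missing = set(cluster_nodes) - set(record.nodes)
    return "all" if not missing else "subset"


def summarize(records: List[ComponentRecord], cluster_nodes: List[str]) -> List[str]:
    """Build one printable row per component for the skill's output."""
    return [
        f"{r.name} ({r.package}@{r.version}) status={r.status} "
        f"nodes={coverage(r, cluster_nodes)}"
        for r in records
    ]
```

Showing the package name next to the component name in each row would also help with the name-mapping gap listed above.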

**Drop a component**
```
DELETE /components/{componentName}
```
Skill should confirm the component name before executing and note which nodes it was removed from.
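
A rough sketch of the confirm-then-drop flow, with the transport left as a hook. `confirm` and `send_delete` are hypothetical callables introduced for illustration; the assumption that the delete call reports the affected nodes is also unverified.

```python
from typing import Callable, List


def drop_component(
    name: str,
    installed: List[str],
    confirm: Callable[[str], bool],
    send_delete: Callable[[str], List[str]],
) -> List[str]:
    """Drop `name` only if it is installed and the operator confirms.

    `send_delete` is a hypothetical hook that issues the DELETE request and
    returns the names of the nodes the component was removed from.
    """
    if name not in installed:
        # Refuse to send a DELETE for a name the cluster does not report.
        raise ValueError(f"unknown component: {name!r}")
    if not confirm(name):
        return []  # operator declined; nothing was removed
    return send_delete(f"/components/{name}")
```

Checking the name against the installed list first means a typo fails locally before any DELETE is sent.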

**Inspect a component**
```
GET /components/{componentName}
```
Returns config, status, and any load errors. Useful for diagnosing a component that deployed but isn't serving correctly.

**Troubleshooting guide for rogue components**

Add a skill or skill section covering this pattern: the deploy succeeds, but the old component version is still running. It should show how to confirm that the right version is active and how to force-remove the stale one.
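
The "confirm the right version is active" step could look something like the sketch below. It assumes the skill has already collected a node-name-to-version mapping for the component; how that mapping is gathered is not shown and depends on the per-node skills in Problem 2.

```python
from typing import Dict, List


def find_stale_nodes(expected_version: str, versions_by_node: Dict[str, str]) -> List[str]:
    """Return the nodes whose reported version differs from the deployed one."""
    return sorted(
        node
        for node, version in versions_by_node.items()
        if version != expected_version
    )
```

An empty result means every node reports the expected version. Otherwise the returned node names are the candidates for a targeted drop.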

---

## Problem 2: Per-node cluster operations

Current skills target the cluster URL. That works for most operations, but incidents often require node-level visibility: a single node with high I/O, a node that didn't pick up a deploy, or a node stuck in an unexpected state.

Gaps:
- No skill for listing individual nodes in a cluster with their addresses
- No skill for health-checking a specific node directly (bypassing the load balancer)
- No skill for targeting a component drop or restart at a specific node
- When observability tools (Datadog, Grafana, logs) surface a specific node as the problem, there's no skill path that goes from "node name" to "node URL" to "targeted operation"

### Skills to add or improve

**List cluster nodes**
```
GET /cluster
```
Should return each node's name, URL/address, role, and reachability status. Skill should surface the per-node URL so subsequent calls can target it directly.

**Health check a specific node**
```
GET {nodeUrl}/health  (or equivalent)
```
Skill should accept a node URL (from the list above) and return its status independently of the cluster load balancer.
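
A minimal sketch of a direct node check using only the Python standard library. The `/health` path is taken from the placeholder above and may not be the real endpoint. The result shape is an assumption as well.

```python
import urllib.request
from typing import Callable


def http_get(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def check_node(node_url: str, fetch: Callable[[str], str] = http_get) -> dict:
    """Query one node directly, bypassing the cluster load balancer."""
    url = node_url.rstrip("/") + "/health"  # placeholder path, not confirmed
    try:
        body = fetch(url)
    except OSError as exc:  # covers urllib's URLError, timeouts, refused connections
        return {"node": node_url, "reachable": False, "detail": str(exc)}
    return {"node": node_url, "reachable": True, "detail": body}
```

Passing `fetch` as a parameter keeps the check testable without a live node. An unreachable node comes back as a result rather than an exception, so a loop over all nodes does not stop at the first failure.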

**Target a component operation to a specific node**
When dropping or restarting a component, the skill should support an optional `nodeUrl` parameter to scope the operation to a single node rather than the full cluster.
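
One possible way to implement the optional parameter is to choose the base URL in a single place. The function name and signature below are hypothetical.

```python
from typing import Optional


def operation_url(cluster_url: str, path: str, node_url: Optional[str] = None) -> str:
    """Build the request URL, scoped to one node when `node_url` is given."""
    base = node_url if node_url is not None else cluster_url
    # Normalize slashes so "base/" + "/path" does not produce "//".
    return base.rstrip("/") + "/" + path.lstrip("/")
```

With `node_url` omitted, existing cluster-wide behaviour is unchanged.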

**Node-to-URL resolution helper**
A utility skill pattern that takes a node name (e.g., `web-03`, as it appears in Datadog or Grafana) and resolves it to the node's Harper API URL. This closes the gap between observability tool output and operational skills.
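
A sketch of the resolution step, assuming the node list is available as dictionaries with `name` and `url` keys. Those key names are hypothetical, as is the assumption that the name shown in Datadog or Grafana matches the Harper node name apart from case and surrounding whitespace.

```python
from typing import Dict, List


def resolve_node_url(node_name: str, nodes: List[Dict[str, str]]) -> str:
    """Map a node name from an observability tool to that node's API URL."""
    wanted = node_name.strip().lower()
    matches = [n["url"] for n in nodes if n["name"].lower() == wanted]
    if len(matches) != 1:
        # Fail loudly instead of guessing when the name is missing or ambiguous.
        raise LookupError(
            f"expected exactly one node named {node_name!r}, found {len(matches)}"
        )
    return matches[0]
```

Raising on zero or multiple matches is deliberate: during an incident, acting on the wrong node is worse than stopping to ask.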

---

## Context

These gaps surface most often in two scenarios:

1. **Deploy troubleshooting** — Claude Code can execute a deploy but can't verify which component version is now running, or remove a stuck/rogue version without manual confirmation from the operator.
2. **Incident response** — when observability tools identify a specific node as the problem, there's no skill path to take targeted action against that node; skills only expose the cluster endpoint.

## Labels

`skills`, `components`, `cluster`, `incident-response`, `deploy`
