Improve skills: component lifecycle management and per-node cluster operations #9

@hdbjeff

Background

Two areas in the current skills lack adequate coverage, both of which come up repeatedly in deploy and incident workflows:

  1. Component lifecycle — listing, inspecting, and dropping components during deploys or when troubleshooting rogue packages
  2. Per-node operations — targeting individual cluster nodes rather than the cluster URL, which only exposes the load balancer

Problem 1: Component lifecycle management

When a deploy fails or a stale component is causing problems, the current skills don't provide enough coverage to diagnose or resolve the issue without manual intervention.

Gaps:

  • No skill for listing installed components and their package names across the cluster
  • No skill for dropping a specific component by package name
  • No skill to check component status (installed vs. loading vs. errored)
  • Package name mapping is unclear — it's not always obvious which deploy name maps to which component as Harper sees it

Skills to add or improve

List components

GET /components

Should return component name, package name, version, status, and whether it's running on all nodes or a subset.
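As a rough illustration of the "all nodes or a subset" field, the sketch below post-processes a listing payload into per-component summaries. The record keys (`name`, `package`, `version`, `status`, `nodes`) and the names `ComponentSummary` and `summarize_components` are assumptions made for this example, not Harper's actual response shape.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ComponentSummary:
    name: str
    package: str
    version: str
    status: str
    missing_nodes: Tuple[str, ...]  # empty tuple means "running on all nodes"


def summarize_components(records: Iterable[dict], all_nodes: Iterable[str]) -> List[ComponentSummary]:
    """Turn raw listing records into summaries that flag partial rollouts."""
    # Record keys are hypothetical; adjust to whatever the listing call really returns.
    cluster = set(all_nodes)
    summaries = []
    for rec in records:
        # Nodes in the cluster that do not report this component.
        missing = tuple(sorted(cluster - set(rec.get("nodes", []))))
        summaries.append(
            ComponentSummary(rec["name"], rec["package"], rec["version"], rec["status"], missing)
        )
    return summaries
```

A component whose `missing_nodes` is non-empty is running on a subset only, which is the case the skill would want to call out.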

Drop a component

DELETE /components/{componentName}

Skill should confirm the component name before executing and note which nodes it was removed from.
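One way the confirm-then-report behavior could look, as a minimal sketch: the caller passes the name the operator re-typed, and the function records which nodes the drop succeeded on. The `delete` callable is a hypothetical transport hook standing in for whatever request the skill ends up making; `drop_component` and its return keys are likewise illustrative.

```python
from typing import Callable, Dict, Iterable, List


def drop_component(
    component_name: str,
    confirmed_name: str,
    node_urls: Iterable[str],
    delete: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Drop a component from each node, but only after an exact-name confirmation."""
    # Refuse to run unless the operator re-typed the exact component name.
    if component_name != confirmed_name:
        raise ValueError(
            f"confirmation {confirmed_name!r} does not match {component_name!r}"
        )
    removed: List[str] = []
    failed: List[str] = []
    for url in node_urls:
        # `delete` is a hypothetical hook: (node_url, name) -> True on success.
        (removed if delete(url, component_name) else failed).append(url)
    return {"component": component_name, "removed_from": removed, "failed_on": failed}
```

Injecting `delete` keeps the confirmation and reporting logic testable without a live cluster.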

Inspect a component

GET /components/{componentName}

Returns config, status, and any load errors. Useful for diagnosing a component that deployed but isn't serving correctly.

Troubleshooting guide for rogue components

Add a skill or skill section covering a common failure pattern: the deploy succeeds, but the old component version is still running. It should show how to confirm which version is active on each node and how to force-remove the stale one.
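The version check at the heart of that pattern can be sketched as a comparison between the version the deploy was supposed to install and what each node reports. The input mapping (node name to reported version, with `None` for a node that does not report the component) and the name `find_stale_nodes` are assumptions for illustration.

```python
from typing import Dict, Optional


def find_stale_nodes(
    expected_version: str, version_by_node: Dict[str, Optional[str]]
) -> Dict[str, Optional[str]]:
    """Return the nodes whose reported version differs from the expected one."""
    # None means the node did not report the component at all.
    return {
        node: version
        for node, version in sorted(version_by_node.items())
        if version != expected_version
    }
```

An empty result means every node runs the expected version; anything else is the list of nodes to inspect or clean up.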


Problem 2: Per-node cluster operations

Current skills target the cluster URL. That works for most operations, but incidents often require node-level visibility: a single node with high I/O, a node that didn't pick up a deploy, or a node stuck in an unexpected state.

Gaps:

  • No skill for listing individual nodes in a cluster with their addresses
  • No skill for health-checking a specific node directly (bypassing the load balancer)
  • No skill for targeting a component drop or restart at a specific node
  • When observability tools (Datadog, Grafana, logs) surface a specific node as the problem, there's no skill path that goes from "node name" to "node URL" to "targeted operation"

Skills to add or improve

List cluster nodes

GET /cluster

Should return each node's name, URL/address, role, and reachability status. Skill should surface the per-node URL so subsequent calls can target it directly.

Health check a specific node

GET {nodeUrl}/health  (or equivalent)

Skill should accept a node URL (from the list above) and return its status independently of the cluster load balancer.
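A minimal sketch of a direct node check, assuming the placeholder `/health` path above and a hypothetical `fetch_status` hook that returns an HTTP status code or raises `OSError` on connection failure. Neither the path nor the result keys are confirmed Harper behavior.

```python
from typing import Callable, Dict


def health_url(node_url: str) -> str:
    """Build the health endpoint for one node."""
    # "/health" mirrors the placeholder path above; the real path may differ.
    return node_url.rstrip("/") + "/health"


def check_node(node_url: str, fetch_status: Callable[[str], int]) -> Dict[str, object]:
    """Probe a single node directly, bypassing the cluster URL."""
    url = health_url(node_url)
    try:
        # `fetch_status` is a hypothetical hook: url -> HTTP status code.
        code = fetch_status(url)
    except OSError as exc:
        # Connection refused, timeout, DNS failure, and similar.
        return {"node": node_url, "reachable": False, "healthy": False, "detail": str(exc)}
    return {
        "node": node_url,
        "reachable": True,
        "healthy": 200 <= code < 300,
        "detail": f"HTTP {code}",
    }
```

Separating "reachable" from "healthy" lets the skill distinguish a node that is down from one that is up but reporting a problem.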

Target a component operation to a specific node

When dropping or restarting a component, the skill should support an optional nodeUrl parameter to scope the operation to a single node rather than the full cluster.
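The optional-parameter behavior could be sketched as a small target selector: no `node_url` means every node, while a given `node_url` must match a known node. The function name and the choice to reject unknown URLs are assumptions for this example.

```python
from typing import List, Optional, Sequence


def select_targets(cluster_node_urls: Sequence[str], node_url: Optional[str] = None) -> List[str]:
    """Pick the nodes an operation should run against."""
    if node_url is None:
        # No scope given: operate on the full cluster.
        return list(cluster_node_urls)
    normalized = node_url.rstrip("/")
    known = {url.rstrip("/") for url in cluster_node_urls}
    if normalized not in known:
        # Guard against typos that would otherwise send the call to an arbitrary host.
        raise ValueError(f"{node_url!r} is not a known node in this cluster")
    return [normalized]
```

Rejecting unknown URLs is a design choice here; a skill could instead warn and proceed if nodes outside the listed cluster are legitimate targets.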

Node-to-URL resolution helper

A utility skill pattern that takes a node name (e.g., web-03, as it appears in Datadog or Grafana) and resolves it to the node's Harper API URL. This closes the gap between observability tool output and operational skills.
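A sketch of such a helper, assuming the node listing yields records with `name` and `url` keys (hypothetical) and that observability tools may report either a short hostname or a fully qualified one. The matching rule (case-insensitive comparison on the first hostname label) is an assumption and may need tightening for clusters with similar names across domains.

```python
from typing import Iterable, Mapping


def resolve_node_url(node_name: str, nodes: Iterable[Mapping[str, str]]) -> str:
    """Map a node name from an observability tool to that node's API URL."""
    # Compare on the lowercase short hostname so "web-03" matches "WEB-03.internal.example".
    wanted = node_name.strip().lower().split(".")[0]
    matches = [
        node["url"]
        for node in nodes
        if node["name"].strip().lower().split(".")[0] == wanted
    ]
    if not matches:
        raise LookupError(f"no cluster node named {node_name!r}")
    if len(matches) > 1:
        # Refuse to guess when two nodes share a short name.
        raise LookupError(f"node name {node_name!r} is ambiguous: {matches}")
    return matches[0]
```

The returned URL can then be passed to a node-scoped health check or component operation.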


Context

These gaps surface most often in two scenarios:

  1. Deploy troubleshooting — Claude Code can execute a deploy but can't verify which component version is now running, or remove a stuck/rogue version without manual confirmation from the operator.
  2. Incident response — when observability tools identify a specific node as the problem, there's no skill path to take targeted action against that node; skills only expose the cluster endpoint.

Labels

skills, components, cluster, incident-response, deploy
