Skip to content

Add azuremonitormetrics toolset & azuremonitorlogs toolset (both toolsets for CLI only) #782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
f5e654f
merge to main from Vishwa/azuremonitor-1 (#1)
vishiy Jul 16, 2025
eac1032
add azuremonitor alert investigation (#2)
vishiy Jul 16, 2025
791ad74
fetch alert query
vishiy Jul 17, 2025
e1fd01d
merge from vishwa/azuremonitor-1 branch to master (to get azmon runbo…
vishiy Jul 19, 2025
bdd3c21
Revert "merge from vishwa/azuremonitor-1 branch to master (to get azm…
vishiy Jul 19, 2025
1b74ca9
merge from vishwa/logs to master (#4)
vishiy Jul 19, 2025
e05d5d2
fix alert time and alert id
vishiy Jul 19, 2025
8302815
move runbooks from azuremontitor-1 branch to master
vishiy Jul 19, 2025
217035f
add missed dns prom runbook
vishiy Jul 20, 2025
313f2f3
optimize azure monitor logs cost
vishiy Jul 20, 2025
ac11bb1
Small change to alert output.
vishiy Jul 20, 2025
f8c43ac
fix analysis date for alerts
vishiy Jul 20, 2025
b19ab1a
fix missing alerts of same type.
vishiy Jul 22, 2025
db2d30d
update cost analysis runbook
vishiy Jul 23, 2025
5213f42
some runbook changes (#5)
austonli Jul 23, 2025
6be75d0
fix queries for cost to be case insensitive.
vishiy Jul 23, 2025
df5adfc
Merge remote-tracking branch 'upstream/master' into vishwa/merge
vishiy Aug 2, 2025
8870056
regenerate poetry.lock
vishiy Aug 2, 2025
1a80d53
disable azuremontorlogs toolset
vishiy Aug 3, 2025
6ce9e35
add unites for azuremonitormetrics
vishiy Aug 4, 2025
67d788a
Merge branch 'master' into vishwa/merge
vishiy Aug 4, 2025
71ea68f
PR feedbacks
vishiy Aug 5, 2025
3676c0c
Add comprehensive unit tests for Azure Monitor Metrics toolset
vishiy Aug 5, 2025
d8816c8
PR feedbacks
vishiy Aug 5, 2025
7edd5b1
Merge branch 'master' into vishwa/merge
vishiy Aug 5, 2025
4fb2bfb
Merge branch 'master' into vishwa/merge
vishiy Aug 5, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
387 changes: 387 additions & 0 deletions AZURE_MONITOR_SETUP_GUIDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,387 @@
# Azure Monitor Metrics Toolset Setup Guide

## Issue Description

When running HolmesGPT from outside an AKS cluster (such as from a local development environment), you may encounter this error:

```
Cannot determine if Azure Monitor metrics is enabled because this environment is not running inside an AKS cluster. Please run this from within your AKS cluster or provide cluster details.
```

This happens because the Azure Monitor metrics toolset is designed to auto-detect cluster configuration when running inside AKS pods, but fails when running externally.

## Solution

The Azure Monitor metrics toolset now has enhanced auto-detection capabilities using kubectl and Azure CLI. It can automatically discover AKS clusters from external environments.

### Step 1: Prerequisites

Ensure you have the following tools installed and configured:

1. **kubectl** - Connected to your AKS cluster
2. **Azure CLI** - Logged in with `az login`
3. **Correct Subscription Context** - Azure CLI must be set to the same subscription as your AKS cluster

**Critical Requirement - Subscription Context:**
The toolset uses Azure CLI to discover AKS cluster resource IDs. If your Azure CLI is set to a different subscription than your AKS cluster, auto-detection will fail.

**Verification Commands:**
```bash
# Check Azure CLI login status
az account show

# List all accessible subscriptions
az account list --output table

# Find your cluster's subscription (if unsure)
kubectl get nodes -o jsonpath='{.items[0].metadata.labels}' | grep -o 'subscriptions/[^/]*'

# Set correct subscription context
az account set --subscription <cluster-subscription-id>

# Verify kubectl connection to AKS cluster
kubectl config current-context
kubectl cluster-info
```

### Step 2: Automatic Detection (Recommended)

The toolset can now automatically detect your AKS cluster if kubectl is connected:

```yaml
toolsets:
azuremonitor-metrics:
auto_detect_cluster: true # Enable auto-detection via kubectl and Azure CLI
tool_calls_return_data: true
```

### Step 3: Manual Configuration (If Auto-Detection Fails)

If automatic detection doesn't work, configure manually:

```yaml
toolsets:
azuremonitor-metrics:
auto_detect_cluster: false
tool_calls_return_data: true
# Option 1: Provide full details
azure_monitor_workspace_endpoint: "https://your-workspace.prometheus.monitor.azure.com/"
cluster_name: "your-aks-cluster-name"
cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx"
# Optional: Query performance tuning
default_step_seconds: 3600 # Default step size for range queries (1 hour)
min_step_seconds: 60 # Minimum allowed step size (1 minute)
max_data_points: 1000 # Maximum data points per query

# Option 2: Just provide cluster_resource_id (toolset will discover workspace)
# cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx"
```

### How Auto-Detection Works

The enhanced detection mechanism uses multiple methods:

1. **kubectl Analysis**: Examines current kubectl context and cluster server URL
2. **Node Labels**: Checks AKS-specific node labels for cluster resource ID
3. **Azure CLI Integration**: Uses `az aks list` to find matching clusters
4. **Server URL Parsing**: Extracts cluster name and region from AKS API server URL

### Manual Configuration (If Needed)

If auto-detection fails, you can manually configure:

#### Method 1: Using Azure CLI

```bash
# List your AKS clusters
az aks list --output table

# Get specific cluster details
az aks show --resource-group <your-resource-group> --name <your-cluster-name> --query id --output tsv

# Example output:
# /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-cluster
```

#### Method 2: Using kubectl

```bash
# Check current context
kubectl config current-context

# Get cluster info
kubectl cluster-info

# Check if connected to AKS (look for azmk8s.io in server URL)
kubectl config view --minify --output jsonpath='{.clusters[].cluster.server}'
```

#### Method 3: Update Configuration

Edit the `config.yaml` file with your cluster details:

```yaml
toolsets:
azuremonitor-metrics:
auto_detect_cluster: false
tool_calls_return_data: true
azure_monitor_workspace_endpoint: "https://myworkspace-abc123.prometheus.monitor.azure.com/"
cluster_name: "my-aks-cluster"
cluster_resource_id: "/subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-aks-cluster"
```

### Step 4: Verify kubectl Connection

Make sure kubectl is connected to your AKS cluster:

```bash
# Check current context
kubectl config current-context

# Test connection
kubectl get nodes

# Verify you're connected to AKS (server URL should contain azmk8s.io)
kubectl cluster-info
```

### Step 5: Ensure Azure Authentication

Make sure you have Azure credentials configured. The toolset uses Azure DefaultAzureCredential, which supports:

1. **Azure CLI** (recommended for local development):
```bash
az login
az account set --subscription <your-subscription-id>
```

2. **Environment Variables**:
```bash
export AZURE_CLIENT_ID="your-client-id"
export AZURE_CLIENT_SECRET="your-client-secret"
export AZURE_TENANT_ID="your-tenant-id"
export AZURE_SUBSCRIPTION_ID="your-subscription-id"
```

3. **Managed Identity** (when running in Azure)

### Step 6: Verify Required Permissions

Ensure your Azure credentials have the following permissions:

- **Reader** role on the AKS cluster resource
- **Reader** role on the Azure Monitor workspace
- **Monitoring Reader** role for querying metrics
- Permission to execute Azure Resource Graph queries

### Step 7: Test the Configuration

Run HolmesGPT again to test:

```bash
poetry run python3 holmes_cli.py ask "is azure monitor metrics enabled for this cluster?" --model="azure/gpt-4.1"
```

## Alternative: Simplified Configuration

If you only provide the cluster resource ID, the toolset will attempt to automatically discover the associated Azure Monitor workspace:

```yaml
toolsets:
azuremonitor-metrics:
auto_detect_cluster: false
cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx"
```

This approach uses Azure Resource Graph queries to find the workspace configuration automatically.

## Troubleshooting

### "Azure Monitor managed Prometheus is not enabled"

This means your AKS cluster doesn't have Azure Monitor managed Prometheus enabled. Enable it using:

```bash
az aks update \
--resource-group <your-resource-group> \
--name <your-cluster-name> \
--enable-azure-monitor-metrics
```

### "Authentication failed"

1. Verify you're logged in to Azure CLI: `az account show`
2. Check you have the correct subscription selected: `az account set --subscription <id>`
3. Verify your permissions on the cluster and workspace resources

### "Cluster not found" or "No AKS cluster specified" (with kubectl connected)

**Cause:** Azure CLI subscription context doesn't match AKS cluster subscription

**Symptoms:**
- kubectl is connected and working
- Azure CLI is logged in
- But toolset can't find the cluster

**Solution:**
1. Find your cluster's subscription:
```bash
kubectl get nodes -o jsonpath='{.items[0].metadata.labels}' | grep -o 'subscriptions/[^/]*'
```

2. Check current Azure CLI subscription:
```bash
az account show --query id -o tsv
```

3. Set correct subscription context:
```bash
az account set --subscription <cluster-subscription-id>
```

4. Retry Holmes command:
```bash
poetry run python3 holmes_cli.py ask "get cluster resource id" --model="azure/gpt-4.1"
```

### "Query returned no results"

1. Verify the cluster name is correct
2. Check if metrics are actually being collected in Azure Monitor
3. Try disabling auto-cluster filtering temporarily:

```bash
poetry run python3 holmes_cli.py ask "run a prometheus query: up" --model="azure/gpt-4.1"
```

## Benefits of External Configuration

Running HolmesGPT externally (not in AKS) provides several advantages:

1. **Development Environment**: Test queries and troubleshooting from your local machine
2. **CI/CD Integration**: Include in automated pipelines for cluster health checks
3. **Multi-Cluster Support**: Configure multiple clusters and switch between them
4. **Enhanced Security**: Run with specific permissions rather than cluster-wide access

## Alert Investigation Workflow

The toolset supports a comprehensive two-step alert investigation workflow:

### Step 1: List Active Alerts

```bash
# Get a beautiful formatted list of all active alerts
poetry run python3 holmes_cli.py ask "show me all active azure monitor metric alerts" --model="azure/gpt-4.1"
```

This displays alerts with:
- **Visual formatting** with icons and colors
- **Alert type identification** (Prometheus Metric Alert)
- **Full Alert IDs** in code blocks for easy copying
- **Complete metadata** including queries, severity, and status

### Step 2: Investigate Specific Alert

```bash
# Investigate a specific alert using its full Alert ID
poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/12345 --model="azure/gpt-4.1"
```

This provides:
- **AI-powered root cause analysis**
- **Focused investigation** of the specific alert
- **Correlation with cluster metrics and events**

## Example Usage After Configuration

Once configured, you can use Azure Monitor metrics queries and alert investigation:

```bash
# Check cluster health
poetry run python3 holmes_cli.py ask "what is the current resource utilization of this cluster?" --model="azure/gpt-4.1"

# List active alerts with beautiful formatting
poetry run python3 holmes_cli.py ask "show me all active azure monitor metric alerts" --model="azure/gpt-4.1"

# Investigate specific alert
poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/84b776ef-64ae-3da4-1f14-cf02a24f0007 --model="azure/gpt-4.1"

# Investigate specific issues
poetry run python3 holmes_cli.py ask "show me pods with high memory usage in the last hour" --model="azure/gpt-4.1"

# Custom PromQL queries
poetry run python3 holmes_cli.py ask "run this prometheus query: container_cpu_usage_seconds_total" --model="azure/gpt-4.1"
```

The toolset will automatically add cluster filtering to ensure queries are scoped to your specific cluster.

## Built-in Diagnostic Runbooks

Azure Monitor alert investigation includes **automatic diagnostic runbooks** that guide the LLM through systematic troubleshooting steps - no setup required!

### Automatic Activation

```bash
# Runbooks work automatically - no configuration needed!
poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/12345 --model="azure/gpt-4.1"

# The LLM automatically follows built-in diagnostic runbooks
```

### Optional: Custom Runbooks

For additional customization, you can add custom runbooks:

```bash
# Copy example runbooks for customization (optional)
cp examples/azuremonitor_runbooks.yaml ~/.holmes/runbooks.yaml

# Test with custom runbooks
poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/12345 --model="azure/gpt-4.1"
```

### What Runbooks Provide

**Systematic Investigation:**
- 10-step diagnostic methodology for all Azure Monitor alerts
- Specialized workflows for CPU, memory, and pod-related alerts
- Comprehensive coverage from alert analysis to remediation recommendations

**Enhanced LLM Guidance:**
- Automatic tool selection and usage
- Structured analysis approach
- Consistent investigation quality
- Best practice recommendations

### Runbook Benefits

**For Generic Alerts:**
- Alert context analysis and resource identification
- Metric correlation and trend analysis
- Event timeline analysis and log examination
- Root cause hypothesis and impact assessment

**For Specific Alert Types:**
- **CPU Alerts**: Focus on throttling, performance, and scaling
- **Memory Alerts**: Emphasize leak detection and OOM analysis
- **Pod Alerts**: Concentrate on scheduling and configuration issues

### Configuration

Create custom runbooks for your environment:

```yaml
# ~/.holmes/runbooks.yaml
runbooks:
- match:
source_type: "azuremonitoralerts"
issue_name: ".*production.*"
instructions: >
Production alert investigation (high priority):
- Immediate impact assessment
- Check business SLAs and metrics
- Escalate if customer-facing
- Prepare incident communication
```

With runbooks configured, every Azure Monitor alert investigation becomes a guided, systematic process that ensures comprehensive analysis and faster resolution.
Loading