diff --git a/AZURE_MONITOR_SETUP_GUIDE.md b/AZURE_MONITOR_SETUP_GUIDE.md
new file mode 100644
index 000000000..e85a09f1a
--- /dev/null
+++ b/AZURE_MONITOR_SETUP_GUIDE.md
@@ -0,0 +1,387 @@
+# Azure Monitor Metrics Toolset Setup Guide
+
+## Issue Description
+
+When running HolmesGPT from outside an AKS cluster (such as from a local development environment), you may encounter this error:
+
+```
+Cannot determine if Azure Monitor metrics is enabled because this environment is not running inside an AKS cluster. Please run this from within your AKS cluster or provide cluster details.
+```
+
+This happens because the Azure Monitor metrics toolset is designed to auto-detect cluster configuration when running inside AKS pods, but fails when running externally.
+
+## Solution
+
+The Azure Monitor metrics toolset now has enhanced auto-detection capabilities using kubectl and Azure CLI. It can automatically discover AKS clusters from external environments.
+
+### Step 1: Prerequisites
+
+Ensure you have the following tools installed and configured:
+
+1. **kubectl** - Connected to your AKS cluster
+2. **Azure CLI** - Logged in with `az login`
+3. **Correct Subscription Context** - Azure CLI must be set to the same subscription as your AKS cluster
+
+**Critical Requirement - Subscription Context:**
+The toolset uses Azure CLI to discover AKS cluster resource IDs. If your Azure CLI is set to a different subscription than your AKS cluster, auto-detection will fail.
+
+**Verification Commands:**
+```bash
+# Check Azure CLI login status
+az account show
+
+# List all accessible subscriptions
+az account list --output table
+
+# Find your cluster's subscription (if unsure) from the node provider ID
+kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | grep -o 'subscriptions/[^/]*'
+
+# Set correct subscription context
+az account set --subscription <subscription-id>
+
+# Verify kubectl connection to AKS cluster
+kubectl config current-context
+kubectl cluster-info
+```
+
+### Step 2: Automatic Detection (Recommended)
+
+The toolset can now automatically detect your AKS cluster if kubectl is connected:
+
+```yaml
+toolsets:
+  azuremonitor-metrics:
+    auto_detect_cluster: true  # Enable auto-detection via kubectl and Azure CLI
+    tool_calls_return_data: true
+```
+
+### Step 3: Manual Configuration (If Auto-Detection Fails)
+
+If automatic detection doesn't work, configure manually:
+
+```yaml
+toolsets:
+  azuremonitor-metrics:
+    auto_detect_cluster: false
+    tool_calls_return_data: true
+    # Option 1: Provide full details
+    azure_monitor_workspace_endpoint: "https://your-workspace.prometheus.monitor.azure.com/"
+    cluster_name: "your-aks-cluster-name"
+    cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx"
+    # Optional: Query performance tuning
+    default_step_seconds: 3600  # Default step size for range queries (1 hour)
+    min_step_seconds: 60        # Minimum allowed step size (1 minute)
+    max_data_points: 1000       # Maximum data points per query
+
+    # Option 2: Just provide cluster_resource_id (toolset will discover workspace)
+    # cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx"
+```
+
+### How Auto-Detection Works
+
+The enhanced detection mechanism uses multiple methods:
+
+1. **kubectl Analysis**: Examines the current kubectl context and cluster server URL
+2. **Node Labels**: Checks AKS-specific node labels for the cluster resource ID
+3. **Azure CLI Integration**: Uses `az aks list` to find matching clusters
+4. **Server URL Parsing**: Extracts the cluster name and region from the AKS API server URL
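+
+To see roughly what this auto-detection does, you can replicate it by hand. The commands below are an illustrative sketch of those lookups; the toolset's exact logic may differ:
+
+```bash
+# 1. API server URL of the current context (AKS server URLs contain azmk8s.io)
+kubectl config view --minify --output jsonpath='{.clusters[].cluster.server}'
+
+# 2. AKS-specific node label holding the node resource group (MC_<rg>_<cluster>_<region>)
+kubectl get nodes -o jsonpath="{.items[0].metadata.labels['kubernetes\.azure\.com/cluster']}"
+
+# 3. Clusters visible to Azure CLI in the current subscription, for matching
+az aks list --query "[].{name:name,fqdn:fqdn,id:id}" --output table
+```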
+
+### Manual Configuration (If Needed)
+
+If auto-detection fails, you can manually configure:
+
+#### Method 1: Using Azure CLI
+
+```bash
+# List your AKS clusters
+az aks list --output table
+
+# Get specific cluster details
+az aks show --resource-group <resource-group> --name <cluster-name> --query id --output tsv
+
+# Example output:
+# /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-cluster
+```
+
+#### Method 2: Using kubectl
+
+```bash
+# Check current context
+kubectl config current-context
+
+# Get cluster info
+kubectl cluster-info
+
+# Check if connected to AKS (look for azmk8s.io in server URL)
+kubectl config view --minify --output jsonpath='{.clusters[].cluster.server}'
+```
+
+#### Method 3: Update Configuration
+
+Edit the `config.yaml` file with your cluster details:
+
+```yaml
+toolsets:
+  azuremonitor-metrics:
+    auto_detect_cluster: false
+    tool_calls_return_data: true
+    azure_monitor_workspace_endpoint: "https://myworkspace-abc123.prometheus.monitor.azure.com/"
+    cluster_name: "my-aks-cluster"
+    cluster_resource_id: "/subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-aks-cluster"
+```
+
+### Step 4: Verify kubectl Connection
+
+Make sure kubectl is connected to your AKS cluster:
+
+```bash
+# Check current context
+kubectl config current-context
+
+# Test connection
+kubectl get nodes
+
+# Verify you're connected to AKS (server URL should contain azmk8s.io)
+kubectl cluster-info
+```
+
+### Step 5: Ensure Azure Authentication
+
+Make sure you have Azure credentials configured. The toolset uses Azure DefaultAzureCredential, which supports:
+
+1. **Azure CLI** (recommended for local development):
+   ```bash
+   az login
+   az account set --subscription <subscription-id>
+   ```
+
+2. **Environment Variables**:
+   ```bash
+   export AZURE_CLIENT_ID="your-client-id"
+   export AZURE_CLIENT_SECRET="your-client-secret"
+   export AZURE_TENANT_ID="your-tenant-id"
+   export AZURE_SUBSCRIPTION_ID="your-subscription-id"
+   ```
+
+3. **Managed Identity** (when running in Azure)
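+
+Whichever credential source you use, you can sanity-check it before involving the toolset. The command below is a quick diagnostic sketch: it asks Azure CLI for a token with the Azure Monitor managed Prometheus audience, which is the same kind of token DefaultAzureCredential must be able to obtain:
+
+```bash
+# If this fails, the toolset's Azure calls will fail too
+az account get-access-token --resource https://prometheus.monitor.azure.com --query expiresOn -o tsv
+```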
+
+### Step 6: Verify Required Permissions
+
+Ensure your Azure credentials have the following permissions:
+
+- **Reader** role on the AKS cluster resource
+- **Reader** role on the Azure Monitor workspace
+- **Monitoring Reader** role for querying metrics
+- Permission to execute Azure Resource Graph queries
+
+### Step 7: Test the Configuration
+
+Run HolmesGPT again to test:
+
+```bash
+poetry run python3 holmes_cli.py ask "is azure monitor metrics enabled for this cluster?" --model="azure/gpt-4.1"
+```
+
+## Alternative: Simplified Configuration
+
+If you only provide the cluster resource ID, the toolset will attempt to automatically discover the associated Azure Monitor workspace:
+
+```yaml
+toolsets:
+  azuremonitor-metrics:
+    auto_detect_cluster: false
+    cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx"
+```
+
+This approach uses Azure Resource Graph queries to find the workspace configuration automatically.
+
+## Troubleshooting
+
+### "Azure Monitor managed Prometheus is not enabled"
+
+This means your AKS cluster doesn't have Azure Monitor managed Prometheus enabled. Enable it using:
+
+```bash
+az aks update \
+  --resource-group <resource-group> \
+  --name <cluster-name> \
+  --enable-azure-monitor-metrics
+```
+
+### "Authentication failed"
+
+1. Verify you're logged in to Azure CLI: `az account show`
+2. Check you have the correct subscription selected: `az account set --subscription <subscription-id>`
+3. Verify your permissions on the cluster and workspace resources
+
+### "Cluster not found" or "No AKS cluster specified" (with kubectl connected)
+
+**Cause:** Azure CLI subscription context doesn't match the AKS cluster subscription
+
+**Symptoms:**
+- kubectl is connected and working
+- Azure CLI is logged in
+- But the toolset can't find the cluster
+
+**Solution:**
+1. Find your cluster's subscription:
+   ```bash
+   kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | grep -o 'subscriptions/[^/]*'
+   ```
+
+2. Check current Azure CLI subscription:
+   ```bash
+   az account show --query id -o tsv
+   ```
+
+3. Set correct subscription context:
+   ```bash
+   az account set --subscription <subscription-id>
+   ```
+
+4. Retry the Holmes command:
+   ```bash
+   poetry run python3 holmes_cli.py ask "get cluster resource id" --model="azure/gpt-4.1"
+   ```
+
+### "Query returned no results"
+
+1. Verify the cluster name is correct
+2. Check if metrics are actually being collected in Azure Monitor
+3. Try disabling auto-cluster filtering temporarily:
+
+```bash
+poetry run python3 holmes_cli.py ask "run a prometheus query: up" --model="azure/gpt-4.1"
+```
+
+## Benefits of External Configuration
+
+Running HolmesGPT externally (not in AKS) provides several advantages:
+
+1. **Development Environment**: Test queries and troubleshooting from your local machine
+2. **CI/CD Integration**: Include in automated pipelines for cluster health checks
+3. **Multi-Cluster Support**: Configure multiple clusters and switch between them
+4. **Enhanced Security**: Run with specific permissions rather than cluster-wide access
+
+## Alert Investigation Workflow
+
+The toolset supports a comprehensive two-step alert investigation workflow:
+
+### Step 1: List Active Alerts
+
+```bash
+# Get a beautifully formatted list of all active alerts
+poetry run python3 holmes_cli.py ask "show me all active azure monitor metric alerts" --model="azure/gpt-4.1"
+```
+
+This displays alerts with:
+- **Visual formatting** with icons and colors
+- **Alert type identification** (Prometheus Metric Alert)
+- **Full Alert IDs** in code blocks for easy copying
+- **Complete metadata** including queries, severity, and status
+
+### Step 2: Investigate Specific Alert
+
+```bash
+# Investigate a specific alert using its full Alert ID
+poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/12345 --model="azure/gpt-4.1"
+```
+
+This provides:
+- **AI-powered root cause analysis**
+- **Focused investigation** of the specific alert
+- **Correlation with cluster metrics and events**
+
+## Example Usage After Configuration
+
+Once configured, you can use Azure Monitor metrics queries and alert investigation:
+
+```bash
+# Check cluster health
+poetry run python3 holmes_cli.py ask "what is the current resource utilization of this cluster?" 
--model="azure/gpt-4.1" + +# List active alerts with beautiful formatting +poetry run python3 holmes_cli.py ask "show me all active azure monitor metric alerts" --model="azure/gpt-4.1" + +# Investigate specific alert +poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/84b776ef-64ae-3da4-1f14-cf02a24f0007 --model="azure/gpt-4.1" + +# Investigate specific issues +poetry run python3 holmes_cli.py ask "show me pods with high memory usage in the last hour" --model="azure/gpt-4.1" + +# Custom PromQL queries +poetry run python3 holmes_cli.py ask "run this prometheus query: container_cpu_usage_seconds_total" --model="azure/gpt-4.1" +``` + +The toolset will automatically add cluster filtering to ensure queries are scoped to your specific cluster. + +## Built-in Diagnostic Runbooks + +Azure Monitor alert investigation includes **automatic diagnostic runbooks** that guide the LLM through systematic troubleshooting steps - no setup required! + +### Automatic Activation + +```bash +# Runbooks work automatically - no configuration needed! +poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/12345 --model="azure/gpt-4.1" + +# The LLM automatically follows built-in diagnostic runbooks +``` + +### Optional: Custom Runbooks + +For additional customization, you can add custom runbooks: + +```bash +# Copy example runbooks for customization (optional) +cp examples/azuremonitor_runbooks.yaml ~/.holmes/runbooks.yaml + +# Test with custom runbooks +poetry run python3 holmes_cli.py investigate azuremonitormetrics /subscriptions/.../alerts/12345 --model="azure/gpt-4.1" +``` + +### What Runbooks Provide + +**Systematic Investigation:** +- 10-step diagnostic methodology for all Azure Monitor alerts +- Specialized workflows for CPU, memory, and pod-related alerts +- Comprehensive coverage from alert analysis to remediation recommendations + +**Enhanced LLM Guidance:** +- Automatic tool selection and usage +- Structured analysis approach +- Consistent investigation quality +- Best practice recommendations + +### Runbook Benefits + +**For Generic Alerts:** +- Alert context analysis and resource identification +- Metric correlation and trend analysis +- Event timeline analysis and log examination +- Root cause hypothesis and impact assessment + +**For Specific Alert Types:** +- **CPU Alerts**: Focus on throttling, performance, and scaling +- **Memory Alerts**: Emphasize leak detection and OOM analysis +- **Pod Alerts**: Concentrate on scheduling and configuration issues + +### Configuration + +Create custom runbooks for your environment: + +```yaml +# ~/.holmes/runbooks.yaml +runbooks: + - match: + source_type: "azuremonitoralerts" + issue_name: ".*production.*" + instructions: > + Production alert investigation (high priority): + - Immediate impact assessment + - Check business SLAs and metrics + - Escalate if customer-facing + - Prepare incident communication +``` + +With runbooks configured, every Azure Monitor alert investigation becomes a guided, systematic process that ensures comprehensive analysis and faster resolution. diff --git a/README.md b/README.md index 0cb0d9363..1572607dd 100644 --- a/README.md +++ b/README.md @@ -28,26 +28,28 @@ HolmesGPT integrates with popular observability and cloud platforms. 
The followi | Data Source | Status | Notes | |-------------|--------|-------| -| [ArgoCD **ArgoCD**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/argocd/) | ✅ | Get status, history and manifests and more of apps, projects and clusters | -| [AWS RDS **AWS RDS**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/aws/) | ✅ | Fetch events, instances, slow query logs and more | -| [Confluence **Confluence**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/confluence/) | ✅ | Private runbooks and documentation | -| [Coralogix Logs **Coralogix Logs**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/coralogix-logs/) | ✅ | Retrieve logs for any resource | -| [Datetime **Datetime**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/datetime/) | ✅ | Date and time-related operations | -| [Docker **Docker**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/docker/) | ✅ | Get images, logs, events, history and more | -| [GitHub **GitHub**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/github/) | 🟡 Beta | Remediate alerts by opening pull requests with fixes | -| [DataDog **DataDog**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/datadog/) | 🟡 Beta | Fetches log data from datadog | -| [Loki **Grafana Loki**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/grafanaloki/) | ✅ | Query logs for Kubernetes resources or any query | -| [Tempo **Grafana Tempo**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/grafanatempo/) | ✅ | Fetch trace info, debug issues like high latency in application. | -| [Helm **Helm**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/helm/) | ✅ | Release status, chart metadata, and values | -| [Internet **Internet**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/internet/) | ✅ | Public runbooks, community docs etc | -| [Kafka **Kafka**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/kafka/) | ✅ | Fetch metadata, list consumers and topics or find lagging consumer groups | -| [Kubernetes **Kubernetes**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/kubernetes/) | ✅ | Pod logs, K8s events, and resource status (kubectl describe) | -| [NewRelic **NewRelic**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/newrelic/) | 🟡 Beta | Investigate alerts, query tracing data | -| [OpenSearch **OpenSearch**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/opensearch-status/) | ✅ | Query health, shard, and settings related info of one or more clusters| -| [Prometheus **Prometheus**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/prometheus/) | ✅ | Investigate alerts, query metrics and generate PromQL queries | -| [RabbitMQ **RabbitMQ**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/rabbitmq/) | ✅ | Info about partitions, memory/disk alerts to troubleshoot split-brain scenarios and more | -| [Robusta **Robusta**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/robusta/) | ✅ | Multi-cluster monitoring, historical change data, user-configured runbooks, PromQL graphs and more | -| [Slab **Slab**](https://robusta-dev.github.io/holmesgpt/data-sources/builtin-toolsets/slab/) | ✅ | Team knowledge base and runbooks on demand | +| [ArgoCD 
**ArgoCD**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/argocd.html) | ✅ | Get status, history and manifests and more of apps, projects and clusters | +| [AWS RDS **AWS RDS**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/aws.html) | ✅ | Fetch events, instances, slow query logs and more | +| [Azure Monitor Metrics **Azure Monitor Metrics**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/azuremonitor-metrics.html) | ✅ | Query Azure Monitor managed Prometheus metrics for AKS cluster analysis and troubleshooting. Supports investigating fired alerts. | +| [Azure Monitor Logs **Azure Monitor Logs**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/azuremonitor-logs.html) | ✅ | Detect Azure Monitor Container Insights and provide Log Analytics workspace details for AKS cluster log analysis via Azure MCP server integration. | +| [Confluence **Confluence**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/confluence.html) | ✅ | Private runbooks and documentation | +| [Coralogix Logs **Coralogix Logs**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/coralogix_logs.html) | ✅ | Retrieve logs for any resource | +| [Datetime **Datetime**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/datetime.html) | ✅ | Date and time-related operations | +| [Docker **Docker**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/docker.html) | ✅ | Get images, logs, events, history and more | +| GitHub **GitHub** | 🟡 Beta | Remediate alerts by opening pull requests with fixes | +| DataDog **DataDog** | 🟡 Beta | Fetches log data from datadog | +| [Loki **Grafana Loki**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/grafanaloki.html) | ✅ | Query logs for Kubernetes resources or any query | +| [Tempo **Grafana Tempo**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/grafanatempo.html) | ✅ | Fetch trace info, debug issues like high latency in application. 
| +| [Helm **Helm**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/helm.html) | ✅ | Release status, chart metadata, and values | +| [Internet **Internet**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/internet.html) | ✅ | Public runbooks, community docs etc | +| [Kafka **Kafka**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/kafka.html) | ✅ | Fetch metadata, list consumers and topics or find lagging consumer groups | +| [Kubernetes **Kubernetes**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/kubernetes.html) | ✅ | Pod logs, K8s events, and resource status (kubectl describe) | +| NewRelic **NewRelic** | 🟡 Beta | Investigate alerts, query tracing data | +| [OpenSearch **OpenSearch**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/opensearch.html) | ✅ | Query health, shard, and settings related info of one or more clusters| +| [Prometheus **Prometheus**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/prometheus.html) | ✅ | Investigate alerts, query metrics and generate PromQL queries | +| [RabbitMQ **RabbitMQ**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/rabbitmq.html) | ✅ | Info about partitions, memory/disk alerts to troubleshoot split-brain scenarios and more | +| [Robusta **Robusta**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/robusta.html) | ✅ | Multi-cluster monitoring, historical change data, user-configured runbooks, PromQL graphs and more | +| [Slab **Slab**](https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/slab.html) | ✅ | Team knowledge base and runbooks on demand | ### 🚀 End-to-End Automation @@ -57,6 +59,7 @@ HolmesGPT can fetch alerts/tickets to investigate from external systems, then wr |-------------------------|-----------|-------| | Slack | 🟡 Beta | [Demo.](https://www.loom.com/share/afcd81444b1a4adfaa0bbe01c37a4847) Tag HolmesGPT bot in any Slack message | | Prometheus/AlertManager | ✅ | Robusta SaaS or HolmesGPT CLI | +| Azure Monitor Alerts | ✅ | HolmesGPT CLI only - investigate Prometheus metric alerts | | PagerDuty | ✅ | HolmesGPT CLI only | | OpsGenie | ✅ | HolmesGPT CLI only | | Jira | ✅ | HolmesGPT CLI only | diff --git a/config.example.yaml b/config.example.yaml index f56f08e69..f01d8ae3b 100644 --- a/config.example.yaml +++ b/config.example.yaml @@ -27,3 +27,23 @@ # give the LLM explicit instructions how to investigate certain alerts # try adding runbooks to get better results on known alerts #custom_runbooks: ["examples/custom_runbooks.yaml"] + +# Azure Monitor toolsets configuration (for AKS cluster monitoring) +# Both toolsets are enabled by default and auto-detect configuration when running in AKS +#toolsets: +# azuremonitormetrics: +# auto_detect_cluster: true # Default: auto-detect AKS cluster and Azure Monitor workspace +# cache_duration_seconds: 1800 # Cache duration for Azure API calls (30 minutes) +# # Manual configuration (optional, for explicit setup): +# #azure_monitor_workspace_endpoint: "https://your-workspace.prometheus.monitor.azure.com/" +# #cluster_name: "your-aks-cluster-name" +# #cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx" +# +# azuremonitorlogs: +# enabled: true # Required: toolset is disabled by default +# auto_detect_cluster: true # Default: auto-detect AKS cluster and Container Insights +# # Manual configuration (optional, for explicit setup): +# #cluster_name: "your-aks-cluster-name" +# 
#cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx" +# #log_analytics_workspace_id: "12345678-1234-1234-1234-123456789012" +# #log_analytics_workspace_resource_id: "/subscriptions/xxx/resourcegroups/xxx/providers/microsoft.operationalinsights/workspaces/xxx" diff --git a/docs/toolsets/azuremonitor-logs.md b/docs/toolsets/azuremonitor-logs.md new file mode 100644 index 000000000..8e1828660 --- /dev/null +++ b/docs/toolsets/azuremonitor-logs.md @@ -0,0 +1,210 @@ +# Azure Monitor Logs Toolset + +## Overview + +The Azure Monitor Logs toolset detects Azure Monitor Container Insights configuration and provides Log Analytics workspace details for AKS cluster log analysis. This toolset **does not execute KQL queries directly** - instead, it provides workspace configuration details for external Azure MCP server integration. + +## Purpose + +- **Container Insights Detection**: Automatically detect if Azure Monitor Container Insights is enabled for AKS clusters +- **Workspace Discovery**: Extract Log Analytics workspace ID and full Azure resource ID +- **Stream Profiling**: Identify enabled log streams and map them to Log Analytics tables +- **Azure MCP Integration**: Provide configuration details for external Azure MCP server setup + +## Prerequisites + +### Azure Dependencies +```bash +pip install azure-identity azure-mgmt-resourcegraph azure-mgmt-resource +``` + +### Authentication +The toolset uses `DefaultAzureCredential` for authentication. Configure one of: + +- **Azure CLI** (recommended for development): `az login` +- **Managed Identity** (recommended for production in AKS) +- **Service Principal** (alternative method) + +### Required Permissions +- **Reader** role on AKS cluster resource +- **Reader** role on Log Analytics workspace +- **Reader** role on Data Collection Rules + +### AKS Requirements +- AKS cluster with Azure Monitor Container Insights enabled +- kubectl access to target cluster (for auto-detection) + +## Configuration + +!!! warning "Explicit Enablement Required" + The Azure Monitor Logs toolset is **disabled by default**. You must explicitly enable it in your configuration. + +### Basic Configuration (Required) +```yaml +toolsets: + azuremonitorlogs: + enabled: true # Required - toolset is disabled by default + auto_detect_cluster: true +``` + +### Advanced Configuration +```yaml +toolsets: + azuremonitorlogs: + enabled: true + auto_detect_cluster: true + cluster_name: "my-aks-cluster" + cluster_resource_id: "/subscriptions/.../managedClusters/my-cluster" + log_analytics_workspace_id: "12345678-1234-1234-1234-123456789012" + log_analytics_workspace_resource_id: "/subscriptions/.../workspaces/my-workspace" +``` + +## Available Tools + +### 1. check_aks_cluster_context +Checks if the current environment is running inside an AKS cluster. + +**Usage**: Automatically called when investigating AKS-related issues. + +### 2. get_aks_cluster_resource_id +Gets the full Azure resource ID of the current AKS cluster. + +**Usage**: Auto-detects cluster information for workspace discovery. + +### 3. check_azure_monitor_logs_enabled +Detects if Azure Monitor Container Insights is enabled and provides workspace details. + +**Usage**: Primary tool for Container Insights detection and Azure MCP configuration. 
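+
+Under the hood, this kind of detection can be approximated with an Azure Resource Graph query. The sketch below is illustrative only - the toolset's actual query and filters may differ:
+
+```bash
+# Requires the resource-graph extension: az extension add --name resource-graph
+# Hypothetical query: data collection rule associations pointing at AKS clusters
+# (Container Insights links a cluster to its Log Analytics workspace this way)
+az graph query -q "insightsresources
+| where type =~ 'microsoft.insights/datacollectionruleassociations'
+| where id contains 'Microsoft.ContainerService/managedClusters'" -o table
+```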
+ +**Returns**: +- Container Insights status +- Log Analytics workspace ID and resource ID +- Available log streams +- Data collection configuration +- Azure MCP server configuration guidance + +## Azure MCP Server Integration + +This toolset provides configuration details for Azure MCP server setup: + +### Workspace Configuration +- **Workspace ID** (GUID): For KQL query execution +- **Workspace Resource ID** (full path): For ARM API access +- **Cluster Filter**: Required `_ResourceId` value for query filtering + +### Log Stream Information +- **Available Streams**: ContainerLogV2, KubePodInventory, KubeEvents, etc. +- **Table Mapping**: Stream names to Log Analytics table names +- **Sample Queries**: Properly filtered KQL query examples + +## Critical KQL Query Requirements + +**ALL KQL queries executed via Azure MCP server MUST include cluster filtering:** + +```kql +| where _ResourceId == "/subscriptions/.../clusters/your-cluster" +``` + +### Common Log Analytics Tables +Based on detected streams: +- **ContainerLogV2**: Container stdout/stderr logs +- **KubePodInventory**: Pod metadata and status +- **KubeEvents**: Kubernetes events +- **KubeNodeInventory**: Node information +- **Perf**: Performance metrics +- **InsightsMetrics**: Additional metrics data + +## Example Workflows + +### Initial Setup Detection +```bash +holmes ask "Is Azure Monitor logs enabled for this cluster?" +``` + +Expected response includes: +- Container Insights enablement status +- Log Analytics workspace details +- Available log streams +- Azure MCP configuration guidance + +### Log Analysis Workflow +1. **Detection**: Use toolset to detect workspace configuration +2. **MCP Setup**: Configure Azure MCP server with detected details +3. **Querying**: Use Azure MCP server for actual KQL log queries + +### Stream Availability Check +```bash +holmes ask "What log data is available for this cluster?" +``` + +Returns available streams and corresponding Log Analytics tables. + +## Troubleshooting + +### Authentication Issues +``` +Error: DefaultAzureCredential failed to retrieve a token +``` +**Solution**: Verify Azure authentication configuration (CLI, managed identity, or service principal). + +### AKS Cluster Not Detected +``` +Error: Could not determine AKS cluster resource ID +``` +**Solutions**: +- Check kubectl connection: `kubectl config current-context` +- Verify Azure CLI login: `az account show` +- Manually specify cluster resource ID in configuration + +### Container Insights Not Found +``` +Error: Azure Monitor Container Insights (logs) is not enabled +``` +**Solutions**: +- Enable Container Insights in Azure portal +- Verify Data Collection Rules configuration +- Check Azure Resource Graph API permissions + +### Missing Dependencies +``` +ImportError: No module named 'azure.mgmt.resourcegraph' +``` +**Solution**: Install required packages: +```bash +pip install azure-mgmt-resourcegraph +``` + +## Debug Mode +Enable debug logging: +```bash +export HOLMES_LOG_LEVEL=DEBUG +holmes ask "check azure monitor logs status" +``` + +## Integration with Azure MCP Server + +1. **Use this toolset** to detect workspace configuration +2. **Configure Azure MCP server** with detected workspace details: + ```json + { + "workspace_id": "detected-workspace-guid", + "workspace_resource_id": "/subscriptions/.../workspaces/name", + "cluster_filter": "| where _ResourceId == \"cluster-resource-id\"" + } + ``` +3. 
**Execute KQL queries** via Azure MCP server with mandatory cluster filtering + +## Support + +For toolset-specific issues: +1. Verify Azure authentication and permissions +2. Check AKS cluster connectivity +3. Confirm Container Insights configuration +4. Test Azure Resource Graph API access + +For Azure MCP server issues, refer to Azure MCP server documentation. + +## Related Documentation +- [Azure Monitor Metrics Toolset](azuremonitor-metrics.md) +- [Azure Container Insights Documentation](https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview) +- [Azure Resource Graph Documentation](https://docs.microsoft.com/en-us/azure/governance/resource-graph/) diff --git a/docs/toolsets/azuremonitor-metrics.md b/docs/toolsets/azuremonitor-metrics.md new file mode 100644 index 000000000..8fec67f82 --- /dev/null +++ b/docs/toolsets/azuremonitor-metrics.md @@ -0,0 +1,600 @@ +# Azure Monitor Metrics Toolset + +## Overview + +The Azure Monitor Metrics toolset enables HolmesGPT to query Azure Monitor managed Prometheus metrics for AKS cluster analysis and troubleshooting. This toolset is designed to work from external environments (such as local development machines, CI/CD pipelines, or management servers) and connects to AKS clusters remotely via Azure APIs, providing filtered access to cluster-specific metrics. + +## Key Features + +- **Automatic AKS Detection**: Auto-discovers AKS cluster context and Azure resource ID +- **Azure Monitor Integration**: Seamlessly connects to Azure Monitor managed Prometheus +- **Cluster-Specific Filtering**: Automatically filters all queries by cluster name to ensure relevant results +- **PromQL Support**: Execute both instant and range queries using standard PromQL syntax +- **Secure Authentication**: Uses Azure DefaultAzureCredential for secure, credential-free authentication in AKS + +## Prerequisites + +### 1. AKS Cluster with Azure Monitor managed Prometheus + +Your AKS cluster must have Azure Monitor managed Prometheus enabled. This can be configured during cluster creation or added to existing clusters. + +**Enable via Azure CLI:** +```bash +az aks update \ + --resource-group myResourceGroup \ + --name myAKSCluster \ + --enable-azure-monitor-metrics \ + --azure-monitor-workspace-resource-id /subscriptions/{subscription-id}/resourceGroups/myResourceGroup/providers/microsoft.monitor/accounts/myAzureMonitorWorkspace +``` + +### 2. Azure Credentials and Subscription Context + +**Critical Requirements:** +1. **Azure CLI** - Logged in with `az login` +2. **Correct Subscription Context** - Azure CLI must be set to the same subscription as your AKS cluster +3. **kubectl** - Connected to your AKS cluster + +When running inside AKS, the toolset uses Managed Identity automatically. For external environments (local development, CI/CD), ensure Azure credentials are properly configured and the subscription context is correct. + +**Verification Commands:** +```bash +# Check Azure CLI login status +az account show + +# Verify current subscription matches your AKS cluster +az account list --output table + +# Set correct subscription if needed +az account set --subscription + +# Verify kubectl context +kubectl config current-context +``` + +**Why This Matters:** +The toolset uses Azure CLI to discover cluster resource IDs and must search within the correct subscription context. If your Azure CLI is set to a different subscription than your AKS cluster, auto-detection will fail. + +### 3. 
Required Permissions + +- **Reader** role on the AKS cluster resource +- **Reader** role on the Azure Monitor workspace +- **Monitoring Reader** role for querying metrics +- Access to execute Azure Resource Graph queries + +## Available Tools + +### `check_aks_cluster_context` +Verifies if the current environment is running inside an AKS cluster. + +**Usage:** +```bash +holmes ask "Am I running in an AKS cluster?" +``` + +### `get_aks_cluster_resource_id` +Retrieves the full Azure resource ID of the current AKS cluster. + +**Usage:** +```bash +holmes ask "What is the Azure resource ID of this cluster?" +``` + +### `check_azure_monitor_prometheus_enabled` +Checks if Azure Monitor managed Prometheus is enabled for the AKS cluster and retrieves workspace details. + +**Parameters:** +- `cluster_resource_id` (optional): Azure resource ID of the AKS cluster + +**Usage:** +```bash +holmes ask "Is Azure Monitor managed Prometheus enabled for this cluster?" +``` + +### `execute_azuremonitor_prometheus_query` +Executes instant PromQL queries against the Azure Monitor workspace. + +**Parameters:** +- `query` (required): The PromQL query to execute +- `description` (required): Description of what the query analyzes +- `auto_cluster_filter` (optional): Enable/disable automatic cluster filtering (default: true) + +**Usage:** +```bash +holmes ask "Query current CPU usage across all pods using Azure Monitor metrics" +``` + +### `get_active_prometheus_alerts` +Retrieves active/fired Prometheus metric alerts for the AKS cluster. + +**Parameters:** +- `cluster_resource_id` (optional): Azure resource ID of the AKS cluster +- `alert_id` (optional): Specific alert ID to investigate + +**Usage:** +```bash +holmes ask "Show all active Prometheus alerts for this cluster" +holmes ask "Get details for alert ID /subscriptions/.../alerts/12345" +``` + +### `execute_azuremonitor_prometheus_range_query` +Executes range PromQL queries for time-series data analysis. 
+ +**Parameters:** +- `query` (required): The PromQL query to execute +- `description` (required): Description of what the query analyzes +- `start` (optional): Start time for the query range +- `end` (optional): End time for the query range +- `step` (required): Query resolution step width +- `output_type` (required): How to interpret results (Plain, Bytes, Percentage, CPUUsage) +- `auto_cluster_filter` (optional): Enable/disable automatic cluster filtering (default: true) + +**Usage:** +```bash +holmes ask "Show CPU usage trends over the last hour using Azure Monitor metrics" +``` + +## Configuration + +### Automatic Configuration + +The toolset can attempt to auto-discover AKS clusters using Azure credentials: + +```yaml +# ~/.holmes/config.yaml +toolsets: + azuremonitor-metrics: + auto_detect_cluster: true # Attempts auto-discovery +``` + +### Manual Configuration (Recommended) + +For reliable operation and explicit cluster targeting: + +```yaml +# ~/.holmes/config.yaml +toolsets: + azuremonitor-metrics: + azure_monitor_workspace_endpoint: "https://your-workspace.prometheus.monitor.azure.com/" + cluster_name: "your-aks-cluster-name" + cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx" + auto_detect_cluster: false + tool_calls_return_data: true + # Optional: Query performance tuning + default_step_seconds: 3600 # Default step size for range queries (1 hour) + min_step_seconds: 60 # Minimum allowed step size (1 minute) + max_data_points: 1000 # Maximum data points per query +``` + +### Configuration Options Explained + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `azure_monitor_workspace_endpoint` | string | None | Full URL to your Azure Monitor workspace Prometheus endpoint | +| `cluster_name` | string | None | Name of your AKS cluster (used for query filtering) | +| `cluster_resource_id` | string | None | Full Azure resource ID of your AKS cluster | +| `auto_detect_cluster` | boolean | true | Enable automatic cluster detection via kubectl/Azure CLI | +| `tool_calls_return_data` | boolean | true | Include raw Prometheus data in tool responses | +| `default_step_seconds` | integer | 3600 | Default step size for range queries (in seconds) | +| `min_step_seconds` | integer | 60 | Minimum allowed step size to prevent excessive data points | +| `max_data_points` | integer | 1000 | Maximum data points per query to prevent token limit issues | + +#### Query Performance Tuning + +The step size configuration helps manage query performance and token limits: + +- **`default_step_seconds`**: Used when no step is specified in range queries. 1 hour (3600s) provides good balance between detail and performance. +- **`min_step_seconds`**: Prevents overly granular queries that could return excessive data points and hit token limits. +- **`max_data_points`**: Automatically adjusts step size if a query would return too many data points. + +**Example scenarios:** +- **24-hour query with default 1-hour step**: Returns ~24 data points +- **24-hour query with 1-minute step**: Would return 1440 points, gets auto-adjusted to stay under `max_data_points` +- **1-hour query with 30-second step**: Gets adjusted to `min_step_seconds` (60s) minimum + +## Alert Investigation + +Holmes supports investigating Azure Monitor Prometheus metric alerts with a two-step workflow for better control and focused analysis. 
+
+### Step 1: List Active Alerts
+
+First, get a list of all active alerts to see what's currently firing:
+
+```bash
+holmes ask "show me all active azure monitor metric alerts"
+```
+
+This will display all active alerts with beautiful formatting, icons, and their full alert IDs, allowing you to select which specific alert to investigate.
+
+### Step 2: Investigate Specific Alert
+
+Investigate a specific alert by providing its full alert ID:
+
+```bash
+holmes investigate azuremonitormetrics /subscriptions/12345/providers/Microsoft.AlertsManagement/alerts/abcd-1234
+```
+
+This targeted approach allows you to:
+1. See all available alerts at once with enhanced visual formatting
+2. Choose which alert is most critical to investigate
+3. Get focused AI-powered root cause analysis for the selected alert
+
+### Alert Information Displayed
+
+For each alert, Holmes shows the following, with beautiful formatting and icons:
+- **🔔 Alert Header**: Cluster name with visual indicator
+- **📋 Alert ID**: Full Azure resource ID in code blocks for easy copying
+- **🔬 Alert Type**: "Prometheus Metric Alert" for clear identification
+- **⚡ Query**: The Prometheus metric query that triggered the alert
+- **📝 Description**: Alert description and configuration
+- **🎯 Status Line**: Severity, status, and fired time in one organized line
+- **Visual Indicators**: Color-coded severity icons (🔴🟠🟡🔵) and status icons (🚨👁️✅)
+
+### Example Output
+
+```
+🔔 **Active Prometheus Alerts for Cluster: my-cluster**
+
+💡 **How to investigate:** Copy an Alert ID and run:
+   `holmes investigate azuremonitormetrics <alert-id>`
+
+────────────────────────────────────────────────────────────────────────────────
+
+**1. 🔴 High CPU Usage** 🚨
+   📋 **Alert ID:** `/subscriptions/.../alerts/12345`
+   🔬 **Type:** Prometheus Metric Alert
+   ⚡ **Query:** `container_cpu_usage_seconds_total`
+   📝 **Description:** Container CPU usage above 80%
+   🎯 **Severity:** Critical | **State:** New | **Condition:** Fired
+   🕒 **Fired Time:** 2025-01-15 17:30 UTC
+
+**2. 🟡 Memory Pressure** 🚨
+   📋 **Alert ID:** `/subscriptions/.../alerts/67890`
+   🔬 **Type:** Prometheus Metric Alert
+   ⚡ **Query:** `container_memory_working_set_bytes`
+   📝 **Description:** Container memory usage above 90%
+   🎯 **Severity:** Warning | **State:** New | **Condition:** Fired
+   🕒 **Fired Time:** 2025-01-15 17:25 UTC
+```
+
+### Visual Elements
+
+The alert listing includes professional visual elements:
+- **🔔** Header with cluster identification
+- **💡** Clear instructions for next steps
+- **📋** Code blocks for easy Alert ID copying
+- **🔬** Alert type identification
+- **⚡** Query information for troubleshooting
+- **🎯** Organized metadata display
+- **Severity Icons**: 🔴 Critical, 🟠 Error, 🟡 Warning, 🔵 Info
+- **Status Icons**: 🚨 New, 👁️ Acknowledged, ✅ Closed
+
+### Common Options
+
+```bash
+# Basic investigation (recommended)
+holmes investigate azuremonitormetrics /subscriptions/.../alerts/12345
+
+# Save results to JSON file
+holmes investigate azuremonitormetrics /subscriptions/.../alerts/12345 --json-output-file alert-analysis.json
+
+# Verbose output for debugging
+holmes investigate azuremonitormetrics /subscriptions/.../alerts/12345 --verbose
+```
+
+## Common Use Cases
+
+### Resource Monitoring
+
+Query resource utilization metrics:
+
+```bash
+holmes ask "Show current memory usage for all pods in this cluster"
+holmes ask "Which nodes have high CPU utilization?"
+holmes ask "Are there any pods with memory issues?"
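+
+# The toolset injects the cluster label into ad-hoc PromQL for you, so a raw
+# query prompt like this is automatically scoped (query text is illustrative):
+holmes ask "run this prometheus query: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)"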
+``` + +### Application Health + +Monitor application-specific metrics: + +```bash +holmes ask "Check pod restart counts in the last hour" +holmes ask "Show deployment replica status" +holmes ask "Are there any failed pods?" +``` + +### Infrastructure Analysis + +Analyze cluster infrastructure: + +```bash +holmes ask "Check node status and conditions" +holmes ask "Show filesystem usage across nodes" +holmes ask "Monitor network traffic patterns" +``` + +### Troubleshooting + +Use for specific troubleshooting scenarios: + +```bash +holmes ask "Investigate high CPU usage in the frontend namespace" +holmes ask "Check for resource constraints causing pod evictions" +holmes ask "Analyze performance during the last deployment" +``` + +## Automatic Cluster Filtering + +All PromQL queries are automatically enhanced with cluster-specific filtering: + +**Original Query:** +```promql +container_cpu_usage_seconds_total +``` + +**Enhanced Query:** +```promql +container_cpu_usage_seconds_total{cluster="my-cluster-name"} +``` + +This ensures queries only return metrics for the current AKS cluster, avoiding confusion when multiple clusters send metrics to the same Azure Monitor workspace. + +## Common Metrics for AKS + +The toolset works with standard Prometheus metrics available in Azure Monitor: + +- `container_cpu_usage_seconds_total` - CPU usage by containers +- `container_memory_working_set_bytes` - Memory usage by containers +- `kube_pod_status_phase` - Pod status information +- `kube_node_status_condition` - Node health status +- `container_fs_usage_bytes` - Filesystem usage +- `kube_deployment_status_replicas` - Deployment replica status +- `container_network_receive_bytes_total` - Network ingress +- `container_network_transmit_bytes_total` - Network egress + +## Troubleshooting + +### "No AKS cluster specified" +- Provide cluster_resource_id parameter in queries +- Configure cluster details in config.yaml file +- Ensure Azure credentials have access to the target cluster +- See AZURE_MONITOR_SETUP_GUIDE.md for detailed configuration instructions + +### "Azure Monitor managed Prometheus is not enabled" +- Enable managed Prometheus in Azure portal or via CLI +- Verify data collection rule configuration +- Ensure cluster is associated with Azure Monitor workspace + +### "Query returned no results" +- Verify the metric exists in your cluster +- Check if cluster filtering is too restrictive +- Try disabling auto-cluster filtering temporarily + +### Authentication Issues +- Verify Azure credentials are properly configured +- Check required permissions on cluster and workspace +- Ensure Managed Identity is enabled for in-cluster execution + +## Security Considerations + +- Uses Azure Managed Identity when running in AKS for secure, keyless authentication +- Respects Azure RBAC permissions and access controls +- Read-only access to metrics data +- All queries are automatically scoped to the current cluster + +## Best Practices + +1. **Use Descriptive Queries**: Always provide meaningful descriptions for your PromQL queries +2. **Leverage Auto-Detection**: Let the toolset auto-discover cluster configuration when possible +3. **Time Range Awareness**: Use appropriate time ranges for range queries based on investigation needs +4. **Resource Scope**: Take advantage of automatic cluster filtering to focus on relevant metrics +5. 
**Error Handling**: Check toolset status before executing queries to ensure proper setup + +## Integration with Other Toolsets + +The Azure Monitor Metrics toolset complements other HolmesGPT toolsets: + +- **Kubernetes Toolset**: Combine metrics with pod logs and events +- **Bash Toolset**: Use kubectl commands alongside metric queries +- **Internet Toolset**: Research metric meanings and troubleshooting approaches + +## Example Investigation Workflow + +1. **Setup Verification:** + ```bash + holmes ask "Check if Azure Monitor metrics toolset is available" + ``` + +2. **Environment Discovery:** + ```bash + holmes ask "Am I running in an AKS cluster and is Azure Monitor enabled?" + ``` + +3. **Health Overview:** + ```bash + holmes ask "Show overall cluster health metrics" + ``` + +4. **Specific Investigation:** + ```bash + holmes ask "Investigate high CPU usage in the production namespace over the last 2 hours" + ``` + +5. **Root Cause Analysis:** + ```bash + holmes ask "Correlate CPU spikes with pod restart events" + ``` + +This workflow leverages the toolset's automatic setup and cluster filtering to provide focused, relevant insights for AKS troubleshooting scenarios. + +## Diagnostic Runbooks + +The Azure Monitor toolset includes comprehensive diagnostic runbooks that enhance the LLM's investigation capabilities. These runbooks provide systematic, step-by-step guidance for analyzing Azure Monitor alerts. + +### How Runbooks Work + +When investigating Azure Monitor alerts using `holmes investigate azuremonitormetrics `, the runbooks automatically: + +1. **Guide the LLM** through systematic diagnostic steps +2. **Ensure comprehensive coverage** of all relevant investigation areas +3. **Provide structured methodology** for root cause analysis +4. **Suggest appropriate tool usage** for each diagnostic step + +### Available Runbooks + +#### Generic Azure Monitor Runbook + +A comprehensive diagnostic runbook that applies to all Azure Monitor alerts: + +```yaml +runbooks: + - match: + source_type: "azuremonitoralerts" + instructions: > + 10-step systematic diagnostic approach covering: + - Alert context analysis + - Current state assessment + - Resource investigation + - Metric correlation and trends + - Event timeline analysis + - Log analysis + - Dependency analysis + - Root cause hypothesis + - Impact assessment + - Remediation recommendations +``` + +#### Specialized Runbooks + +**High CPU Usage Alerts:** +- Focuses on CPU-specific metrics and throttling +- Analyzes application performance patterns +- Provides scaling and capacity recommendations + +**Memory-Related Alerts:** +- Emphasizes memory leak detection +- Checks for OOM conditions +- Analyzes memory pressure impacts + +**Pod Waiting State Alerts:** +- Focuses on pod lifecycle and scheduling issues +- Checks resource availability and constraints +- Analyzes image and configuration problems + +### Configuring Runbooks + +#### Built-in Runbooks (Automatic) + +Azure Monitor diagnostic runbooks are **built into Holmes** and work automatically without any configuration: + +```bash +# Runbooks are automatically active - no setup required! 
+holmes investigate azuremonitormetrics <alert-id>
+```
+
+#### Optional: Using the Example Configuration
+
+For additional customization, you can also copy the provided example configuration:
+
+```bash
+# Copy the example runbook configuration for customization
+cp examples/azuremonitor_runbooks.yaml ~/.holmes/runbooks.yaml
+
+# Or merge with existing runbooks
+cat examples/azuremonitor_runbooks.yaml >> ~/.holmes/runbooks.yaml
+```
+
+#### Custom Runbooks
+
+Create custom runbooks for your specific environment:
+
+```yaml
+runbooks:
+  - match:
+      issue_name: ".*MyApplication.*"
+      source_type: "azuremonitoralerts"
+    instructions: >
+      Custom diagnostic steps for MyApplication alerts:
+      1. Check application-specific metrics
+      2. Verify database connectivity
+      3. Analyze custom logs in /app/logs
+      4. Check integration with external services
+```
+
+#### Configuration Location
+
+Runbooks can be configured in:
+- `~/.holmes/runbooks.yaml` (user-specific)
+- `./runbooks.yaml` (project-specific)
+- Via the `--runbooks-file` command line option
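+
+For example, pointing an investigation at a project-specific runbook file with the `--runbooks-file` option mentioned above (the alert ID here is a placeholder):
+
+```bash
+holmes investigate azuremonitormetrics <alert-id> --runbooks-file ./runbooks.yaml
+```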
+
+### Runbook Matching
+
+Runbooks are matched based on:
+- **source_type**: "azuremonitoralerts" for all Azure Monitor alerts
+- **issue_name**: Pattern matching against alert names
+- **Custom criteria**: Additional matching rules as needed
+
+**Priority:** More specific runbooks take precedence over generic ones.
+
+### Benefits
+
+**Enhanced Investigation Quality:**
+- Systematic approach ensures nothing is missed
+- Consistent methodology across all alerts
+- Leverages best practices for each alert type
+
+**Improved Efficiency:**
+- Faster time to resolution
+- Reduced investigation overhead
+- Clear next steps and recommendations
+
+**Knowledge Sharing:**
+- Codifies expert knowledge in runbooks
+- Consistent investigation approach across teams
+- Easy to customize for specific environments
+
+### Example Usage
+
+```bash
+# Investigate with automatic runbook guidance
+holmes investigate azuremonitormetrics /subscriptions/.../alerts/12345
+
+# The LLM will automatically:
+# 1. Load the appropriate runbook
+# 2. Follow systematic diagnostic steps
+# 3. Use suggested tools and queries
+# 4. Provide structured analysis and recommendations
+```
+
+### Customization Examples
+
+**Team-Specific Runbooks:**
+```yaml
+runbooks:
+  - match:
+      issue_name: ".*frontend.*"
+      source_type: "azuremonitoralerts"
+    instructions: >
+      Frontend application alert investigation:
+      - Check CDN and load balancer metrics
+      - Analyze user experience metrics
+      - Verify API gateway connectivity
+      - Check browser error rates
+```
+
+**Environment-Specific Runbooks:**
+```yaml
+runbooks:
+  - match:
+      issue_name: ".*production.*"
+      source_type: "azuremonitoralerts"
+    instructions: >
+      Production alert investigation (high priority):
+      - Immediate impact assessment
+      - Escalation to on-call team if severe
+      - Check business metrics and SLAs
+      - Prepare incident communication
+```
+
+This runbook system transforms Azure Monitor alert investigation from ad-hoc analysis to systematic, comprehensive diagnostics guided by proven methodologies.
diff --git a/examples/azuremonitor_runbooks.yaml b/examples/azuremonitor_runbooks.yaml
new file mode 100644
index 000000000..4f3cb0307
--- /dev/null
+++ b/examples/azuremonitor_runbooks.yaml
@@ -0,0 +1,255 @@
+runbooks:
+  # Generic diagnostic runbook for all Azure Monitor alerts
+  - match:
+      source_type: "azuremonitoralerts"
+    instructions: >
+      Perform comprehensive diagnostic analysis for this Azure Monitor alert using a systematic approach:
+
+      1. ALERT CONTEXT ANALYSIS:
+         - Extract and analyze the alert details: metric name, **PromQL query**, **rule description**, threshold, severity, and current state
+         - Identify the timeframe when the alert fired and duration
+         - **CRITICAL: Note the extracted PromQL query from the alert - this is the exact query that triggered the alert**
+         - **ESSENTIAL: Read the rule description to understand what the alert is monitoring and why it's important**
+         - Determine the affected resources (pods, nodes, services, namespaces) from the alert
+         - Understand what the alert is measuring and why it triggered based on both query and description
+
+      1.1. EXECUTE ALERT'S ORIGINAL QUERY FOR TIMELINE ANALYSIS:
+         - **MANDATORY: Use `execute_alert_promql_query` tool with the alert ID to run the exact query that triggered the alert**
+         - Start with a 1-2 hour time range to see recent trends: `execute_alert_promql_query` with time_range "2h"
+         - Extend to longer periods (6h, 1d) if needed to identify patterns
+         - This shows you the exact metric behavior that caused the alert to fire
+         - Compare the timeline with the alert's fired time to understand the progression
+
+      2. CURRENT STATE ASSESSMENT:
+         - Use kubectl commands to check the current status of affected resources
+         - Use `execute_azuremonitor_prometheus_query` for instant/current values of the alert metric
+         - Compare current values with the alert threshold to see if the issue persists
+         - Check if the alert is still active or has resolved
+
+      3. RESOURCE INVESTIGATION:
+         - Examine the health and status of affected pods, nodes, or services
+         - Check resource requests, limits, and actual utilization
+         - Look for recent changes in replica counts, node status, or resource allocation
+         - Identify any resource constraints or scheduling issues
+
+      4. METRIC CORRELATION AND TRENDS:
+         - Use `execute_azuremonitor_prometheus_range_query` for related metrics around the alert timeframe
+         - **Analyze the alert's query timeline first**, then expand to related metrics
+         - Query CPU: `rate(container_cpu_usage_seconds_total[5m])` with range queries
+         - Query Memory: `container_memory_working_set_bytes` with range queries
+         - Query Pod Status: `kube_pod_status_phase` to check pod states over time
+         - Query Deployment Status: `kube_deployment_status_replicas` vs `kube_deployment_spec_replicas`
+         - Use time ranges that cover the alert period plus some buffer (2h, 4h, 6h)
+         - Look for sudden spikes, gradual increases, or cyclical patterns
+         - **Timeline Correlation**: Compare all metric timelines with the alert fired time
+
+      4.1. SPECIFIC QUERY ANALYSIS TECHNIQUES:
+         - **For deployment alerts**: Query replica mismatches and pod creation/deletion patterns
+         - **For resource alerts**: Compare requests vs limits vs actual usage over time
+         - **For node alerts**: Check node conditions and resource pressure metrics
+         - **For application alerts**: Correlate with request rates, error rates, response times
+         - **Use step intervals** appropriate for the time range (60s for detailed, 300s for longer periods)
+
+      5. EVENT TIMELINE ANALYSIS:
+         - Check Kubernetes events around the alert firing time using kubectl
+         - Look for recent deployments, pod restarts, scaling events, or configuration changes
+         - Correlate timing of events with the alert onset to identify potential triggers
+         - Check for any failed operations or warning events
+
+      6. 
LOG ANALYSIS: + - Examine logs from affected pods and containers for error messages or warnings + - Look for application-specific errors, performance issues, or resource exhaustion messages + - Check system logs if the alert is infrastructure-related (node issues, etc.) + - Search for patterns that coincide with the alert timing + + 7. DEPENDENCY AND SERVICE ANALYSIS: + - If alert affects application pods, check dependent services and databases + - Verify network connectivity and service discovery functionality + - Check ingress controllers, load balancers, or external dependencies + - Analyze service mesh metrics if applicable + + 8. ROOT CAUSE HYPOTHESIS: + - Based on metrics, events, logs, and resource analysis, form clear hypotheses about the root cause + - Prioritize the most likely causes based on evidence strength + - Explain the chain of events that led to the alert condition + - Distinguish between symptoms and actual root causes + + 9. IMPACT ASSESSMENT: + - Determine what users or services are affected by this alert condition + - Assess the severity and scope of the impact + - Check if there are cascading effects on other systems or services + - Evaluate business impact if applicable + + 10. REMEDIATION RECOMMENDATIONS: + - Suggest immediate actions to resolve the alert condition if appropriate + - Recommend monitoring steps to verify resolution + - Propose preventive measures to avoid recurrence + - Identify any configuration changes or scaling actions needed + + Use available toolsets systematically: Azure Monitor Metrics for querying, Kubernetes for resource analysis, and Bash for kubectl commands. Present findings clearly with supporting data and specific next steps. + + # Specific runbooks for common alert patterns can be added here + # These will take precedence over the generic runbook above + + - match: + issue_name: ".*[Hh]igh [Cc][Pp][Uu].*" + source_type: "azuremonitoralerts" + instructions: > + This is a high CPU usage alert. Focus your diagnostic analysis on: + + 1. CPU-SPECIFIC ANALYSIS: + - Query CPU usage trends using container_cpu_usage_seconds_total and rate() functions + - Identify which specific pods/containers are consuming the most CPU + - Check CPU requests and limits vs actual usage + - Analyze CPU throttling metrics if available + + 2. APPLICATION PERFORMANCE: + - Look for application logs indicating performance issues or increased load + - Check for recent deployments that might have introduced performance regressions + - Analyze request rates and response times if this is a web application + - Look for resource-intensive operations or batch jobs + + 3. SCALING AND CAPACITY: + - Check if horizontal or vertical scaling is needed + - Analyze historical CPU patterns to determine if this is normal load growth + - Verify auto-scaling configuration and behavior + - Assess node capacity and CPU availability + + Follow the standard diagnostic steps but emphasize CPU-related metrics and analysis. + + - match: + issue_name: ".*[Mm]emory.*" + source_type: "azuremonitoralerts" + instructions: > + This is a memory-related alert. Focus your diagnostic analysis on: + + 1. MEMORY-SPECIFIC ANALYSIS: + - Query memory usage using container_memory_working_set_bytes and related metrics + - Check for memory leaks by analyzing memory usage trends over time + - Examine memory requests and limits vs actual usage + - Look for Out of Memory (OOM) kills in events and logs + + 2. 
APPLICATION MEMORY BEHAVIOR: + - Check application logs for memory-related errors or warnings + - Look for garbage collection issues in managed runtime applications (Java, .NET) + - Analyze heap dumps or memory profiles if available + - Check for inefficient memory usage patterns + + 3. SYSTEM IMPACT: + - Verify node memory availability and pressure conditions + - Check if memory pressure is affecting other pods on the same node + - Look for swap usage if applicable + - Assess overall cluster memory capacity + + Follow the standard diagnostic steps but emphasize memory-related metrics and analysis. + + - match: + issue_name: ".*[Pp]od.*[Ww]aiting.*" + source_type: "azuremonitoralerts" + instructions: > + This alert indicates pods are in a waiting state. Focus your analysis on: + + 1. POD STATE ANALYSIS: + - Check pod status and container states using kubectl describe + - Identify the specific waiting reason (ImagePullBackOff, CrashLoopBackOff, etc.) + - Examine pod events for scheduling or startup issues + - Check init containers if they exist + + 2. RESOURCE AND SCHEDULING: + - Verify node capacity and resource availability for scheduling + - Check resource requests vs available cluster capacity + - Look for node selectors, affinity rules, or taints preventing scheduling + - Examine persistent volume claims if storage is involved + + 3. IMAGE AND CONFIGURATION: + - Verify image availability and registry connectivity + - Check image pull secrets and registry authentication + - Validate container configuration and environment variables + - Look for configuration map or secret mounting issues + + Follow the standard diagnostic steps but emphasize pod lifecycle and scheduling analysis. + + - match: + issue_name: ".*[Dd]eployment.*[Rr]eplica.*" + source_type: "azuremonitoralerts" + instructions: > + This is a deployment replica mismatch alert (like KubeDeploymentReplicasMismatch). Focus your analysis on: + + 1. DEPLOYMENT REPLICA ANALYSIS: + - **FIRST: Use `execute_alert_promql_query` with the alert ID to see the replica mismatch timeline** + - Query `kube_deployment_status_replicas` vs `kube_deployment_spec_replicas` with range queries + - Query `kube_deployment_status_ready_replicas` to see how many replicas are actually ready + - Query `kube_deployment_status_available_replicas` to check availability + - Use 2-6 hour time ranges to understand the pattern of replica mismatches + + 2. POD CREATION AND LIFECYCLE: + - Query `kube_pod_status_phase` for the affected deployment to see pod states over time + - Check for pod creation/deletion patterns with `rate(kube_pod_created[5m])` + - Look for pods stuck in Pending, ContainerCreating, or other waiting states + - Use kubectl to check current pod status and recent events + + 3. RESOURCE AND SCALING ANALYSIS: + - Check if resource constraints are preventing pod creation + - Query node capacity: `kube_node_status_allocatable` for CPU/memory + - Query resource requests vs availability + - Check if HPA (Horizontal Pod Autoscaler) is affecting replica counts + - Look for node pressure or taints preventing scheduling + + 4. DEPLOYMENT CONFIGURATION: + - Check deployment strategy (RollingUpdate vs Recreate) + - Verify resource requests and limits are reasonable + - Check for pod disruption budgets that might limit scaling + - Look for node selectors or affinity rules affecting placement + + 5. 
TIMELINE CORRELATION: + - Compare replica mismatch timing with deployment rollouts or scaling events + - Look for infrastructure changes or node issues at the same time + - Check if the mismatch is temporary (during deployments) or persistent + - Correlate with resource pressure or cluster capacity issues + + 6. ROOT CAUSE IDENTIFICATION: + - Distinguish between temporary scaling delays vs persistent issues + - Identify if pods can't start due to resources, images, or configuration + - Check if the deployment is stuck in a rollout or experiencing constant restarts + - Determine if this is a capacity planning issue or a configuration problem + + Follow the standard diagnostic steps but emphasize deployment scaling and pod lifecycle analysis. + + - match: + issue_name: ".*[Cc]ontainer.*[Ww]aiting.*" + source_type: "azuremonitoralerts" + instructions: > + This is a container waiting alert (like KubeContainerWaiting). Focus your analysis on: + + 1. CONTAINER STATE ANALYSIS: + - **FIRST: Use `execute_alert_promql_query` with the alert ID to see the waiting container timeline** + - Query `kube_pod_container_status_waiting_reason` to identify specific waiting reasons + - Query `kube_pod_container_status_waiting` over time to see patterns + - Use kubectl describe pod to get detailed container status and events + + 2. WAITING REASON INVESTIGATION: + - **ImagePullBackOff**: Check image availability, registry access, pull secrets + - **CrashLoopBackOff**: Examine container logs for startup failures + - **CreateContainerConfigError**: Check ConfigMap/Secret references + - **InvalidImageName**: Verify image names and tags + - **ContainerCreating**: Check resource availability and volume mounts + + 3. IMAGE AND REGISTRY ANALYSIS: + - Verify image exists in the specified registry + - Check image pull secrets and registry authentication + - Test registry connectivity from nodes + - Check for image size issues or registry rate limiting + + 4. RESOURCE AND CONFIGURATION: + - Check resource requests vs node availability + - Verify volume mounts and persistent volume claims + - Check security contexts and admission controllers + - Look for init container dependencies + + 5. TIMELINE AND PATTERN ANALYSIS: + - Check if waiting is consistent or intermittent + - Correlate with deployment rollouts or configuration changes + - Look for node-specific issues (does it happen on specific nodes?) + - Check for cluster-wide vs application-specific problems + + Follow the standard diagnostic steps but emphasize container startup and configuration analysis. 
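+
+      Example (illustrative) starting points, assuming standard kube-state-metrics series:
+      `sum by (reason, namespace) (kube_pod_container_status_waiting_reason == 1)` breaks down the
+      currently waiting containers by reason, and `kubectl describe pod <pod-name> -n <namespace>`
+      shows the corresponding container status and events.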
diff --git a/holmes/core/runbooks.py b/holmes/core/runbooks.py index 0f2b60d61..f1f9129aa 100644 --- a/holmes/core/runbooks.py +++ b/holmes/core/runbooks.py @@ -24,3 +24,21 @@ def get_instructions_for_issue(self, issue: Issue) -> List[str]: instructions.append(runbook.instructions) return instructions + + def get_matched_runbooks_for_issue(self, issue: Issue) -> List[Runbook]: + """Get the actual Runbook objects that match an issue (for accessing metadata like file paths)""" + matched_runbooks = [] + for runbook in self.runbooks: + if runbook.match.issue_id and not runbook.match.issue_id.match(issue.id): + continue + if runbook.match.issue_name and not runbook.match.issue_name.match( + issue.name + ): + continue + if runbook.match.source and not runbook.match.source.match( + issue.source_type + ): + continue + matched_runbooks.append(runbook) + + return matched_runbooks diff --git a/holmes/core/tool_calling_llm.py b/holmes/core/tool_calling_llm.py index cd18a260f..9c61b6d00 100644 --- a/holmes/core/tool_calling_llm.py +++ b/holmes/core/tool_calling_llm.py @@ -1,6 +1,7 @@ import concurrent.futures import json import logging +import os import textwrap from typing import Dict, List, Optional, Type, Union @@ -815,8 +816,21 @@ def investigate( runbooks.extend(instructions.instructions) if console and runbooks: + # Get the matched runbooks to show file names + matched_runbooks = self.runbook_manager.get_matched_runbooks_for_issue(issue) + runbook_files = [] + for i, runbook in enumerate(matched_runbooks, 1): + try: + file_path = runbook.get_path() + # Extract just the filename from the full path + filename = os.path.basename(file_path) + runbook_files.append(f"#{i}: {filename}") + except Exception: + # Fallback to a generic name if path is not available + runbook_files.append(f"#{i}: runbook") + console.print( - f"[bold]Analyzing with {len(runbooks)} runbooks: {runbooks}[/bold]" + f"[bold]Analyzing with {len(runbooks)} runbooks: {runbook_files}[/bold]" ) elif console: console.print( diff --git a/holmes/core/toolset_manager.py b/holmes/core/toolset_manager.py index 7177cdac4..ebf4aceb3 100644 --- a/holmes/core/toolset_manager.py +++ b/holmes/core/toolset_manager.py @@ -90,6 +90,13 @@ def _list_all_toolsets( if enable_all_toolsets: for toolset in toolsets_by_name.values(): toolset.enabled = True + + # Special handling: disable specific toolsets that should never be enabled by default + # even when enable_all_toolsets=True (CLI mode) + toolsets_to_keep_disabled = ["azuremonitorlogs"] + for toolset_name in toolsets_to_keep_disabled: + if toolset_name in toolsets_by_name: + toolsets_by_name[toolset_name].enabled = False # build-in toolset is enabled when it's explicitly enabled in the toolset or custom toolset config if self.toolsets is not None: diff --git a/holmes/main.py b/holmes/main.py index 9f8adecf2..4ab2c7c67 100644 --- a/holmes/main.py +++ b/holmes/main.py @@ -851,6 +851,97 @@ def pagerduty( write_json_file(json_output_file, results) +@investigate_app.command() +def azuremonitormetrics( + alerttype: str = typer.Option( + "prometheusmetrics", + help="Type of alerts to investigate. Currently supports 'prometheusmetrics'", + ), + alertid: str = typer.Argument( + help="Alert ID to investigate (required). 
Use 'holmes ask' to list available alert IDs first.", + ), + cluster_resource_id: Optional[str] = typer.Option( + None, + help="Azure resource ID of the AKS cluster (optional, will auto-detect if not provided)", + ), + # common options + api_key: Optional[str] = opt_api_key, + model: Optional[str] = opt_model, + config_file: Optional[Path] = opt_config_file, # type: ignore + custom_toolsets: Optional[List[Path]] = opt_custom_toolsets, + custom_runbooks: Optional[List[Path]] = opt_custom_runbooks, + max_steps: Optional[int] = opt_max_steps, + verbose: Optional[List[bool]] = opt_verbose, + json_output_file: Optional[str] = opt_json_output_file, + # advanced options for this command + system_prompt: Optional[str] = typer.Option( + "builtin://generic_investigation.jinja2", help=system_prompt_help + ), + post_processing_prompt: Optional[str] = opt_post_processing_prompt, +): + """ + Investigate a specific Azure Monitor Prometheus metric alert by ID + """ + console = init_logging(verbose) + + if alerttype != "prometheusmetrics": + console.print(f"[bold red]Error: Currently only 'prometheusmetrics' alert type is supported.[/bold red]") + return + + config = Config.load_from_file( + config_file, + api_key=api_key, + model=model, + max_steps=max_steps, + custom_toolsets_from_cli=custom_toolsets, + custom_runbooks=custom_runbooks, + ) + + # Create the issue investigator + ai = config.create_console_issue_investigator() + + # Create the Azure Monitor alerts source + try: + from holmes.plugins.sources.azuremonitoralerts import AzureMonitorAlertsSource + source = AzureMonitorAlertsSource(cluster_resource_id=cluster_resource_id) + + except Exception as e: + console.print(f"[bold red]Error: Failed to initialize Azure Monitor alerts source: {str(e)}[/bold red]") + return + + try: + # Investigate specific alert + console.print(f"[bold yellow]Fetching alert {alertid}...[/bold yellow]") + issue = source.fetch_issue(alertid) + if not issue: + console.print(f"[bold red]Alert {alertid} not found or is not a Prometheus metric alert for the cluster.[/bold red]") + console.print(f"[bold yellow]Use 'holmes ask \"show me all active azure monitor metric alerts\"' to list available alert IDs.[/bold yellow]") + return + + except Exception as e: + logging.error("Failed to fetch alert from Azure Monitor", exc_info=e) + console.print(f"[bold red]Error: Failed to fetch alert from Azure Monitor: {str(e)}[/bold red]") + return + + console.print(f"[bold yellow]Analyzing Azure Monitor Prometheus alert: {issue.name}...[/bold yellow]") + + result = ai.investigate( + issue=issue, + prompt=system_prompt, # type: ignore + console=console, + instructions=None, + post_processing_prompt=post_processing_prompt, + ) + + console.print(Rule()) + console.print(f"[bold green]AI analysis of alert: {issue.name}[/bold green]") + console.print(Markdown(result.result.replace("\n", "\n\n")), style="bold green") # type: ignore + console.print(Rule()) + + if json_output_file: + result_data = {"issue": issue.model_dump(), "result": result.model_dump()} + write_json_file(json_output_file, result_data) + @investigate_app.command() def opsgenie( opsgenie_api_key: str = typer.Option(None, help="The OpsGenie API key"), diff --git a/holmes/plugins/runbooks/azure_monitor_logs_cost_optimization.md b/holmes/plugins/runbooks/azure_monitor_logs_cost_optimization.md new file mode 100644 index 000000000..b238d943d --- /dev/null +++ b/holmes/plugins/runbooks/azure_monitor_logs_cost_optimization.md @@ -0,0 +1,1058 @@ +# Azure Monitor Container Insights Cost 
Optimization - Analysis Runbook + +## Goal + +**ALWAYS EXECUTE LIVE ANALYSIS**: When asked about Container Insights cost optimization, IMMEDIATELY run KQL queries using azure mcp server tool to analyze the actual cluster. Do NOT provide generic recommendations - ALWAYS execute the 7 KQL query sets below to get real cost data and specific USD savings calculations. + +**MANDATORY DUAL TABLE ANALYSIS**: This runbook MUST analyze BOTH ContainerLog (v1) AND ContainerLogV2 volumes/costs separately, then recommend migrating from ContainerLog to ContainerLogV2 to unlock Basic Logs tier optimization and kubernetes metadata collection (these optimizations ONLY work with ContainerLogV2). + +**MANDATORY EXECUTION**: This runbook requires active query execution, not passive reference. Execute all queries to provide cluster-specific cost optimization with actual data. + +--- + +## Analysis Workflow + +**EXECUTE IMMEDIATELY** - Do not provide theoretical recommendations: + +* **STEP 1**: Use azuremonitorlogs toolset to detect Log Analytics workspace and cluster resource ID + - If workspace/cluster not immediately detectable, use `check_aks_cluster_context` to get current cluster + - If resource group needed, use `get_aks_cluster_resource_id` with available cluster names +* **STEP 1B**: **CRITICAL - Parse Resource Group**: Extract resource group from workspace resource ID + - Parse format: `/subscriptions/{sub}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{name}` + - Extract the resource group name from the workspace resource ID for azure mcp server tool calls +* **STEP 1C**: Use azuremonitormetrics toolset to check if Azure Monitor Metrics (managed Prometheus) is enabled +* **STEP 2**: IMMEDIATELY after detecting workspace and parsing resource group, use azure mcp server tool to execute ALL 7 KQL cost analysis query sets against the detected workspace +* **STEP 3**: Replace `{CLUSTER_RESOURCE_ID}` in all queries with the actual cluster resource ID from azuremonitorlogs +* **STEP 4**: **MANDATORY DUAL ANALYSIS**: Execute BOTH ContainerLog (v1) AND ContainerLogV2 queries to compare volumes and costs +* **STEP 5**: **CRITICAL MIGRATION RECOMMENDATION**: If ContainerLog (v1) data exists, ALWAYS recommend migration to ContainerLogV2 to unlock Basic Logs tier + kubernetes metadata optimizations +* **STEP 6**: Calculate specific USD savings based on query results and provide competitive analysis +* **STEP 7**: Check Azure Monitor Metrics status and conditionally recommend enablement only if NOT already enabled +* **STEP 8**: **ALWAYS GENERATE COMPLETE EXECUTIVE REPORT** using the template below, even if some queries fail - provide available analysis and note missing data + +**MANDATORY**: Always execute queries when asked about Container Insights cost optimization - never provide generic advice without running queries first. + +**PROACTIVE DETECTION**: When asked about Container Insights cost optimization, IMMEDIATELY attempt to detect workspace and cluster information using available tools. Do NOT ask user for resource group names - try to discover them first. + +**CRITICAL**: If you detect workspace ID and cluster resource ID, DO NOT STOP - proceed immediately to execute KQL queries using azure mcp server tool. Do not ask for permission or provide "next steps" - execute the queries now. 
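+
+For example, STEP 1B's resource-group parsing can be a one-liner (illustrative sketch; the workspace resource ID below is a placeholder):
+
+```bash
+# Hypothetical example: extract the resource group from a Log Analytics workspace resource ID
+WORKSPACE_ID="/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-rg/providers/Microsoft.OperationalInsights/workspaces/my-ws"
+RESOURCE_GROUP=$(sed -E 's|.*/resourceGroups/([^/]+)/.*|\1|' <<< "$WORKSPACE_ID")
+echo "$RESOURCE_GROUP"  # -> my-rg
+```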
+ +**FALLBACK STRATEGY**: If workspace detection fails, provide a comprehensive cost optimization analysis template with typical savings estimates and clear implementation guidance, noting that actual analysis requires workspace access. + +**MANDATORY REPORT FORMAT**: After executing all queries, you MUST generate the complete executive report exactly as shown in the Executive Report Template section below. Use the exact format with tables, competitive analysis, and structured sections. Do NOT provide brief bullet points or summaries - ALWAYS generate the full structured report with all sections. + +**CRITICAL TIMESTAMP REQUIREMENT**: When generating the report, you MUST replace `{CURRENT_UTC_TIMESTAMP}` with the actual current UTC timestamp in ISO 8601 format (e.g., 2025-07-20T17:16:00Z). DO NOT use any hardcoded or old dates - always use the current date and time. + +**REQUIRED SECTIONS**: Your response MUST include: +1. Complete header with "🤖 AI-GENERATED AZURE MONITOR LOGS COST OPTIMIZATION REPORT" +2. Cost Overview table with Current/Optimized/Savings columns +3. Competitive Analysis table comparing Azure vs all competitors +4. Action Items with specific USD savings amounts +5. Total Potential Savings calculation + +**DUAL OUTPUT REQUIREMENT**: After executing all queries, you MUST: +1. **DISPLAY the complete executive report** in the console using the exact format from the Executive Report Template +2. **IMMEDIATELY call generate_cost_optimization_pdf** tool to create a PDF file with the same report content + +**CRITICAL**: Do NOT skip displaying the full report in the console. Users must see both the complete report on screen AND receive the PDF download link. + +### Tool Usage Requirements + +1. **azuremonitorlogs toolset**: Detect workspace ID, cluster resource ID, and current configuration +2. **Resource group parsing**: Extract resource group from workspace resource ID format: `/subscriptions/{sub}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{name}` +3. **azure mcp server tool**: Execute cost analysis queries using "monitor workspace log query" tool +4. **Usage table access**: Use workspace-level Usage analysis (no cluster filtering needed) +5. **Schema information**: Use correct column names - PodNamespace (not Namespace), LogMessage (not LogEntry), LogLevel (has level info) +6. **Tool parameters**: + - **--subscription**: Parse from workspace resource ID + - **--resource-group**: REQUIRED - Parse from workspace resource ID after `/resourceGroups/` and before next `/` + - **--workspace**: Use detected workspace GUID + - **--table-name**: Usage, ContainerLogV2, etc. + - **--query**: Individual KQL queries per table + +**CRITICAL**: All azure mcp server tool calls MUST include `--resource-group` parameter to avoid "Missing Required options" errors. + +**Important**: The Usage table IS accessible through azure mcp server tool. Do not skip Usage table queries - they are required for cost analysis. + +**Usage Table Filtering**: The Usage table does NOT have _ResourceId column. Use workspace-level analysis for Usage queries. + +**Data Tables Schema**: _ResourceId is a hidden column that does NOT appear in schema discovery but EXISTS in every data table (ContainerLogV2, Perf, InsightsMetrics, etc.) EXCEPT Usage table. This is normal Azure Monitor behavior - the column is invisible in schema but is ALWAYS usable in queries for filtering. 
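+
+As a quick sanity check (illustrative; substitute your workspace GUID and cluster resource ID), the hidden column can be queried directly:
+
+```bash
+# Hypothetical example: confirm _ResourceId filtering works even though the column is hidden in schema
+az monitor log-analytics query \
+  --workspace "<workspace-guid>" \
+  --analytics-query 'ContainerLogV2 | where TimeGenerated > ago(2h) | where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" | take 5'
+```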
+ +**Tool Execution Pattern**: +- **Resource Group Extraction**: Parse from workspace resource ID: `/subscriptions/{sub}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{name}` +- **Usage table**: Execute workspace-level queries using --table-name Usage --resource-group {RESOURCE_GROUP} +- **ContainerLogV2**: Execute cluster-specific queries with _ResourceId filtering --resource-group {RESOURCE_GROUP} +- **Each query**: Separate tool call per table with appropriate KQL query and resource group parameter + +--- + +## Cost Analysis Metrics and Queries + +### 1. Overall Usage and Cost Assessment + +**Step 1 - Find Cluster Data Types** - Execute using azure mcp server with any table name (e.g., --table-name ContainerLogV2): +```kql +find where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| summarize count() by Type +``` + +**Step 2 - Get Usage for Those Types** - Execute using azure mcp server with --table-name Usage: +```kql +Usage +| where TimeGenerated > ago(2h) +| where DataType in ("ContainerLog", "ContainerLogV2", "KubePodInventory", "Perf", "InsightsMetrics", "KubeEvents") +| summarize + TotalGB = sum(Quantity) / 1024, + TwoHourCostUSD = sum(Quantity) / 1024 * 2.99 +by DataType +| extend + DailyCostUSD = TwoHourCostUSD * 12, // 12 x 2-hour periods in 24 hours + MonthlyCostUSD = TwoHourCostUSD * 12 * 30, // Extrapolate to 30 days + OptimizationPotential = case( + DataType in ("Perf", "InsightsMetrics"), "HIGH: Move to Prometheus (80% savings)", + DataType == "KubePodInventory", "HIGH: Eliminate with ContainerLogV2 metadata collection (85% savings)", + DataType == "ContainerLog", "HIGH: Migrate to ContainerLogV2 + Basic tier (98% savings)", + DataType == "ContainerLogV2", "MEDIUM: Move to Basic tier (83% savings)", + "LOW: Keep essential logs" + ) +| order by TotalGB desc +``` + +### 2. Metrics-as-Logs Cost Analysis + +**Metrics Usage Query** - Execute via azure mcp server tool with --table-name Usage: +```kql +Usage +| where TimeGenerated > ago(2h) +| where DataType in ("Perf", "InsightsMetrics") +| summarize + MetricsAsLogsGB = sum(Quantity) / 1024, + TwoHourCostUSD = sum(Quantity) / 1024 * 2.99, + CurrentMonthlyCostUSD = sum(Quantity) / 1024 * 2.99 * 12 * 30 // 12 x 2-hour periods per day, 30 days +by DataType +| extend + PrometheusEquivalentCostUSD = CurrentMonthlyCostUSD * 0.20, + MonthlySavingsUSD = CurrentMonthlyCostUSD * 0.80, + SavingsPercentage = 80.0 +| project DataType, MetricsAsLogsGB, CurrentMonthlyCostUSD, PrometheusEquivalentCostUSD, MonthlySavingsUSD, SavingsPercentage +``` + +### 3. 
Enhanced ContainerLogV2 Namespace Analysis + +**Comprehensive Namespace Query** - Raw analysis of ALL namespaces with actionable recommendations: +```kql +ContainerLogV2 +| where TimeGenerated > ago(2h) +| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| summarize + LogCount = count(), + SizeBytes = sum(estimate_data_size(*)), + UniqueContainers = dcount(ContainerName), + UniquePods = dcount(PodName) +by PodNamespace +| extend + SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0, + TwoHourAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99, + TwoHourBasicCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 0.50, + DailyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12, + DailyBasicCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 0.50 * 12, + MonthlyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30, + MonthlyBasicCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 0.50 * 12 * 30, + PotentialMonthlySavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.49 * 12 * 30 +| extend + PercentageOfTotal = round(SizeBytes * 100.0 / toscalar( + ContainerLogV2 + | where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" + | summarize sum(estimate_data_size(*)) + ), 2), + FilteringRecommendation = case( + SizeGB > 1.0, strcat("HIGH PRIORITY: ", PodNamespace, " (", round(SizeGB, 3), " GB/month, $", round(MonthlyAnalyticsCostUSD, 2), ") - Filter to save $", round(PotentialMonthlySavingsUSD, 2), "/month"), + SizeGB > 0.1, strcat("MEDIUM: ", PodNamespace, " (", round(SizeGB, 3), " GB/month, $", round(MonthlyAnalyticsCostUSD, 2), ") - Consider filtering to save $", round(PotentialMonthlySavingsUSD, 2), "/month"), + SizeGB > 0.01, strcat("LOW: ", PodNamespace, " (", round(SizeGB, 3), " GB/month, $", round(MonthlyAnalyticsCostUSD, 2), ") - Minimal impact"), + strcat("MINIMAL: ", PodNamespace, " (", round(SizeGB, 3), " GB/month, $", round(MonthlyAnalyticsCostUSD, 2), ") - Keep") + ), + ConfigMapEntry = case( + SizeGB > 0.05, strcat('"', PodNamespace, '"'), + "" + ) +| order by SizeGB desc +``` + +**Namespace ConfigMap Generator** - Generate dynamic ConfigMap recommendations based on analysis: +```kql +ContainerLogV2 +| where TimeGenerated > ago(2h) +| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| summarize + LogCount = count(), + SizeBytes = sum(estimate_data_size(*)), + UniqueContainers = dcount(ContainerName), + UniquePods = dcount(PodName) +by PodNamespace +| extend + SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0, + MonthlyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30, + PotentialMonthlySavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.49 * 12 * 30, + PercentageOfTotal = round(SizeBytes * 100.0 / toscalar( + ContainerLogV2 + | where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" + | summarize sum(estimate_data_size(*)) + ), 2) +| where SizeGB > 0.05 // Only include namespaces with meaningful volume +| extend + FilterCategory = case( + PodNamespace in ("kube-system", "kube-public", "kube-node-lease"), "SYSTEM", + PodNamespace contains "monitoring" or PodNamespace contains "prometheus" or PodNamespace contains "grafana", "MONITORING", + PodNamespace contains "logging" or PodNamespace contains "fluent" or PodNamespace contains "elastic", "LOGGING", + PodNamespace contains "istio" or PodNamespace contains "linkerd" or PodNamespace contains "envoy", "SERVICE_MESH", + SizeGB > 1.0, "HIGH_VOLUME_APP", + "NORMAL_APP" + ) +| extend + FilterPriority = case( + FilterCategory == "SYSTEM" and SizeGB > 0.5, 1, + 
FilterCategory == "MONITORING" and SizeGB > 0.3, 2, + FilterCategory == "LOGGING" and SizeGB > 0.3, 3, + FilterCategory == "SERVICE_MESH" and SizeGB > 0.2, 4, + FilterCategory == "HIGH_VOLUME_APP", 5, + 999 + ) +| extend + ConfigMapGuidance = case( + FilterPriority <= 5, strcat("RECOMMENDED: Add '", PodNamespace, "' to exclude_namespaces (saves $", round(PotentialMonthlySavingsUSD, 2), "/month)"), + strcat("OPTIONAL: Consider '", PodNamespace, "' for filtering if needed") + ) +| order by FilterPriority asc, SizeGB desc +| project PodNamespace, SizeGB, MonthlyAnalyticsCostUSD, PotentialMonthlySavingsUSD, PercentageOfTotal, FilterCategory, ConfigMapGuidance +``` + +**Namespace Pattern Analysis** - Simple volume and cost analysis by namespace: +```kql +ContainerLogV2 +| where TimeGenerated > ago(2h) +| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| summarize + TotalSizeBytes = sum(estimate_data_size(*)), + LogCount = count(), + UniqueContainers = dcount(ContainerName), + UniquePods = dcount(PodName) +by PodNamespace +| extend + TotalSizeGB = TotalSizeBytes / 1024.0 / 1024.0 / 1024.0, + MonthlyCostUSD = TotalSizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30, + BasicLogsSavingsUSD = TotalSizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.49 * 12 * 30, + PercentageOfTotal = round(TotalSizeBytes * 100.0 / toscalar( + ContainerLogV2 + | where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" + | summarize sum(estimate_data_size(*)) + ), 2) +| order by TotalSizeGB desc +``` + +### 4. Enhanced Log Level Distribution Analysis + +**Robust Log Level Query** - Raw log level distribution analysis without filtering bias: +```kql +ContainerLogV2 +| where TimeGenerated > ago(2h) +| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| extend LogLevelParsed = case( + // Primary: Check structured LogLevel column (case-insensitive) + tolower(tostring(LogLevel)) == "info", "INFO", + tolower(tostring(LogLevel)) == "debug", "DEBUG", + tolower(tostring(LogLevel)) == "warning" or tolower(tostring(LogLevel)) == "warn", "WARN", + tolower(tostring(LogLevel)) == "error" or tolower(tostring(LogLevel)) == "err", "ERROR", + tolower(tostring(LogLevel)) == "fatal" or tolower(tostring(LogLevel)) == "critical", "FATAL", + tolower(tostring(LogLevel)) == "trace", "TRACE", + // Secondary: Pattern matching in LogMessage content (case-insensitive) + LogMessage matches regex @"(?i)\b(FATAL|CRITICAL)\b", "FATAL", + LogMessage matches regex @"(?i)\b(ERROR|ERR)\b", "ERROR", + LogMessage matches regex @"(?i)\b(WARN|WARNING)\b", "WARN", + LogMessage matches regex @"(?i)\b(INFO|INFORMATION)\b", "INFO", + LogMessage matches regex @"(?i)\b(DEBUG|DBG)\b", "DEBUG", + LogMessage matches regex @"(?i)\b(TRACE|TRC)\b", "TRACE", + // Tertiary: Log level indicators at start of message + LogMessage startswith "[ERROR]" or LogMessage startswith "ERROR:", "ERROR", + LogMessage startswith "[WARN]" or LogMessage startswith "WARN:", "WARN", + LogMessage startswith "[INFO]" or LogMessage startswith "INFO:", "INFO", + LogMessage startswith "[DEBUG]" or LogMessage startswith "DEBUG:", "DEBUG", + LogMessage startswith "[TRACE]" or LogMessage startswith "TRACE:", "TRACE", + // Quaternary: JSON structured log detection (simplified) + LogMessage contains '"level":"error"' or LogMessage contains '"level":"ERROR"', "ERROR", + LogMessage contains '"level":"warn"' or LogMessage contains '"level":"WARN"', "WARN", + LogMessage contains '"level":"info"' or LogMessage contains '"level":"INFO"', "INFO", + LogMessage contains '"level":"debug"' or LogMessage 
contains '"level":"DEBUG"', "DEBUG", + LogMessage contains '"level":"trace"' or LogMessage contains '"level":"TRACE"', "TRACE", + // Default: Categorize as UNKNOWN for analysis + "UNKNOWN" +) +| summarize + LogCount = count(), + SizeBytes = sum(estimate_data_size(*)) +by LogLevelParsed +| extend + SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0, + TwoHourAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99, + DailyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12, + MonthlyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30, + PercentageOfLogs = round(LogCount * 100.0 / toscalar( + ContainerLogV2 + | where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" + | count + ), 2), + PercentageOfVolume = round(SizeBytes * 100.0 / toscalar( + ContainerLogV2 + | where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" + | summarize sum(estimate_data_size(*)) + ), 2) +| order by SizeGB desc +``` + +**Log Level Summary Query** - Overall log level distribution for executive reporting: +```kql +ContainerLogV2 +| where TimeGenerated > ago(2h) +| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| extend HasStructuredLogLevel = case( + isnotempty(LogLevel) and LogLevel != "", "STRUCTURED", + LogMessage contains '"level"' or LogMessage contains '"severity"', "JSON_STRUCTURED", + LogMessage matches regex @"(?i)\[(DEBUG|INFO|WARN|ERROR|FATAL|TRACE)\]", "BRACKETED", + LogMessage matches regex @"(?i)\b(DEBUG|INFO|WARN|ERROR|FATAL|TRACE)\b", "KEYWORD_BASED", + "UNSTRUCTURED" +) +| summarize + TotalLogCount = count(), + TotalSizeBytes = sum(estimate_data_size(*)) +by HasStructuredLogLevel +| extend + TotalSizeGB = TotalSizeBytes / 1024.0 / 1024.0 / 1024.0, + TotalMonthlyCostUSD = TotalSizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30, + LogStructurePercentage = round(TotalLogCount * 100.0 / toscalar( + ContainerLogV2 + | where TimeGenerated > ago(2h) and _ResourceId =~ "{CLUSTER_RESOURCE_ID}" + | count + ), 1), + FilteringFeasibility = case( + HasStructuredLogLevel in ("STRUCTURED", "JSON_STRUCTURED"), "HIGH: Structured log levels available", + HasStructuredLogLevel in ("BRACKETED", "KEYWORD_BASED"), "MEDIUM: Pattern-based filtering possible", + "LOW: Requires application-level changes" + ) +| order by TotalSizeGB desc +``` + +### 5. 
Legacy ContainerLog (v1) Analysis
+
+**ContainerLog Full Analysis** - Complete analysis without namespace breakdown (ContainerLog v1 has limited schema):
+```kql
+ContainerLog
+| where TimeGenerated > ago(2h)
+| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}"
+| summarize
+    LogCount = count(),
+    SizeBytes = sum(estimate_data_size(*)),
+    UniqueContainers = dcount(Name),
+    UniquePods = dcount(tostring(split(Name, '/')[1])),
+    UniqueComputers = dcount(Computer)
+| extend
+    SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0,
+    TwoHourAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99,
+    DailyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12,
+    MonthlyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30,
+    MigrationSavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 22.41 * 12,
+    BasicTierSavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 74.7 * 12,
+    TotalMigrationSavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12
+| extend
+    MigrationRecommendation = case(
+        SizeGB > 2.0, strcat("HIGH PRIORITY: Migrate to ContainerLogV2 + Basic tier - Save $", round(SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12, 2), "/month"),
+        SizeGB > 0.5, strcat("MEDIUM: Migrate to ContainerLogV2 + Basic tier - Save $", round(SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12, 2), "/month"),
+        SizeGB > 0.0, strcat("LOW: Migrate when convenient - Save $", round(SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12, 2), "/month"),
+        "No data to migrate"
+    )
+```
+
+**ContainerLog Volume by Computer** - Analyze legacy container logs by node:
+```kql
+ContainerLog
+| where TimeGenerated > ago(2h)
+| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}"
+| summarize
+    LogCount = count(),
+    SizeBytes = sum(estimate_data_size(*)),
+    UniqueContainers = dcount(Name)
+by Computer
+| extend
+    SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0,
+    MonthlyAnalyticsCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30,
+    MigrationSavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 22.41 * 12,
+    BasicTierSavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 74.7 * 12,
+    TotalMigrationSavingsUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12
+| extend
+    MigrationRecommendation = case(
+        SizeGB > 2.0, strcat("HIGH PRIORITY: Migrate to ContainerLogV2 + Basic tier - Save $", round(SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12, 2), "/month"),
+        SizeGB > 0.5, strcat("MEDIUM: Migrate to ContainerLogV2 + Basic tier - Save $", round(SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12, 2), "/month"),
+        SizeGB > 0.0, strcat("LOW: Migrate when convenient - Save $", round(SizeBytes / 1024.0 / 1024.0 / 1024.0 * 97.11 * 12, 2), "/month"),
+        "No data to migrate"
+    )
+| order by SizeGB desc
+```
+
+**ContainerLog Source Analysis** - Identify noisy log sources for exclusion (using available columns only):
+```kql
+ContainerLog
+| where TimeGenerated > ago(2h)
+| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}"
+| summarize
+    LogCount = count(),
+    SizeBytes = sum(estimate_data_size(*)),
+    SampleLogEntries = make_list(LogEntry, 3)
+by LogEntrySource, Name
+| extend
+    SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0,
+    MonthlyCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30,
+    NoiseClassification = case(
+        LogEntrySource == "stdout" and (Name contains "nginx" or Name contains "apache"), "WEB_ACCESS_LOGS",
+        LogEntrySource == "stdout" and (Name contains "prometheus" or Name contains "grafana"), "MONITORING_LOGS",
+        LogEntrySource == "stdout" and Name contains "fluentd", "LOG_FORWARDING",
LogEntrySource == "stderr" and SizeGB > 1.0, "HIGH_VOLUME_ERRORS", + LogEntrySource == "stdout" and SizeGB > 2.0, "HIGH_VOLUME_STDOUT", + LogEntrySource == "stdout", "NORMAL_STDOUT", + LogEntrySource == "stderr", "NORMAL_STDERR", + "OTHER" + ), + FilteringRecommendation = case( + NoiseClassification == "WEB_ACCESS_LOGS", strcat("CONSIDER: Filter access logs - Potential $", round(MonthlyCostUSD * 0.8, 2), "/month savings"), + NoiseClassification == "MONITORING_LOGS", strcat("EVALUATE: Reduce monitoring verbosity - Potential $", round(MonthlyCostUSD * 0.6, 2), "/month savings"), + NoiseClassification == "LOG_FORWARDING", strcat("OPTIMIZE: Check log forwarding efficiency - Potential $", round(MonthlyCostUSD * 0.5, 2), "/month savings"), + SizeGB > 2.0, strcat("HIGH PRIORITY: Investigate high volume - Potential $", round(MonthlyCostUSD * 0.7, 2), "/month savings"), + "KEEP: Normal volume" + ) +| order by SizeGB desc +| take 20 +``` + +### 6. KubePodInventory Replacement Analysis + +**KubePodInventory Volume and Replacement Strategy** - Analyze inventory data for metadata collection replacement: +```kql +Usage +| where TimeGenerated > ago(2h) +| where DataType == "KubePodInventory" +| summarize + TotalGB = sum(Quantity) / 1024, + TwoHourCostUSD = sum(Quantity) / 1024 * 2.99, + DailyCostUSD = sum(Quantity) / 1024 * 2.99 * 12, // 12 x 2-hour periods per day + MonthlyCostUSD = sum(Quantity) / 1024 * 2.99 * 12 * 30, // Extrapolate to 30 days + MetadataCollectionSavingsUSD = sum(Quantity) / 1024 * 2.99 * 12 * 30 * 0.85 // 85% reduction with metadata collection +| extend + ReplacementStrategy = case( + TotalGB > 1.0, strcat("HIGH PRIORITY: Replace with metadata collection - Save $", round(MetadataCollectionSavingsUSD, 2), "/month"), + TotalGB > 0.1, strcat("MEDIUM: Replace with metadata collection - Save $", round(MetadataCollectionSavingsUSD, 2), "/month"), + TotalGB > 0.0, strcat("LOW: Replace when migrating to ContainerLogV2 - Save $", round(MetadataCollectionSavingsUSD, 2), "/month"), + "No KubePodInventory data found" + ) +| project TotalGB, DailyCostUSD, MonthlyCostUSD, MetadataCollectionSavingsUSD, ReplacementStrategy +``` + +**KubePodInventory Detail Analysis** - Understand current inventory data usage (using available columns only): +```kql +KubePodInventory +| where TimeGenerated > ago(2h) +| where ClusterName contains "{CLUSTER_RESOURCE_ID}" or _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| summarize + PodCount = dcount(PodUid), + NamespaceCount = dcount(Namespace), + RecordCount = count(), + UniqueLabels = dcount(tostring(PodLabel)) +by Namespace +| extend + MigrationPriority = case( + PodCount > 50, "HIGH: Large inventory overhead", + PodCount > 10, "MEDIUM: Moderate inventory overhead", + "LOW: Minimal inventory overhead" + ), + ReplacementStrategy = "Replace with ContainerLogV2 metadata collection for 85% cost reduction" +| order by PodCount desc +``` + +### 7. 
Container Volume Analysis (ContainerLogV2) + +**Container Volume Query** - Identify high-volume containers in ContainerLogV2: +```kql +ContainerLogV2 +| where TimeGenerated > ago(2h) +| where _ResourceId =~ "{CLUSTER_RESOURCE_ID}" +| summarize + LogCount = count(), + SizeBytes = sum(estimate_data_size(*)) +by ContainerName, PodNamespace +| extend + SizeGB = SizeBytes / 1024.0 / 1024.0 / 1024.0, + MonthlyCostUSD = SizeBytes / 1024.0 / 1024.0 / 1024.0 * 2.99 * 12 * 30 +| order by SizeGB desc +| take 20 +``` + +--- + +## ContainerLog v1 → ContainerLogV2 Migration Benefits + +### **Critical Migration Advantages** + +**ContainerLogV2 is the ONLY container logging table that supports Basic Logs tier conversion**, providing massive cost optimization opportunities not available with legacy ContainerLog v1. + +| **Benefit** | **ContainerLog (v1)** | **ContainerLogV2** | **Impact** | +|-------------|----------------------|-------------------|------------| +| **Basic Logs Tier Support** | ❌ Not supported | ✅ **Fully supported** | **83% cost reduction** | +| **Pricing Optimization** | Stuck at $2.99/GB | Can use $0.50/GB Basic tier | **$2.49/GB savings** | +| **KubePodInventory Replacement** | ❌ Requires separate collection | ✅ **Built-in metadata collection** | **85% inventory cost reduction** | +| **Pod Metadata** | Requires KubePodInventory | Integrated via metadata collection | **Eliminates redundant data** | +| **Namespace Filtering** | Basic exclusion only | Advanced namespace/pod filtering | **Granular cost control** | +| **Schema Efficiency** | Legacy format | Optimized schema | **25% storage efficiency** | +| **Query Performance** | Standard | Enhanced indexing | **Faster log searches** | + +### **Why ContainerLogV2 Migration is Essential** + +#### **1. Basic Logs Tier Eligibility** +- **EXCLUSIVE FEATURE**: Only ContainerLogV2 can be converted to Basic Logs tier ($0.50/GB vs $2.99/GB) +- **ContainerLog v1 limitation**: Forever locked at Analytics tier pricing +- **Cost impact**: 83% immediate savings on container log ingestion costs +- **No functionality loss**: Container debugging and troubleshooting remain fully functional + +#### **2. KubePodInventory Elimination** +- **Metadata integration**: ContainerLogV2 can collect pod labels, annotations, images directly +- **Redundancy removal**: Eliminates need for separate KubePodInventory table +- **Combined savings**: Container logs + inventory data in single optimized table +- **Metadata fields available**: podLabels, podAnnotations, podUid, image, imageID, imageRepo, imageTag + +#### **3. 
Architecture Simplification**
+```
+BEFORE (ContainerLog v1):
+├── ContainerLog table ($2.99/GB - Analytics only)
+├── KubePodInventory table ($2.99/GB - Separate collection)
+└── Limited filtering options
+
+AFTER (ContainerLogV2):
+├── ContainerLogV2 table ($0.50/GB - Basic tier eligible)
+├── Integrated metadata (replaces KubePodInventory)
+└── Advanced filtering capabilities
+```
+
+### **Migration Implementation Strategy**
+
+#### **Step 1: Enable ContainerLogV2 Schema**
+```yaml
+[log_collection_settings.schema]
+   containerlog_schema_version = "v2"
+```
+
+#### **Step 2: Enable Kubernetes Metadata Collection (Replaces KubePodInventory)**
+```yaml
+[log_collection_settings.metadata_collection]
+   enabled = true
+   include_fields = ["podLabels","podAnnotations","podUid","image","imageID","imageRepo","imageTag"]
+```
+
+**Alternative Kubernetes Metadata Setting:**
+```yaml
+[log_collection_settings.kubernetes_metadata]
+   enabled = true
+```
+
+#### **Step 3: Disable KubePodInventory Collection**
+```yaml
+[metric_collection_settings.collect_kube_pod_inventory]
+   enabled = false
+```
+
+#### **Step 4: Convert to Basic Logs Tier**
+```bash
+az monitor log-analytics workspace table update \
+  --resource-group <resource-group> \
+  --workspace-name <workspace-name> \
+  --name ContainerLogV2 \
+  --plan Basic
+```
+
+### **Cost Optimization Formula**
+
+**Legacy Architecture (ContainerLog v1):**
+```
+Monthly Cost = (Daily ContainerLog GB × $2.99 × 30) + (Daily KubePodInventory GB × $2.99 × 30)
+Example: (10GB/day × $89.70) + (2GB/day × $89.70) = $897.00 + $179.40 = $1,076.40/month
+```
+
+**Optimized Architecture (ContainerLogV2):**
+```
+Monthly Cost = (Daily ContainerLogV2 GB × $0.50 × 30) + (Eliminated KubePodInventory)
+Example: (10GB/day × $15.00) + $0 = $150.00/month
+Savings: $926.40/month (86.1% reduction)
+```
+
+### **💰 Example Cost Impact**
+- **Current Monthly Cost**: $1,076.40
+- **Optimized Monthly Cost**: $150.00
+- **Monthly Savings**: $926.40
+- **Annual Savings**: $11,116.80
+- **Cost Reduction**: 86.1%
+
+---
+
+## Log Analytics Pricing Tiers: Analytics vs Basic Logs
+
+### Tier Comparison Overview
+
+| **Feature** | **Analytics Logs** | **Basic Logs** | **Cost Impact** |
+|-------------|-------------------|----------------|-----------------|
+| **Pricing** | $2.99/GB | $0.50/GB | **83% savings** |
+| **Retention** | Up to 12 years | Max 30 days | Limited archival |
+| **Query Capabilities** | Full KQL support | Basic search only | Reduced analytics |
+| **Real-time Alerting** | ✅ Supported | ❌ Not supported | No live alerts |
+| **Workbooks/Dashboards** | ✅ Full support | ⚠️ Limited | Reduced visualization |
+| **Best Use Case** | Critical monitoring | Log archival/compliance | Different objectives |
+
+### ContainerLogV2 Basic Logs Assessment
+
+**Why ContainerLogV2 is Perfect for Basic Logs:**
+- **Primary use case**: Debugging and troubleshooting (doesn't need real-time alerting)
+- **Compliance/auditing**: 30-day retention often sufficient for container logs
+- **Alert source**: Real-time monitoring typically comes from metrics (Prometheus), not logs
+- **Query patterns**: Most container log queries are simple searches, not complex analytics
+- **Cost impact**: 83% cost reduction with minimal functional impact
+
+### Implementation: Enable Basic Logs for ContainerLogV2
+
+**Azure CLI Command:**
+```bash
+# Set ContainerLogV2 table to Basic Logs tier
+az monitor log-analytics workspace table update \
+  --resource-group <resource-group> \
+  --workspace-name <workspace-name> \
+  --name ContainerLogV2 \
+  --plan Basic
+```
+
+**Azure PowerShell:**
+```powershell
+# Alternative PowerShell command
+Update-AzOperationalInsightsTable `
+  -ResourceGroupName "<resource-group>" `
+  -WorkspaceName "<workspace-name>" `
+  -TableName "ContainerLogV2" `
+  -Plan "Basic"
+```
+
+**Verification:**
+```bash
+# Check the current table plan (Analytics vs Basic)
+az monitor log-analytics workspace table show \
+  --resource-group <resource-group> \
+  --workspace-name <workspace-name> \
+  --name ContainerLogV2 \
+  --query plan --output tsv
+```
+
+### Basic Logs Limitations & Considerations
+
+**⚠️ Important Limitations:**
+- **30-day retention maximum**: Cannot extend beyond 30 days
+- **No real-time alerting**: Cannot create alerts directly on Basic Logs data
+- **Limited KQL functions**: Some advanced analytics functions not supported
+- **Search-only queries**: Complex joins and aggregations may not work
+- **Workbook limitations**: Some visualizations may not function properly
+
+**✅ What Still Works:**
+- **Basic search queries**: Text searches, simple filtering
+- **Compliance/auditing**: Perfect for log retention and review
+- **Manual troubleshooting**: Searching logs for specific errors or events
+- **Cost optimization**: 83% cost reduction with minimal impact
+
+**🎯 Ideal Scenarios for Basic Logs:**
+- Container stdout/stderr logs (debugging)
+- Compliance and audit log retention
+- Historical log analysis (non-real-time)
+- High-volume, low-criticality log streams
+
+---
+
+## ContainerLogV2 Detailed Optimization Guide
+
+### Log Level Optimization Strategy
+
+**Debug/Trace Logs Elimination:**
+- **Impact**: Debug logs often represent 40-60% of total volume
+- **Recommendation**: Filter out DEBUG and TRACE levels entirely
+- **Implementation**: Application-level configuration + ConfigMap filtering
+- **Savings**: Up to 60% volume reduction
+
+**Info Log Selective Filtering:**
+- **Strategy**: Keep critical INFO logs, filter verbose INFO messages
+- **Patterns to filter**: Health checks, routine status messages, request logging
+- **Implementation**: Regex-based filtering in ConfigMap
+- **Savings**: 20-40% additional volume reduction
+
+### Namespace-Based Optimization
+
+**Data-Driven Namespace Analysis:**
+
+Execute the namespace analysis queries to identify high-volume namespaces in your specific cluster. Common patterns may include:
+
+- **System namespaces**: Often generate significant operational logs
+- **Application namespaces**: Vary widely by workload characteristics
+- **Monitoring namespaces**: Can be high-volume depending on configuration
+- **Development namespaces**: May have different optimization priorities
+
+**Analysis-Driven Strategy:**
+- **High-volume namespaces**: Consider Basic Logs tier (83% savings)
+- **Critical application namespaces**: Keep Analytics tier, implement log level filtering
+- **Development/staging namespaces**: Evaluate filtering or Basic Logs based on requirements
+- **Low-volume namespaces**: May not require optimization
+
+### Container-Specific Optimization
+
+**High-Volume Container Patterns:**
+- **Ingress controllers**: Often generate 30-50% of cluster logs
+- **Service mesh sidecars**: Istio/Linkerd proxies generate significant volume
+- **Monitoring agents**: Ironically, monitoring tools can be very verbose
+- **CI/CD pods**: Build and deployment containers with extensive logging
+
+**Optimization Approaches:**
+1. **Selective logging**: Configure applications to reduce log verbosity
+2. **Namespace exclusion**: Move high-volume, low-criticality workloads
+3. **Container-specific rules**: Target individual high-volume containers
+4.
**Sampling**: Implement log sampling for high-volume patterns + +### Advanced ConfigMap Configuration + +**Comprehensive Optimization ConfigMap with Metadata Collection:** +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: container-azm-ms-agentconfig + namespace: kube-system +data: + schema-version: v1 + config-version: ver1 + log-data-collection-settings: |- + [log_collection_settings] + [log_collection_settings.stdout] + enabled = true + # Configure namespace exclusions based on your analysis results + # exclude_namespaces = ["namespace1", "namespace2"] + [log_collection_settings.stderr] + enabled = true + # Configure namespace exclusions based on your analysis results + # exclude_namespaces = ["namespace1", "namespace2"] + [log_collection_settings.schema] + containerlog_schema_version = "v2" + [log_collection_settings.filtering] + # Configure log level filtering based on your analysis results + # exclude_log_levels = ["DEBUG", "TRACE"] + # Configure regex filtering based on your analysis results + # exclude_regex_patterns = [".*pattern1.*", ".*pattern2.*"] + [log_collection_settings.metadata_collection] + # Enable metadata collection to replace KubePodInventory + enabled = true + include_fields = ["podLabels","podAnnotations","podUid","image","imageID","imageRepo","imageTag"] + [metric_collection_settings] + [metric_collection_settings.collect_kube_system_metrics] + enabled = false + [metric_collection_settings.collect_kube_pod_inventory] + # Disable KubePodInventory when using metadata collection + enabled = false +``` + +**Legacy ContainerLog (v1) Filtering ConfigMap:** +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: container-azm-ms-agentconfig + namespace: kube-system +data: + schema-version: v1 + config-version: ver1 + log-data-collection-settings: |- + [log_collection_settings] + [log_collection_settings.stdout] + enabled = true + # Configure namespace exclusions based on your analysis results + # exclude_namespaces = ["namespace1", "namespace2"] + # Configure container exclusions based on your analysis results + # exclude_containers = ["container1", "container2"] + # Configure regex filtering based on your analysis results + # exclude_regex_patterns = [".*pattern1.*", ".*pattern2.*"] + [log_collection_settings.stderr] + enabled = true + # Configure namespace exclusions based on your analysis results + # exclude_namespaces = ["namespace1", "namespace2"] + [log_collection_settings.schema] + # Migrate to v2 schema for better efficiency + containerlog_schema_version = "v2" + [metric_collection_settings] + [metric_collection_settings.collect_kube_system_metrics] + enabled = false +``` + +### Cost Calculation Formulas + +**Analytics Tier Cost:** +``` +Monthly Cost = Daily Volume (GB) × $2.99 × 30 days +``` + +**Basic Logs Tier Cost:** +``` +Monthly Cost = Daily Volume (GB) × $0.50 × 30 days +Savings = Analytics Cost - Basic Cost = Daily Volume × $2.49 × 30 +Savings Percentage = 83.3% +``` + +**Combined Optimization Impact:** +``` +Total Savings = Namespace Filtering (30-50%) + Basic Logs (83% of remaining) + Log Level Filtering (20-40% additional) +Potential Total Reduction = 85-95% of original Analytics tier cost +``` + +--- + +## Cost Optimization Opportunities + +### Primary Optimization Targets + +| **Data Type** | **Typical Cost Impact** | **Optimization Strategy** | **Expected Savings** | +|---------------|------------------------|---------------------------|---------------------| +| **Perf, InsightsMetrics** | High ($50-200/month) | Migrate to Managed 
Prometheus | 80% cost reduction |
+| **ContainerLog** | Medium ($20-100/month) | Migrate to ContainerLogV2 | 20-30% reduction |
+| **KubePodInventory** | Medium ($10-50/month) | Eliminate with ContainerLogV2 | 100% elimination |
+| **System Namespaces** | Variable ($5-50/month) | ConfigMap exclusion | 100% elimination |
+| **Debug/Info Logs** | Variable ($10-30/month) | Log level filtering | 50-100% reduction |
+
+### Azure Monitor Metrics (Managed Prometheus) Conditional Recommendations
+
+**CRITICAL CHECK**: Use azuremonitormetrics toolset to detect current Azure Monitor Metrics status.
+
+**IF Azure Monitor Metrics is ALREADY ENABLED:**
+- ✅ **CONFIGURATION OPTIMAL**: Managed Prometheus is already enabled
+- ✅ **Metrics Cost Optimized**: Already achieving 80% cost reduction vs metrics-as-logs
+- ✅ **No Action Required**: Skip Prometheus enablement recommendation
+- 📊 **Focus on**: Log optimization opportunities (Basic Logs, filtering, etc.)
+
+**IF Azure Monitor Metrics is NOT ENABLED:**
+- 🚨 **HIGH PRIORITY RECOMMENDATION**: Enable Azure Monitor Metrics (Managed Prometheus)
+- 💰 **Immediate benefit**: 80% cost reduction for all performance metrics
+- 🔄 **Replace**: Perf and InsightsMetrics data streams with efficient Prometheus metrics
+- 💲 **Cost impact**: Can save $40-160/month for typical clusters based on current Perf/InsightsMetrics volume
+- ⏱️ **Implementation**: 5-minute setup via Azure portal or CLI
+
+**Enable Azure Monitor Metrics Command (only if not already enabled):**
+```bash
+az aks update --resource-group <resource-group> --name <cluster-name> --enable-azure-monitor-metrics
+```
+
+**Verification**: Check azuremonitormetrics toolset confirms successful enablement.
+
+### Architecture Optimization Strategy
+
+**Optimal Configuration:**
+- ✅ Azure Monitor Metrics (Managed Prometheus): ALL performance metrics
+- ✅ ContainerLogV2: Container stdout/stderr logs + Kubernetes events
+- ❌ Remove: Perf, InsightsMetrics, KubePodInventory tables
+- ❌ Filter: High-volume namespaces based on analysis results
+
+---
+
+## Competitive Cost Analysis
+
+### Post-Optimization Pricing Comparison
+
+Calculate costs using optimized volume and compare against competitors. List prices vary by region and commitment tier; the ratios below are computed from the per-GB rates shown:
+
+| **Platform** | **Pricing Model** | **Calculation** |
+|--------------|------------------|-----------------|
+| **Azure Monitor (Optimized)** | $0.50/GB Basic Logs + Prometheus | Baseline cost |
+| **Amazon CloudWatch** | $0.50/GB Logs + Metrics | Comparable per-GB logs rate |
+| **Google Cloud Logging** | $0.50/GB + Monitoring | Comparable per-GB logs rate |
+| **Datadog** | $1.27/GB + Infrastructure | ~2.5x higher per-GB logs rate |
+| **New Relic** | $0.30/GB + Platform | ~40% lower per-GB rate; platform fees extra |
+| **Oracle OCI** | $0.30/GB + Monitoring | ~40% lower per-GB rate; monitoring fees extra |
+
+---
+
+## Implementation Recommendations
+
+### Cost-Optimized ConfigMap Template
+
+Deploy this ConfigMap to implement filtering and schema optimization based on your analysis results:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: container-azm-ms-agentconfig
+  namespace: kube-system
+data:
+  schema-version: v1
+  config-version: ver1
+  log-data-collection-settings: |-
+    [log_collection_settings]
+       [log_collection_settings.stdout]
+          enabled = true
+          # Configure namespace exclusions based on your analysis results
+          # exclude_namespaces = ["namespace1", "namespace2"]
+       [log_collection_settings.stderr]
+          enabled = true
+          # Configure namespace exclusions based on your analysis results
+          # exclude_namespaces = ["namespace1", "namespace2"]
+       [log_collection_settings.schema]
+          containerlog_schema_version = "v2"
[metric_collection_settings] + [metric_collection_settings.collect_kube_system_metrics] + enabled = false +``` + +### Implementation Priority + +| **Priority** | **Action** | **Expected Savings** | **Implementation Time** | +|--------------|------------|---------------------|------------------------| +| **P1** | Enable Azure Monitor Metrics | 80% metrics cost | 5 minutes | +| **P2** | **Enable Basic Logs for ContainerLogV2** | **83% log tier cost** | **2 minutes** | +| **P3** | Deploy ConfigMap optimization | 30-50% log volume | 10 minutes | +| **P4** | Monitor and validate | Confirm savings | 24-48 hours | + +--- + +## Executive Report Template + +### Analysis Results Format + +Present findings using this structure: + +``` +🤖 AI-GENERATED AZURE MONITOR LOGS COST OPTIMIZATION REPORT + +## 📊 COST OVERVIEW +| Metric | Current | Optimized | Savings | Status | +|--------|---------|-----------|---------|--------| +| Monthly Volume | {current_GB} GB | {optimized_GB} GB | {savings_GB} GB | High Volume | +| Monthly Cost | ${current_cost} USD | ${optimized_cost} USD | ${savings_cost} USD | Expensive | + +## 🏆 COMPETITIVE ANALYSIS (After Optimization) +| Platform | Monthly Cost | vs Azure | Cost Difference | +|----------|--------------|----------|-----------------| +| Azure Monitor (Optimized) | ${azure_cost} | - | Baseline | +| Amazon CloudWatch | ${aws_cost} | +{aws_percent}% | +${aws_diff} | +| Google Cloud Logging | ${gcp_cost} | +{gcp_percent}% | +${gcp_diff} | +| Datadog | ${dd_cost} | +{dd_percent}% | +${dd_diff} | + +## 💡 ACTION ITEMS & SAVINGS BREAKDOWN + +### Immediate Actions (High Impact) +| Priority | Action | Monthly Savings | Implementation | Status | +|----------|--------|----------------|---------------|--------| +| P1 | Enable Azure Monitor Metrics (if not enabled) | $${metrics_savings} | 5 minutes | ${prometheus_status} | +| P2 | Enable Basic Logs for ContainerLogV2 | $${basic_logs_savings} | 2 minutes | Recommended | +| P3 | Deploy namespace filtering ConfigMap | $${namespace_filtering_savings} | 10 minutes | Recommended | +| P4 | Implement log level filtering | $${log_level_savings} | 15 minutes | Optional | + +### Detailed Action Items + +**1. Azure Monitor Metrics Optimization** +- **Current Status**: ${prometheus_current_status} +- **Action Required**: ${prometheus_action_required} +- **Expected Savings**: $${metrics_savings}/month (${metrics_savings_percentage}% reduction) +- **Implementation**: ${prometheus_implementation_command} + +**2. Basic Logs Tier Migration** +- **Current Tier**: Analytics ($2.99/GB) +- **Recommended Tier**: Basic ($0.50/GB) +- **Expected Savings**: $${basic_logs_savings}/month (83% tier cost reduction) +- **Volume Impact**: ${containerlogv2_volume_gb} GB/month affected +- **Implementation**: `az monitor log-analytics workspace table update --name ContainerLogV2 --plan Basic` + +**⚠️ Important Basic Logs Limitations:** +- **30-day retention maximum**: Cannot extend beyond 30 days (vs up to 12 years in Analytics) +- **No real-time alerting**: Cannot create alerts directly on Basic Logs data +- **Limited KQL functions**: Some advanced analytics functions not supported +- **Search-only queries**: Complex joins and aggregations may not work +- **Workbook limitations**: Some visualizations may not function properly +- **Best for**: Container debugging, compliance logs, historical analysis (non-real-time) +- **Not suitable for**: Real-time monitoring, long-term data retention, complex analytics + +**3. 
Namespace Filtering Optimization** + +**EXECUTE NAMESPACE ANALYSIS IMMEDIATELY**: When generating this report, you MUST execute the "Namespace ConfigMap Generator" query from Section 3 and provide actual namespace-by-namespace results, NOT placeholder text. + +**Namespace Analysis Results**: Based on executed ContainerLogV2 namespace queries, provide detailed breakdown: + +**Volume & Cost by Namespace:** +- Display results from "Comprehensive Namespace Query" showing each namespace's volume, cost, and filtering recommendations +- Include specific GB amounts, USD costs, and percentage of total cluster logs per namespace +- Show the actual FilteringRecommendation field results for each namespace + +**ConfigMap Implementation Guidance:** +- Display results from "Namespace ConfigMap Generator" query showing specific namespaces to filter +- Provide exact namespace names and their FilterCategory (SYSTEM, MONITORING, LOGGING, etc.) +- Include specific dollar savings amounts per namespace from the ConfigMapGuidance field + +**Sample ConfigMap with Real Data:** +Generate dynamic ConfigMap based on actual analysis results: +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: container-azm-ms-agentconfig + namespace: kube-system +data: + schema-version: v1 + config-version: ver1 + log-data-collection-settings: |- + [log_collection_settings] + [log_collection_settings.stdout] + enabled = true + # Based on analysis results - add high-volume namespaces here: + # exclude_namespaces = [ACTUAL_NAMESPACE_LIST_FROM_ANALYSIS] + [log_collection_settings.stderr] + enabled = true + # exclude_namespaces = [ACTUAL_NAMESPACE_LIST_FROM_ANALYSIS] + [log_collection_settings.schema] + containerlog_schema_version = "v2" +``` + +**MANDATORY EXECUTION**: Do NOT use generic text like "No high-volume namespaces detected" - ALWAYS execute the namespace analysis queries and provide actual cluster-specific results with real namespace names, volumes, and costs. + +**4. 
Log Level Filtering (Conditional)** +- **Debug/Trace Volume**: ${debug_trace_volume_gb} GB/month (${debug_trace_status}) +- **Info Log Volume**: ${info_volume_gb} GB/month (${info_volume_status}) +- **Log Structure Assessment**: ${log_structure_feasibility} +- **Expected Savings**: $${log_level_savings}/month +- **Implementation**: ${log_level_implementation_strategy} +- **Alternative**: If structured levels unavailable, focus on application-level logging configuration + +### Total Optimization Impact +- **Current Monthly Cost**: $${current_total_cost} +- **Optimized Monthly Cost**: $${optimized_total_cost} +- **Total Monthly Savings**: $${total_savings} (${total_savings_percentage}% reduction) +- **Annual Savings**: $${annual_savings} + +### Implementation Timeline +- **Week 1**: Enable Azure Monitor Metrics + Basic Logs (P1, P2) +- **Week 2**: Deploy namespace filtering ConfigMap (P3) +- **Week 3**: Implement log level filtering (P4) +- **Week 4**: Monitor and validate savings +``` + +--- + +## Data Sources and Documentation + +### Required Tools Integration +- **azuremonitorlogs**: Workspace and cluster resource ID detection +- **azuremonitormetrics**: Prometheus configuration detection +- **azure mcp server**: KQL query execution against Log Analytics +- **kubectl**: ConfigMap analysis and deployment + +### Reference Documentation +- Container Insights Configuration: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-data-collection-configure +- Azure Monitor Metrics: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/prometheus-metrics-enable +- Basic Logs Tier: https://learn.microsoft.com/en-us/azure/azure-monitor/logs/logs-table-plans + +--- + +## Analysis Disclaimers + +**🤖 AI-Generated Analysis**: This cost optimization analysis is generated by HolmesGPT AI. All recommendations should be independently verified by Azure specialists before implementation. + +**📊 Data Limitations**: Analysis based on 24-hour samples extrapolated to monthly estimates using generic Azure Monitor pricing. Actual costs may vary by region and enterprise agreements. + +**⚠️ Verification Required**: Test all changes in non-production environments first. Monitor actual costs after implementation to validate estimates. diff --git a/holmes/plugins/runbooks/azuremonitor-alerts.yaml b/holmes/plugins/runbooks/azuremonitor-alerts.yaml new file mode 100644 index 000000000..5ba84ebc7 --- /dev/null +++ b/holmes/plugins/runbooks/azuremonitor-alerts.yaml @@ -0,0 +1,151 @@ +# Built-in runbooks for Azure Monitor alerts +# These runbooks provide systematic diagnostic guidance for Azure Monitor Prometheus metric alerts +runbooks: + # Generic diagnostic runbook for all Azure Monitor alerts + - match: + source: "azuremonitoralerts" + instructions: > + Perform comprehensive diagnostic analysis for this Azure Monitor alert using a systematic approach: + + 1. ALERT CONTEXT ANALYSIS: + - Extract and analyze the alert details: metric name, query, threshold, severity, and current state + - Identify the timeframe when the alert fired and duration + - Determine the affected resources (pods, nodes, services, namespaces) from the alert + - Understand what the alert is measuring and why it triggered + + 2. 
CURRENT STATE ASSESSMENT: + - Use kubectl commands to check the current status of affected resources + - Query current values of the alert metric using Azure Monitor Prometheus queries + - Compare current values with the alert threshold to see if issue persists + - Check if the alert is still active or has resolved + + 3. RESOURCE INVESTIGATION: + - Examine the health and status of affected pods, nodes, or services + - Check resource requests, limits, and actual utilization + - Look for recent changes in replica counts, node status, or resource allocation + - Identify any resource constraints or scheduling issues + + 4. METRIC CORRELATION AND TRENDS: + - Query related Azure Monitor Prometheus metrics around the alert timeframe + - Analyze trends for the last 1-2 hours to understand the pattern leading to the alert + - Correlate with other important metrics (CPU, memory, network, disk) to find relationships + - Look for sudden spikes, gradual increases, or cyclical patterns + + 5. EVENT TIMELINE ANALYSIS: + - Check Kubernetes events around the alert firing time using kubectl + - Look for recent deployments, pod restarts, scaling events, or configuration changes + - Correlate timing of events with the alert onset to identify potential triggers + - Check for any failed operations or warning events + + 6. LOG ANALYSIS: + - Examine logs from affected pods and containers for error messages or warnings + - Look for application-specific errors, performance issues, or resource exhaustion messages + - Check system logs if the alert is infrastructure-related (node issues, etc.) + - Search for patterns that coincide with the alert timing + + 7. DEPENDENCY AND SERVICE ANALYSIS: + - If alert affects application pods, check dependent services and databases + - Verify network connectivity and service discovery functionality + - Check ingress controllers, load balancers, or external dependencies + - Analyze service mesh metrics if applicable + + 8. ROOT CAUSE HYPOTHESIS: + - Based on metrics, events, logs, and resource analysis, form clear hypotheses about the root cause + - Prioritize the most likely causes based on evidence strength + - Explain the chain of events that led to the alert condition + - Distinguish between symptoms and actual root causes + + 9. IMPACT ASSESSMENT: + - Determine what users or services are affected by this alert condition + - Assess the severity and scope of the impact + - Check if there are cascading effects on other systems or services + - Evaluate business impact if applicable + + 10. REMEDIATION RECOMMENDATIONS: + - Suggest immediate actions to resolve the alert condition if appropriate + - Recommend monitoring steps to verify resolution + - Propose preventive measures to avoid recurrence + - Identify any configuration changes or scaling actions needed + + Use available toolsets systematically: Azure Monitor Metrics for querying, Kubernetes for resource analysis, and Bash for kubectl commands. Present findings clearly with supporting data and specific next steps. + + # Specialized runbook for high CPU usage alerts + - match: + source: "azuremonitoralerts" + issue_name: ".*[Hh]igh [Cc][Pp][Uu].*" + instructions: > + This is a high CPU usage alert. Focus your diagnostic analysis on: + + 1. CPU-SPECIFIC ANALYSIS: + - Query CPU usage trends using container_cpu_usage_seconds_total and rate() functions + - Identify which specific pods/containers are consuming the most CPU + - Check CPU requests and limits vs actual usage + - Analyze CPU throttling metrics if available + + 2. 
APPLICATION PERFORMANCE: + - Look for application logs indicating performance issues or increased load + - Check for recent deployments that might have introduced performance regressions + - Analyze request rates and response times if this is a web application + - Look for resource-intensive operations or batch jobs + + 3. SCALING AND CAPACITY: + - Check if horizontal or vertical scaling is needed + - Analyze historical CPU patterns to determine if this is normal load growth + - Verify auto-scaling configuration and behavior + - Assess node capacity and CPU availability + + Follow the standard diagnostic steps but emphasize CPU-related metrics and analysis. + + # Specialized runbook for memory-related alerts + - match: + source: "azuremonitoralerts" + issue_name: ".*[Mm]emory.*" + instructions: > + This is a memory-related alert. Focus your diagnostic analysis on: + + 1. MEMORY-SPECIFIC ANALYSIS: + - Query memory usage using container_memory_working_set_bytes and related metrics + - Check for memory leaks by analyzing memory usage trends over time + - Examine memory requests and limits vs actual usage + - Look for Out of Memory (OOM) kills in events and logs + + 2. APPLICATION MEMORY BEHAVIOR: + - Check application logs for memory-related errors or warnings + - Look for garbage collection issues in managed runtime applications (Java, .NET) + - Analyze heap dumps or memory profiles if available + - Check for inefficient memory usage patterns + + 3. SYSTEM IMPACT: + - Verify node memory availability and pressure conditions + - Check if memory pressure is affecting other pods on the same node + - Look for swap usage if applicable + - Assess overall cluster memory capacity + + Follow the standard diagnostic steps but emphasize memory-related metrics and analysis. + + # Specialized runbook for pod waiting state alerts + - match: + source: "azuremonitoralerts" + issue_name: ".*[Pp]od.*[Ww]aiting.*" + instructions: > + This alert indicates pods are in a waiting state. Focus your analysis on: + + 1. POD STATE ANALYSIS: + - Check pod status and container states using kubectl describe + - Identify the specific waiting reason (ImagePullBackOff, CrashLoopBackOff, etc.) + - Examine pod events for scheduling or startup issues + - Check init containers if they exist + + 2. RESOURCE AND SCHEDULING: + - Verify node capacity and resource availability for scheduling + - Check resource requests vs available cluster capacity + - Look for node selectors, affinity rules, or taints preventing scheduling + - Examine persistent volume claims if storage is involved + + 3. IMAGE AND CONFIGURATION: + - Verify image availability and registry connectivity + - Check image pull secrets and registry authentication + - Validate container configuration and environment variables + - Look for configuration map or secret mounting issues + + Follow the standard diagnostic steps but emphasize pod lifecycle and scheduling analysis. 
diff --git a/holmes/plugins/runbooks/catalog.json b/holmes/plugins/runbooks/catalog.json index 16a339fba..b2837cd41 100644 --- a/holmes/plugins/runbooks/catalog.json +++ b/holmes/plugins/runbooks/catalog.json @@ -9,6 +9,26 @@ "update_date": "2025-07-08", "description": "Runbook to troubleshoot upgrade issues in Azure Kubernetes Service clusters", "link": "upgrade/upgrade_troubleshooting_instructions.md" + }, + { + "update_date": "2025-07-16", + "description": "Runbook to troubleshoot and investigate DNS resolution issue on Kubernetes cluster using Azure Monitor and Prometheus", + "link": "networking/dns_troubleshooting_prometheus_instructions.md" + }, + { + "update_date": "2025-07-16", + "description": "Runbook to troubleshoot and investigate node issues on Kubernetes cluster using Azure Monitor and Prometheus", + "link": "node_troubleshooting_prometheus_instructions.md" + }, + { + "update_date": "2025-07-16", + "description": "Runbook to troubleshoot and investigate pod issues on Kubernetes cluster using Azure Monitor and Prometheus", + "link": "pod_troubleshooting_prometheus_instructions.md" + }, + { + "update_date": "2025-01-19", + "description": "Comprehensive runbook to analyze Azure Monitor Container Insights log volume and costs for AKS clusters, providing detailed cost optimization recommendations with volume and USD savings calculations, competitive analysis, and implementation guidance", + "link": "azure_monitor_logs_cost_optimization.md" } ] } diff --git a/holmes/plugins/runbooks/networking/dns_troubleshooting_prometheus_instructions.md b/holmes/plugins/runbooks/networking/dns_troubleshooting_prometheus_instructions.md new file mode 100644 index 000000000..c40ffae48 --- /dev/null +++ b/holmes/plugins/runbooks/networking/dns_troubleshooting_prometheus_instructions.md @@ -0,0 +1,86 @@ +# DNS Resolution Failures – Troubleshooting Runbook (Kubernetes) + +## Goal + +Diagnose and remediate DNS resolution failures within a Kubernetes cluster. Focus on identifying misconfigurations, upstream resolver issues, or blocked traffic leading to DNS errors (e.g., `SERVFAIL`, `NXDOMAIN`, `REFUSED`, `NotImplemented`, `DNSSEC/EDNS` errors). + +## Workflow + +### 1. **Detect and Quantify DNS Failures** + +* Use Prometheus metrics from the AzureMonitorMetrics toolset. 
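As a starting point, the sketch below shows one way to surface failing lookups. This is a minimal example, not the exact query the toolset runs: it assumes the Hubble DNS metrics tabulated in the next section are being scraped, and that successful responses carry the `NOERROR` rcode (confirm both against your environment).

```promql
# Per-source rate of DNS responses with a non-success rcode over 5 minutes.
sum by (source, rcode) (
  rate(hubble_dns_responses_total{rcode!="NOERROR"}[5m])
)

# Fraction of all DNS responses that are errors (useful as an alert signal).
sum(rate(hubble_dns_responses_total{rcode!="NOERROR"}[5m]))
  / sum(rate(hubble_dns_responses_total[5m]))
```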
**PromQL Queries**

Node level metrics

| Metric name | Description | Extra labels |
|--------------------------------|------------------------------|------------------------|
| **cilium_forward_count_total** | Total forwarded packet count | `direction` |
| **cilium_forward_bytes_total** | Total forwarded byte count | `direction` |
| **cilium_drop_count_total** | Total dropped packet count | `direction`, `reason` |
| **cilium_drop_bytes_total** | Total dropped byte count | `direction`, `reason` |

Pod level metrics

| Metric name | Description | Extra Labels |
|----------------------------------|-----------------------------------------------|-------------------------------------------------------------------------------------|
| **hubble_dns_queries_total** | Total DNS requests by query | `source` or `destination`, `query`, `qtypes` (query type) |
| **hubble_dns_responses_total** | Total DNS responses by query/response | `source` or `destination`, `query`, `qtypes` (query type), `rcode`, `ips_returned` |
| **hubble_drop_total** | Total dropped packet count | `source` or `destination`, `protocol`, `reason` |
| **hubble_tcp_flags_total** | Total TCP packet count by flag | `source` or `destination`, `flag` |
| **hubble_flows_processed_total** | Total network flows processed (L4/L7 traffic) | `source` or `destination`, `protocol`, `verdict`, `type`, `subtype` |

Query these metrics to determine whether error counts are spiking and whether the errors are associated with a particular node or pod.

---

## Synthesize Findings

Use the combination of metrics and logs to state findings clearly:

> **"Pods in the `X` namespace are experiencing NXDOMAIN errors due to misconfigured `nameserver` entries in `/etc/resolv.conf`."**

> **"CoreDNS is returning `SERVFAIL` for upstream lookups; logs show timeout errors, likely due to unreachable Azure DNS servers."**

> **"High latency and a spike in `REFUSED` errors from debug pods in `team-a-ns`, coinciding with recent NetworkPolicy changes."**

---

## Remediation Actions

### Immediate Fixes

| Symptom | Action |
| ------------------------ | ---------------------------------------------------------------------------------------------------------- |
| CoreDNS plugin/misconfig | Revert the CoreDNS ConfigMap change, then restart: `kubectl -n kube-system rollout restart deploy coredns` |
| CoreDNS crashloop | Check logs → restart the pod → scale replicas |
| NetworkPolicy blocks | Inspect and patch rules to allow DNS (port 53 UDP/TCP) |
| Upstream DNS unreachable | Update the `forward` plugin in CoreDNS to a fallback DNS (e.g., 8.8.8.8) |
| DNS saturation | Scale CoreDNS, reduce noisy traffic |
| NXDOMAIN/typos | Check domain spelling and app DNS caching behavior |

---

### Mid-Range Mitigation

| Risk Area | Recommended Action |
| -------------------- | --------------------------------------------------------- |
| CoreDNS config drift | Use GitOps, `kubectl diff`, or webhook validation |
| Low observability | Enable alerts on error spikes (rcode ≠ 0) |
| Noisy apps | Rate-limit retries in app logic |
| No ownership | Tag CoreDNS with an owner; set alerts for team escalation |

---

## Prevention and Best Practices

| Domain | Strategy |
| -------------------- | ------------------------------------------------------------------ |
| **Testing** | Add DNS checks in CI smoke tests |
| **Alerting** | Alert on high `rcode ≠ 0` error rate and p95 latency |
| **Runbooks** | Document common failures and resolution queries |
| **Automation** | Auto-notify teams on risky changes via Logic Apps/MCP |
| **Infra Resilience** | 
Deploy multiple DNS replicas, validate cross-zone DNS reachability | diff --git a/holmes/plugins/runbooks/node_troubleshooting_prometheus_instructions.md b/holmes/plugins/runbooks/node_troubleshooting_prometheus_instructions.md new file mode 100644 index 000000000..e59353373 --- /dev/null +++ b/holmes/plugins/runbooks/node_troubleshooting_prometheus_instructions.md @@ -0,0 +1,108 @@ +# Node Not Ready – Troubleshooting Runbook (Kubernetes) + +## Goal + +Diagnose and remediate scenarios where one or more Kubernetes nodes report as `NotReady` or `Unknown`, resulting in pod scheduling failures and application downtime. + +--- + +## Workflow + +### 1. **Detect Node Failures** + +* Use Prometheus metrics and alerts from the AzureMonitorMetrics toolset. +* Typical symptoms include: + * Node in `NotReady` or `Unknown` status + * Unschedulable pods + * Firing alerts such as `KubeNodeUnreachable` + +**Core signals** + +| **Metric** | **Use** | **Extra Labels** | +|---------------------------------------------------------|-------------------------|----------------------------------| +| **kube_node_status_condition** | Detect NotReady status | `node`, `condition`, `status` | +| **kube_node_status_condition** | Detect Unknown status | `node`, `condition`, `status` | +| **container_memory_working_set_bytes** | Spot OOM conditions | `container`, `pod`, `namespace` | +| **node_filesystem_usage / node_filesystem_free_bytes** | Detect disk pressure | `device`, `mountpoint`, `fstype` | +| **node_disk_inode_utilization** | Detect inode pressure | `device`, `instance`, `job` | +| **container_cpu_usage_seconds_total** | Diagnose CPU starvation | `container`, `pod`, `namespace` | + + +--- + +### 2. **Contextual Signals and Metadata** + +Use metadata from kube-state-metrics and node-exporter to understand node specs and conditions. 
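For example, a couple of hedged starting queries, assuming the standard kube-state-metrics and node-exporter series tabulated below are being scraped (label values follow kube-state-metrics conventions):

```promql
# Nodes whose Ready condition is currently false or unknown.
kube_node_status_condition{condition="Ready", status=~"false|unknown"} == 1

# Nodes with less than 10% of memory available (node-exporter);
# map instance to node using whatever relabeling your setup applies.
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
```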
+ +**Kube-State Metrics** + +| **Metric name** | **Description** | **Extra Labels** | +|--------------------------------|----------------------------|------------------------------------------| +| **kube_node_status_capacity** | Node resource capacity | `node`, `resource`, `unit` | +| **kube_node_status_condition** | Node status conditions | `node`, `condition`, `status` | +| **kube_node_status_allocatable** | Allocatable node resources | `node`, `resource`, `unit` | +| **kube_node_info** | Node OS/kernel/arch info | `node`, `architecture`, `kernel_version` | +| **kube_node_spec_taint** | Node taints | `node`, `key`, `effect` | + + +**Node Exporter Metrics** + +| **Metric name** | **Description** | **Extra Labels** | +|---------------------------------------|------------------------------------|--------------------------------------| +| **node_cpu_seconds_total** | CPU usage by mode and core | `cpu`, `mode`, `instance`, `job` | +| **node_memory_MemAvailable_bytes** | Available system memory | `instance`, `job` | +| **node_memory_Cached_bytes** | Cached memory | `instance`, `job` | +| **node_memory_MemFree_bytes** | Free memory | `instance`, `job` | +| **node_memory_Slab_bytes** | Slab allocator usage | `instance`, `job` | +| **node_memory_MemTotal_bytes** | Total system memory | `instance`, `job` | +| **node_netstat_Tcp_RetransSegs** | TCP retransmissions | `instance`, `job` | +| **node_load1 / load5 / load15** | System load averages | `instance`, `job` | +| **node_disk_read_bytes_total** | Disk read throughput | `device`, `instance`, `job` | +| **node_disk_written_bytes_total** | Disk write throughput | `device`, `instance`, `job` | +| **node_disk_io_time_seconds_total** | Disk I/O wait time | `device`, `instance`, `job` | +| **node_filesystem_size_bytes** | Total filesystem capacity | `device`, `mountpoint`, `fstype` | +| **node_filesystem_avail_bytes** | Available filesystem capacity | `device`, `mountpoint`, `fstype` | +| **node_filesystem_readonly** | Read-only filesystem flags | `device`, `mountpoint`, `fstype` | +| **node_network_receive_bytes_total** | Network ingress throughput | `device`, `instance`, `job` | +| **node_network_transmit_bytes_total** | Network egress throughput | `device`, `instance`, `job` | +| **node_network_receive_drop_total** | Dropped inbound packets | `device`, `instance`, `job` | +| **node_network_transmit_drop_total** | Dropped outbound packets | `device`, `instance`, `job` | +| **node_vmstat_pgmajfault** | Major page faults | `instance`, `job` | +| **node_exporter_build_info** | Exporter build/version metadata | `version`, `instance`, `job` | +| **node_time_seconds** | System time | `instance`, `job` | +| **node_uname_info** | Host system name and version info | `nodename`, `machine`, `release` | + +--- + +## Synthesize Findings + +> **"Node `aks-nodepool1-xyz` is reporting `NotReady` due to memory pressure."** + +> **"Multiple nodes exhibit inode saturation, impacting pod scheduling."** + +> **"CNI and containerd processes are consuming high CPU; node capacity is insufficient for current workload."** + +--- + +## Remediation Actions + +### Immediate Fixes + +| Symptom | Remediation | +| ------------------------------ | ---------------------------------------------------------------------------------------- | +| Disk pressure on `/` or `/var` | SSH/DaemonSet cleanup, check for noisy logging pods, resize disks or enable log rotation | +| Memory pressure / kubelet OOM | Reduce pod memory requests, taint & evict, restart kubelet or reimage VMSS instance | +| 
High CPU usage | Identify and tune sidecars/CNIs, upgrade the VM size, enable autoscaling |
| Network unreachable | Validate NSG/UDR, verify MTU/routing for the CNI, reimage or force an update via VMSS |

---

## Prevention and Best Practices

| Domain | Strategy |
| ------------------------- | ------------------------------------------------------------------------ |
| **Capacity Mgmt** | Use VPA or the cluster autoscaler; align node sizes to workload profiles |
| **Alerting** | Alert on node NotReady, disk pressure, and memory usage thresholds |
| **Logging Hygiene** | Prevent excessive logs from apps; enforce log rotation policies |
| **Deployment Guardrails** | Use policies to prevent pod overcommit and noisy containers |
| **Node Upkeep** | Periodically rotate nodes via rolling upgrade or nodepool reimage |

diff --git a/holmes/plugins/runbooks/pod_troubleshooting_prometheus_instructions.md b/holmes/plugins/runbooks/pod_troubleshooting_prometheus_instructions.md
new file mode 100644
index 000000000..70566077f
--- /dev/null
+++ b/holmes/plugins/runbooks/pod_troubleshooting_prometheus_instructions.md
@@ -0,0 +1,62 @@
# Pod Scheduling Issues – Troubleshooting Runbook (Kubernetes)

## Goal

Diagnose and remediate pod scheduling issues in Kubernetes clusters. These issues typically manifest as pods stuck in `Pending`, or failing with `CrashLoopBackOff`, `ImagePullBackOff`, or `CreateContainerConfigError`. They are often silent until users escalate or deployment backlogs build up.

---

## Workflow

- Scope the pod issues: use kubectl to get context on cluster state, then narrow the scope to the affected pods and reduce noise.
- Track the current state: inspect resource requests, limits, and real-time usage; check for recent changes in replica counts, node conditions, or allocations; detect any resource bottlenecks or scheduling problems.
- Investigate metric patterns: query Prometheus metrics scoped to the pod, container, or related context; analyze trends over the last 24 hours; correlate with CPU, memory, disk, or network data; watch for spikes, gradual increases, or cyclic activity.
- Kubernetes events: check recent events, deployments, restarts, or scaling actions; link event timing to the issue; look for failures or warnings.
- Scan logs for issues: review logs from impacted pods or containers for errors or warnings; check for app-level failures or resource exhaustion; include system logs if relevant; match the findings against the associated events and Prometheus metrics.
- Review external factors: check dependent services or databases; verify networking and service discovery; inspect ingress and load-balancing layers; analyze service mesh data if in use.
- Build the story: use logs, metrics, and events to form and prioritize likely causes; trace the sequence of events leading to the pod issue; separate underlying issues from downstream symptoms.
- Gauge the effect: identify impacted users or services; determine the severity and spread; look for downstream or cascading issues; assess potential business consequences.
- Take action: recommend fixes to resolve the issue, suggest monitoring steps to confirm recovery, propose preventive changes, and highlight any config or scaling adjustments needed.


#### Pod-level kube-state metrics

| **Metric name** | **Description** | **Extra Labels** |
| ---------------------------------------------------------- | ------------------------------------------ | ----------------------------------------- |
| 
**kube_pod_container_status_last_terminated_reason** | Last container termination reason | `pod`, `container`, `reason` | +| **kube_pod_container_status_restarts_total** | Total container restarts | `pod`, `container`, `namespace` | +| **kube_pod_container_resource_requests** | Requested resources per container | `pod`, `container`, `resource` | +| **kube_pod_status_phase** | Current pod phase (Pending, Running, etc.) | `pod`, `phase`, `namespace` | +| **kube_pod_container_resource_limits** | Resource limits per container | `pod`, `container`, `resource` | +| **kube_pod_info** | Static pod info including node & IP | `pod`, `node`, `namespace` | +| **kube_pod_owner** | Pod owner reference | `pod`, `owner_kind`, `owner_name` | +| **kube_pod_labels** | Pod labels | `pod`, `label_*`, `namespace` | +| **kube_pod_annotations** | Pod annotations | `pod`, `annotation_*`, `namespace` | +| **kube_pod_container_status_waiting_reason** | Waiting status reasons (e.g. ErrImagePull) | `pod`, `container`, `reason`, `namespace` | +| **kube_pod_container_info** | Container information including image | `container`, `pod`, `namespace`, `image` | + + +#### Container cAdvisor metrics + +| **Metric name** | **Description** | **Extra Labels** | +| --------------------------------------------------------- | ------------------------------------------- | -------------------------------------------- | +| **container_spec_cpu_period** | CPU period for CFS scheduler | `container`, `pod`, `namespace`, `image` | +| **container_spec_cpu_quota** | CPU quota for CFS scheduler | `container`, `pod`, `namespace`, `image` | +| **container_cpu_usage_seconds_total** | Total CPU time used by the container | `container`, `pod`, `namespace`, `mode` | +| **container_memory_rss** | Resident memory used | `container`, `pod`, `namespace` | +| **container_memory_working_set_bytes** | Working set memory (RSS + cache - inactive) | `container`, `pod`, `namespace` | +| **container_memory_cache** | Page cache memory | `container`, `pod`, `namespace` | +| **container_memory_swap** | Swap memory usage | `container`, `pod`, `namespace` | +| **container_memory_usage_bytes** | Total memory usage | `container`, `pod`, `namespace` | +| **container_cpu_cfs_throttled_periods_total** | CPU throttling count | `container`, `pod`, `namespace` | +| **container_cpu_cfs_periods_total** | Total CFS scheduling periods | `container`, `pod`, `namespace` | +| **container_network_receive_bytes_total** | Total network bytes received | `container`, `interface`, `namespace`, `pod` | +| **container_network_transmit_bytes_total** | Total network bytes sent | `container`, `interface`, `namespace`, `pod` | +| **container_network_receive_packets_total** | Network packets received | `container`, `interface`, `namespace`, `pod` | +| **container_network_transmit_packets_total** | Network packets sent | `container`, `interface`, `namespace`, `pod` | +| **container_network_receive_packets_dropped_total** | Dropped received packets | `container`, `interface`, `namespace`, `pod` | +| **container_network_transmit_packets_dropped_total** | Dropped sent packets | `container`, `interface`, `namespace`, `pod` | +| **container_fs_reads_total** | Filesystem read ops | `container`, `device`, `namespace`, `pod` | +| **container_fs_writes_total** | Filesystem write ops | `container`, `device`, `namespace`, `pod` | +| **container_fs_reads_bytes_total** | Bytes read from filesystem | `container`, `device`, `namespace`, `pod` | +| **container_fs_writes_bytes_total** | Bytes written to 
filesystem | `container`, `device`, `namespace`, `pod` | diff --git a/holmes/plugins/sources/azuremonitoralerts/__init__.py b/holmes/plugins/sources/azuremonitoralerts/__init__.py new file mode 100644 index 000000000..317af014a --- /dev/null +++ b/holmes/plugins/sources/azuremonitoralerts/__init__.py @@ -0,0 +1,824 @@ +import logging +import json +from typing import List, Optional + +import requests +from holmes.core.issue import Issue +from holmes.core.tool_calling_llm import LLMResult +from holmes.plugins.interfaces import SourcePlugin +from holmes.plugins.toolsets.azuremonitor_metrics.utils import ( + get_aks_cluster_resource_id, + extract_cluster_name_from_resource_id, +) + +class AzureMonitorAlertsSource(SourcePlugin): + """Source plugin for Azure Monitor Prometheus metric alerts.""" + + # Class-level flag to prevent repeated console output across instances + _console_output_shown = False + + def __init__(self, cluster_resource_id: Optional[str] = None): + self.cluster_resource_id = cluster_resource_id or get_aks_cluster_resource_id() + self.access_token = None + + if not self.cluster_resource_id: + raise ValueError("Could not determine AKS cluster resource ID. Ensure you're running in an AKS cluster or provide cluster_resource_id.") + + # Extract subscription ID from cluster resource ID + cluster_parts = self.cluster_resource_id.split("/") + if len(cluster_parts) < 3: + raise ValueError(f"Invalid cluster resource ID format: {self.cluster_resource_id}") + + self.subscription_id = cluster_parts[2] + self.cluster_name = extract_cluster_name_from_resource_id(self.cluster_resource_id) + + def fetch_issues(self) -> List[Issue]: + """Fetch all active Prometheus metric alerts for the cluster.""" + logging.info(f"Fetching Prometheus alerts for cluster {self.cluster_name}") + + try: + # Get access token with correct scope for Azure Management API + access_token = self._get_azure_management_token() + if not access_token: + raise Exception("Could not obtain Azure access token for management API") + + headers = { + "Authorization": f"Bearer {access_token}", + "Content-Type": "application/json", + "Accept": "application/json", + } + + # Build Azure Monitor Alerts API URL + alerts_url = f"https://management.azure.com/subscriptions/{self.subscription_id}/providers/Microsoft.AlertsManagement/alerts" + + # Parameters for the API call - filter for all active alerts regardless of when they fired + api_params = { + "api-version": "2019-05-05-preview", + "alertState": "New,Acknowledged", # Get both New and Acknowledged active alerts + "monitorCondition": "Fired", # Only fired alerts (not resolved) + # Removed timeRange to get ALL active alerts regardless of when they were fired + } + + response = requests.get( + url=alerts_url, + headers=headers, + params=api_params, + timeout=60 + ) + + if response.status_code != 200: + raise Exception(f"Failed to fetch alerts: HTTP {response.status_code} - {response.text}") + + response_data = response.json() + alerts = response_data.get("value", []) + + logging.info(f"Found {len(alerts)} total alerts from Azure Monitor") + + # Filter for Prometheus metric alerts related to this cluster + prometheus_alerts = [] + processed_alerts = set() # Track processed alerts to avoid duplicates + + for i, alert in enumerate(alerts): + try: + alert_props = alert.get("properties", {}) + essentials = alert_props.get("essentials", {}) + + signal_type = essentials.get("signalType", "") + target_resource = essentials.get("targetResource", "") + target_resource_type = 
essentials.get("targetResourceType", "") + alert_rule = essentials.get("alertRule", "") + + # Check if this is a metric alert + if signal_type != "Metric": + continue + + # Check if alert is related to our cluster + # TargetResourceType can be either "managedclusters" or "Microsoft.ContainerService/managedClusters" + if target_resource_type.lower() not in ["managedclusters", "microsoft.containerservice/managedclusters"]: + continue + + if target_resource.lower() != self.cluster_resource_id.lower(): + continue + + # Use the actual alert ID for deduplication to show all alert instances + # Each alert instance (e.g., different pods) has a unique ID + alert_id = alert.get("id", "") + + if alert_id in processed_alerts: + continue + + processed_alerts.add(alert_id) + + if alert_rule: + # Fetch alert rule details to get the query + rule_details = self._get_alert_rule_details(alert_rule, headers) + if rule_details and self._is_prometheus_alert_rule(rule_details): + issue = self.convert_to_issue(alert, rule_details) + prometheus_alerts.append(issue) + + except Exception as e: + logging.warning(f"Failed to process alert {i+1}: {e}") + continue + + return prometheus_alerts + + except requests.RequestException as e: + raise ConnectionError("Failed to fetch data from Azure Monitor.") from e + + def fetch_issue(self, id: str) -> Optional[Issue]: + """Fetch a single alert by ID.""" + logging.info(f"Fetching specific alert {id}") + + try: + # Get access token with correct scope for Azure Management API + access_token = self._get_azure_management_token() + if not access_token: + raise Exception("Could not obtain Azure access token for management API") + + headers = { + "Authorization": f"Bearer {access_token}", + "Content-Type": "application/json", + "Accept": "application/json", + } + + # Handle both full resource path and just alert ID + if id.startswith("/subscriptions/"): + # Full resource path provided - use it directly + single_alert_url = f"https://management.azure.com{id}" + else: + # Just alert ID provided - construct the full path + single_alert_url = f"https://management.azure.com/subscriptions/{self.subscription_id}/providers/Microsoft.AlertsManagement/alerts/{id}" + + single_api_params = { + "api-version": "2019-05-05-preview", + "includeEgressConfig": "true", + } + + response = requests.get( + url=single_alert_url, + headers=headers, + params=single_api_params, + timeout=60 + ) + + if response.status_code == 404: + logging.warning(f"Alert {id} not found.") + return None + + if response.status_code != 200: + logging.error(f"Failed to get alert: {response.status_code} {response.text}") + raise Exception(f"Failed to get alert: {response.status_code} {response.text}") + + alert_data = response.json() + + # Check if this alert is a Prometheus metric alert for our cluster + alert_props = alert_data.get("properties", {}) + essentials = alert_props.get("essentials", {}) + + signal_type = essentials.get("signalType", "") + target_resource = essentials.get("targetResource", "") + target_resource_type = essentials.get("targetResourceType", "") + alert_rule = essentials.get("alertRule", "") + + logging.debug(f"Alert validation - Signal Type: {signal_type}, Target Resource Type: {target_resource_type}") + logging.debug(f"Target Resource: {target_resource}") + logging.debug(f"Expected Cluster: {self.cluster_resource_id}") + logging.debug(f"Alert Rule: {alert_rule}") + + if (signal_type == "Metric" and + target_resource_type.lower() == "microsoft.containerservice/managedclusters" and + 
target_resource.lower() == self.cluster_resource_id.lower()): + + if alert_rule: + rule_details = self._get_alert_rule_details(alert_rule, headers) + if rule_details and self._is_prometheus_alert_rule(rule_details): + return self.convert_to_issue(alert_data, rule_details) + else: + logging.warning(f"Alert rule {alert_rule} is not a Prometheus rule or failed to fetch details") + else: + logging.warning(f"Alert {id} has no alert rule") + else: + logging.warning(f"Alert validation failed - Signal: {signal_type}, Type: {target_resource_type}, Resource match: {target_resource.lower() == self.cluster_resource_id.lower()}") + + logging.warning(f"Alert {id} is not a Prometheus metric alert for cluster {self.cluster_name}") + return None + + except requests.RequestException as e: + logging.error(f"Connection error while fetching alert {id}: {e}") + raise ConnectionError("Failed to fetch data from Azure Monitor.") from e + + def convert_to_issue(self, source_alert: dict, rule_details: dict) -> Issue: + """Convert Azure Monitor alert to Holmes Issue object.""" + alert_props = source_alert.get("properties", {}) + essentials = alert_props.get("essentials", {}) + + # Set the current alert name for Prometheus query extraction + # Try multiple possible fields for alert name + alert_name = (essentials.get("alertName") or + essentials.get("name") or + alert_props.get("name") or + source_alert.get("name", "Unknown")) + + logging.debug(f"Alert name sources: alertName='{essentials.get('alertName')}', " + f"essentials.name='{essentials.get('name')}', " + f"properties.name='{alert_props.get('name')}', " + f"root.name='{source_alert.get('name')}', " + f"final='{alert_name}'") + + self._current_alert_name = alert_name + + # Extract query and description from rule details + query = self._extract_query_from_rule(rule_details) + rule_description = self._extract_description_from_rule(rule_details) + + # Output extracted information to console for verification (only once globally) + if not AzureMonitorAlertsSource._console_output_shown: + print(f"[Azure Monitor Alert] Alert: {alert_name}") + print(f"[Azure Monitor Alert] Query: {query}") + print(f"[Azure Monitor Alert] Rule Description: {rule_description}") + AzureMonitorAlertsSource._console_output_shown = True + + # Create formatted description + description_parts = [ + f"Alert: {essentials.get('alertName', 'Unknown')}", + f"Description: {essentials.get('description', 'No description')}", + f"Rule Description: {rule_description}", + f"Severity: {essentials.get('severity', 'Unknown')}", + f"State: {essentials.get('alertState', 'Unknown')}", + f"Fired Time: {essentials.get('firedDateTime', 'Unknown')}", + f"Query: {query}", + f"Alert Rule ID: {essentials.get('alertRule', 'Unknown')}", + ] + + description = "\n".join(description_parts) + + # Create raw data with all relevant information + raw_data = { + "alert": source_alert, + "rule_details": rule_details, + "cluster_resource_id": self.cluster_resource_id, + "cluster_name": self.cluster_name, + "extracted_query": query, + "extracted_description": rule_description, + } + + return Issue( + id=source_alert.get("id", ""), + name=essentials.get("alertName", "Azure Monitor Alert"), + source_type="azuremonitoralerts", + source_instance_id=self.cluster_resource_id, + description=description, + raw=raw_data, + ) + + def write_back_result(self, issue_id: str, result_data: LLMResult) -> None: + """Write investigation results back to Azure Monitor (currently not supported).""" + logging.info(f"Writing back result to alert 
{issue_id} is not currently supported for Azure Monitor alerts") + # Azure Monitor doesn't have a direct way to add comments to alerts + # This could be implemented by creating an annotation or sending to a webhook + + def _get_alert_rule_details(self, alert_rule_id: str, headers: dict) -> Optional[dict]: + """Get detailed information about an alert rule.""" + try: + # Check if this is a Prometheus rule group + if "prometheusRuleGroups" in alert_rule_id: + # Use Prometheus Rule Groups API + api_version = "2023-03-01" + else: + # Use standard metric alerts API + api_version = "2018-03-01" + + response = requests.get( + url=f"https://management.azure.com{alert_rule_id}", + headers=headers, + params={"api-version": api_version}, + timeout=30 + ) + + if response.status_code == 200: + return response.json() + + except Exception as e: + logging.warning(f"Failed to get alert rule details for {alert_rule_id}: {e}") + + return None + + def _is_prometheus_alert_rule(self, rule_details: dict) -> bool: + """Check if an alert rule is based on Prometheus metrics.""" + try: + properties = rule_details.get("properties", {}) + criteria = properties.get("criteria", {}) + + # Check if the criteria contains Prometheus-related information + all_of = criteria.get("allOf", []) + + for condition in all_of: + metric_name = condition.get("metricName", "") + metric_namespace = condition.get("metricNamespace", "") + + # Prometheus metrics in Azure Monitor typically have specific namespaces + # or metric names that indicate they're from Prometheus + if (metric_namespace and "prometheus" in metric_namespace.lower()) or \ + (metric_name and any(prom_indicator in metric_name.lower() + for prom_indicator in ["prometheus", "container_", "node_", "kube_", "up"])): + return True + + # Check if the condition has a custom query (Prometheus PromQL) + if "query" in condition or "promql" in str(condition).lower(): + return True + + # Check for additional indicators in the rule properties + rule_description = properties.get("description", "").lower() + if "prometheus" in rule_description or "promql" in rule_description: + return True + + return True # For now, assume all metric alerts could be Prometheus-based + + except Exception as e: + logging.warning(f"Failed to check if alert rule is Prometheus-based: {e}") + return False + + def _extract_query_from_rule(self, rule_details: dict) -> str: + """Extract the Prometheus query from alert rule details.""" + try: + # Check if this is a Prometheus rule group + if "prometheusRuleGroups" in rule_details.get("id", ""): + return self._extract_prometheus_query_from_rule_group(rule_details) + + # Fallback to original Azure Monitor metric alert parsing + properties = rule_details.get("properties", {}) + criteria = properties.get("criteria", {}) + all_of = criteria.get("allOf", []) + + if all_of: + condition = all_of[0] # Take the first condition + metric_name = condition.get("metricName", "") + + # For Azure Monitor metric alerts, return the metric name as the query + if metric_name: + return metric_name + + # Try to find a custom query if available + if "query" in condition: + return condition["query"] + + return "Query not available" + + except Exception as e: + logging.warning(f"Failed to extract query from rule: {e}") + return "Query extraction failed" + + def _extract_description_from_rule(self, rule_details: dict) -> str: + """Extract the description from alert rule details.""" + try: + # Check if this is a Prometheus rule group + if "prometheusRuleGroups" in rule_details.get("id", 
""): + return self._extract_prometheus_description_from_rule_group(rule_details) + + # Fallback to Azure Monitor metric alert description + properties = rule_details.get("properties", {}) + + # Try to get description from different possible fields + description = (properties.get("description") or + properties.get("summary") or + properties.get("displayName")) + + if description: + return description + + return "No description available" + + except Exception as e: + logging.warning(f"Failed to extract description from rule: {e}") + return "Description extraction failed" + + def _extract_prometheus_query_from_rule_group(self, rule_details: dict) -> str: + """Extract PromQL query from Prometheus rule group for a specific alert.""" + try: + # Get the alert name from the current alert context + alert_name = getattr(self, '_current_alert_name', '') + + if not alert_name or alert_name == "Unknown": + logging.warning(f"No valid alert name available for Prometheus query extraction: '{alert_name}'") + # Try to extract from the rule details directly + return self._extract_query_from_rule_group_directly(rule_details) + + # Parse alert name to extract rule group name and alert rule name + rule_group_name, alert_rule_name = self._parse_alert_name(alert_name) + + if not rule_group_name or not alert_rule_name: + logging.warning(f"Could not parse alert name: '{alert_name}', trying direct extraction") + # Fallback to direct extraction from rule details + return self._extract_query_from_rule_group_directly(rule_details) + + logging.info(f"Extracting PromQL query for rule group: '{rule_group_name}', alert rule: '{alert_rule_name}'") + + # Fetch the complete Prometheus rule group + prometheus_rule_group = self._fetch_prometheus_rule_group(rule_group_name) + + if not prometheus_rule_group: + logging.warning(f"Could not fetch Prometheus rule group: '{rule_group_name}', trying direct extraction") + return self._extract_query_from_rule_group_directly(rule_details) + + # Extract the specific alert rule and its query + promql_query = self._find_alert_rule_query(prometheus_rule_group, alert_rule_name) + + if not promql_query: + logging.warning(f"Could not find query for alert rule: '{alert_rule_name}', trying direct extraction") + return self._extract_query_from_rule_group_directly(rule_details) + + # Apply cluster filtering to the query + from holmes.plugins.toolsets.azuremonitor_metrics.utils import enhance_promql_with_cluster_filter + enhanced_query = enhance_promql_with_cluster_filter(promql_query, self.cluster_name) + + logging.info(f"Successfully extracted and enhanced PromQL query: {enhanced_query}") + return enhanced_query + + except Exception as e: + logging.error(f"Failed to extract Prometheus query from rule group: {e}") + # Final fallback + return self._extract_query_from_rule_group_directly(rule_details) + + def _extract_prometheus_description_from_rule_group(self, rule_details: dict) -> str: + """Extract description from Prometheus rule group for a specific alert.""" + try: + # Get the alert name from the current alert context + alert_name = getattr(self, '_current_alert_name', '') + + if not alert_name or alert_name == "Unknown": + logging.warning(f"No valid alert name available for Prometheus description extraction: '{alert_name}'") + # Try to extract from the rule details directly + return self._extract_description_from_rule_group_directly(rule_details) + + # Parse alert name to extract rule group name and alert rule name + rule_group_name, alert_rule_name = self._parse_alert_name(alert_name) + + if 
not rule_group_name or not alert_rule_name: + logging.warning(f"Could not parse alert name: '{alert_name}', trying direct extraction") + # Fallback to direct extraction from rule details + return self._extract_description_from_rule_group_directly(rule_details) + + logging.info(f"Extracting description for rule group: '{rule_group_name}', alert rule: '{alert_rule_name}'") + + # Fetch the complete Prometheus rule group + prometheus_rule_group = self._fetch_prometheus_rule_group(rule_group_name) + + if not prometheus_rule_group: + logging.warning(f"Could not fetch Prometheus rule group: '{rule_group_name}', trying direct extraction") + return self._extract_description_from_rule_group_directly(rule_details) + + # Extract the specific alert rule and its description + rule_description = self._find_alert_rule_description(prometheus_rule_group, alert_rule_name) + + if not rule_description: + logging.warning(f"Could not find description for alert rule: '{alert_rule_name}', trying direct extraction") + return self._extract_description_from_rule_group_directly(rule_details) + + logging.info(f"Successfully extracted rule description: {rule_description}") + return rule_description + + except Exception as e: + logging.error(f"Failed to extract Prometheus description from rule group: {e}") + # Final fallback + return self._extract_description_from_rule_group_directly(rule_details) + + def _extract_description_from_rule_group_directly(self, rule_details: dict) -> str: + """ + Fallback method to extract description information directly from rule details. + Used when alert name parsing fails or specific rule lookup fails. + """ + try: + properties = rule_details.get("properties", {}) + + # Try to find any useful description information in the rule details + if "rules" in properties: + rules = properties["rules"] + logging.info(f"Attempting direct description extraction from {len(rules)} rules in group") + + # Try to find any rule with a description + for i, rule in enumerate(rules): + rule_name = rule.get("alert", "") or rule.get("record", "") or f"rule_{i}" + + # Try different possible description fields + description = (rule.get("annotations", {}).get("description") or + rule.get("annotations", {}).get("summary") or + rule.get("description") or + rule.get("summary")) + + if description: + logging.info(f"Found description in rule '{rule_name}': {description}") + return description + + # If no descriptions found, provide generic info + return f"Prometheus rule group with {len(rules)} rules (no descriptions available)" + + # If no rules found, look for other indicators + rule_name = properties.get("name", "Unknown Rule Group") + return f"Prometheus rule group: {rule_name} (no rules found for description extraction)" + + except Exception as e: + logging.warning(f"Direct description extraction failed: {e}") + return f"Description extraction failed: {str(e)}" + + def _find_alert_rule_description(self, rule_group: dict, alert_rule_name: str) -> Optional[str]: + """ + Find the description for a specific alert rule within a Prometheus rule group. 
+ + Args: + rule_group: Complete Prometheus rule group details + alert_rule_name: Name of the specific alert rule to find + + Returns: + str: Rule description if found, None otherwise + """ + try: + properties = rule_group.get("properties", {}) + rules = properties.get("rules", []) + + logging.info(f"Searching for alert rule description '{alert_rule_name}' in {len(rules)} rules") + + # Find the exact matching rule + for rule in rules: + rule_name = rule.get("alert", "") or rule.get("record", "") + + if rule_name == alert_rule_name: + # Found the matching rule, extract description + annotations = rule.get("annotations", {}) + + # Try different description fields in order of preference + description = (annotations.get("description") if annotations else None) or \ + (annotations.get("summary") if annotations else None) or \ + rule.get("description") or \ + rule.get("summary") + + if description: + logging.info(f"Found description for '{alert_rule_name}': {description}") + return description + else: + # Rule found but no description + available_fields = list(annotations.keys()) if annotations else [] + logging.warning(f"Alert rule '{alert_rule_name}' found but has no description. Available annotation fields: {available_fields}") + return "No description available for this rule" + + # If we get here, the alert rule was not found + available_rules = [rule.get("alert", "") or rule.get("record", "") for rule in rules] + logging.warning(f"Alert rule '{alert_rule_name}' not found for description extraction. Available rules: {available_rules}") + return None + + except Exception as e: + logging.error(f"Exception while finding alert rule description for '{alert_rule_name}': {e}") + return None + + def _extract_query_from_rule_group_directly(self, rule_details: dict) -> str: + """ + Fallback method to extract query information directly from rule details. + Used when alert name parsing fails or specific rule lookup fails. + """ + try: + properties = rule_details.get("properties", {}) + + # Try to find any useful query information in the rule details + if "rules" in properties: + rules = properties["rules"] + logging.info(f"Attempting direct extraction from {len(rules)} rules in group") + + # Try to find any rule with a query + for i, rule in enumerate(rules): + # Log rule structure for debugging + rule_name = rule.get("alert", "") or rule.get("record", "") or f"rule_{i}" + rule_type = "alert" if rule.get("alert") else "record" if rule.get("record") else "unknown" + + logging.debug(f"Examining rule '{rule_name}' (type: {rule_type})") + + # Try different possible query fields + query = (rule.get("expr") or + rule.get("expression") or + rule.get("query")) + + if query: + logging.info(f"Found query in rule '{rule_name}': {query}") + # Apply cluster filtering + from holmes.plugins.toolsets.azuremonitor_metrics.utils import enhance_promql_with_cluster_filter + enhanced_query = enhance_promql_with_cluster_filter(query, self.cluster_name) + return enhanced_query + else: + # Log available fields in the rule + available_fields = list(rule.keys()) + logging.debug(f"Rule '{rule_name}' has no query field. Available fields: {available_fields}") + + # If no queries found, provide info about available rules + rule_info = [] + for rule in rules: + rule_name = rule.get("alert", "") or rule.get("record", "") or "unnamed" + rule_type = "alert" if rule.get("alert") else "record" if rule.get("record") else "unknown" + rule_info.append(f"{rule_name} ({rule_type})") + + logging.warning(f"No queries found in any rules. 
Available rules: {rule_info}") + return f"Prometheus rule group with {len(rules)} rules but no extractable queries" + + # If no rules found, look for other indicators + rule_name = properties.get("name", "Unknown Rule Group") + logging.warning(f"No rules found in rule group properties") + return f"Prometheus rule group: {rule_name} (no rules found for query extraction)" + + except Exception as e: + logging.warning(f"Direct query extraction failed: {e}") + return f"Query extraction failed: {str(e)}" + + def _parse_alert_name(self, alert_name: str) -> tuple[str, str]: + """ + Parse alert name to extract rule group name and alert rule name. + + Args: + alert_name: Full alert name like "Prometheus Recommended Cluster level Alerts - vishwa-tme-1/KubeContainerWaiting" + + Returns: + tuple: (rule_group_name, alert_rule_name) + """ + try: + if "/" in alert_name: + # Split on the last "/" to separate rule group from alert rule + rule_group_name = alert_name.rsplit("/", 1)[0] # "Prometheus Recommended Cluster level Alerts - vishwa-tme-1" + alert_rule_name = alert_name.rsplit("/", 1)[1] # "KubeContainerWaiting" + return rule_group_name.strip(), alert_rule_name.strip() + + # If no "/" found, assume the entire name is the alert rule name + return "", alert_name.strip() + + except Exception as e: + logging.warning(f"Failed to parse alert name '{alert_name}': {e}") + return "", "" + + def _fetch_prometheus_rule_group(self, rule_group_name: str) -> Optional[dict]: + """ + Fetch Prometheus rule group details from Azure Monitor. + + Args: + rule_group_name: Name of the Prometheus rule group + + Returns: + dict: Rule group details if found, None otherwise + """ + try: + # Get access token for Azure Management API + access_token = self._get_azure_management_token() + if not access_token: + logging.error("Could not obtain Azure access token for management API") + return None + + headers = { + "Authorization": f"Bearer {access_token}", + "Content-Type": "application/json", + "Accept": "application/json", + } + + # We need to find the rule group by name across all resource groups + # First, try to list all Prometheus rule groups in the subscription + list_url = f"https://management.azure.com/subscriptions/{self.subscription_id}/providers/Microsoft.AlertsManagement/prometheusRuleGroups" + + list_params = { + "api-version": "2023-03-01" + } + + response = requests.get( + url=list_url, + headers=headers, + params=list_params, + timeout=60 + ) + + if response.status_code != 200: + logging.error(f"Failed to list Prometheus rule groups: HTTP {response.status_code} - {response.text}") + return None + + rule_groups_list = response.json().get("value", []) + + # Find the rule group with matching name + target_rule_group = None + for rule_group in rule_groups_list: + rg_name = rule_group.get("name", "") + if rg_name == rule_group_name: + target_rule_group = rule_group + break + + if not target_rule_group: + logging.warning(f"Prometheus rule group '{rule_group_name}' not found in subscription") + return None + + # Get the full details of the specific rule group + rule_group_id = target_rule_group.get("id", "") + if not rule_group_id: + logging.error("Rule group ID not found") + return None + + detail_url = f"https://management.azure.com{rule_group_id}" + + detail_params = { + "api-version": "2023-03-01" + } + + detail_response = requests.get( + url=detail_url, + headers=headers, + params=detail_params, + timeout=60 + ) + + if detail_response.status_code == 200: + rule_group_details = detail_response.json() + 
logging.info(f"Successfully fetched Prometheus rule group: {rule_group_name}") + return rule_group_details + else: + logging.error(f"Failed to fetch rule group details: HTTP {detail_response.status_code} - {detail_response.text}") + return None + + except Exception as e: + logging.error(f"Exception while fetching Prometheus rule group '{rule_group_name}': {e}") + return None + + def _find_alert_rule_query(self, rule_group: dict, alert_rule_name: str) -> Optional[str]: + """ + Find the PromQL query for a specific alert rule within a Prometheus rule group. + + Args: + rule_group: Complete Prometheus rule group details + alert_rule_name: Name of the specific alert rule to find + + Returns: + str: PromQL query if found, None otherwise + """ + try: + properties = rule_group.get("properties", {}) + rules = properties.get("rules", []) + + logging.info(f"Searching for alert rule '{alert_rule_name}' in {len(rules)} rules") + + # First, try to find the exact matching rule + matching_rule = None + for rule in rules: + rule_name = rule.get("alert", "") or rule.get("record", "") + + if rule_name == alert_rule_name: + matching_rule = rule + break + + if matching_rule: + # Log the full rule structure for debugging + logging.debug(f"Found matching rule: {json.dumps(matching_rule, indent=2)}") + + # Try different possible query fields + promql_query = (matching_rule.get("expr") or + matching_rule.get("expression") or + matching_rule.get("query")) + + if promql_query: + logging.info(f"Found PromQL query for '{alert_rule_name}': {promql_query}") + return promql_query + else: + # Rule found but no query - log all available fields + available_fields = list(matching_rule.keys()) + logging.warning(f"Alert rule '{alert_rule_name}' found but has no query field. Available fields: {available_fields}") + + # Try to extract any query-like content + for field in ["expr", "expression", "query", "condition", "criteria"]: + if field in matching_rule and matching_rule[field]: + logging.info(f"Found potential query in '{field}' field: {matching_rule[field]}") + return str(matching_rule[field]) + + return None + + # If we get here, the alert rule was not found + available_rules = [] + for rule in rules: + rule_name = rule.get("alert", "") or rule.get("record", "") + rule_type = "alert" if rule.get("alert") else "record" if rule.get("record") else "unknown" + available_rules.append(f"{rule_name} ({rule_type})") + + logging.warning(f"Alert rule '{alert_rule_name}' not found. 
Available rules: {available_rules}") + return None + + except Exception as e: + logging.error(f"Exception while finding alert rule query for '{alert_rule_name}': {e}") + return None + + def _get_azure_management_token(self) -> Optional[str]: + """Get Azure access token for Azure Management API.""" + try: + from azure.identity import DefaultAzureCredential, AzureCliCredential + + # Try AzureCliCredential first since we know Azure CLI is working + try: + credential = AzureCliCredential() + logging.debug("Using AzureCliCredential for management API") + except Exception as cli_error: + logging.debug(f"AzureCliCredential failed: {cli_error}, falling back to DefaultAzureCredential") + credential = DefaultAzureCredential() + + # Get token with Azure management scope + token = credential.get_token("https://management.azure.com/.default") + logging.debug("Successfully obtained Azure management token") + return token.token + + except Exception as e: + logging.error(f"Failed to get Azure management access token: {e}") + return None diff --git a/holmes/plugins/toolsets/__init__.py b/holmes/plugins/toolsets/__init__.py index 6932c82ed..f46655e87 100644 --- a/holmes/plugins/toolsets/__init__.py +++ b/holmes/plugins/toolsets/__init__.py @@ -6,6 +6,27 @@ import yaml # type: ignore from pydantic import ValidationError +try: + from holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset import AzureMonitorMetricsToolset + AZURE_MONITOR_METRICS_AVAILABLE = True + logging.info("Azure Monitor Metrics toolset imported successfully") +except ImportError as e: + logging.warning(f"Azure Monitor Metrics toolset not available due to ImportError: {e}") + AZURE_MONITOR_METRICS_AVAILABLE = False +except Exception as e: + logging.error(f"Failed to import Azure Monitor Metrics toolset: {e}", exc_info=True) + AZURE_MONITOR_METRICS_AVAILABLE = False + +try: + from holmes.plugins.toolsets.azuremonitorlogs.azuremonitorlogs_toolset import AzureMonitorLogsToolset + AZURE_MONITOR_LOGS_AVAILABLE = True + logging.info("Azure Monitor Logs toolset imported successfully") +except ImportError as e: + logging.warning(f"Azure Monitor Logs toolset not available due to ImportError: {e}") + AZURE_MONITOR_LOGS_AVAILABLE = False +except Exception as e: + logging.error(f"Failed to import Azure Monitor Logs toolset: {e}", exc_info=True) + AZURE_MONITOR_LOGS_AVAILABLE = False import holmes.utils.env as env_utils from holmes.common.env_vars import USE_LEGACY_KUBERNETES_LOGS from holmes.core.supabase_dal import SupabaseDal @@ -93,6 +114,20 @@ def load_python_toolsets(dal: Optional[SupabaseDal]) -> List[Toolset]: AzureSQLToolset(), ServiceNowToolset(), ] + + # Add Azure Monitor Metrics toolset if available + if AZURE_MONITOR_METRICS_AVAILABLE: + logging.info("Adding Azure Monitor Metrics toolset to built-in toolsets") + toolsets.append(AzureMonitorMetricsToolset()) + else: + logging.warning("Azure Monitor Metrics toolset not available - skipping") + + # Add Azure Monitor Logs toolset if available + if AZURE_MONITOR_LOGS_AVAILABLE: + logging.info("Adding Azure Monitor Logs toolset to built-in toolsets") + toolsets.append(AzureMonitorLogsToolset()) + else: + logging.warning("Azure Monitor Logs toolset not available - skipping") if not USE_LEGACY_KUBERNETES_LOGS: toolsets.append(KubernetesLogsToolset()) @@ -122,6 +157,11 @@ def load_builtin_toolsets(dal: Optional[SupabaseDal] = None) -> List[Toolset]: toolset.type = ToolsetType.BUILTIN # dont' expose build-in toolsets path toolset.path = None + # Set enabled status based on 
is_default property + if hasattr(toolset, 'is_default'): + toolset.enabled = toolset.is_default + else: + toolset.enabled = False # Default to disabled if no is_default property return all_toolsets # type: ignore diff --git a/holmes/plugins/toolsets/azuremonitor_metrics/__init__.py b/holmes/plugins/toolsets/azuremonitor_metrics/__init__.py new file mode 100644 index 000000000..de34e676c --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitor_metrics/__init__.py @@ -0,0 +1 @@ +"""Azure Monitor Metrics toolset for querying Azure Monitor managed Prometheus metrics.""" diff --git a/holmes/plugins/toolsets/azuremonitor_metrics/azuremonitor_metrics_instructions.jinja2 b/holmes/plugins/toolsets/azuremonitor_metrics/azuremonitor_metrics_instructions.jinja2 new file mode 100644 index 000000000..3824a16d4 --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitor_metrics/azuremonitor_metrics_instructions.jinja2 @@ -0,0 +1,148 @@ +You have access to Azure Monitor Metrics tools for querying Azure Monitor managed Prometheus metrics from AKS clusters. This toolset is designed to work from external environments (like local development machines) and connects to AKS clusters remotely via Azure APIs. + +## Available Tools: +{% for tool_name in tool_names %} +- {{ tool_name }} +{% endfor %} + +## Key Capabilities: +- Connect to AKS clusters from any environment with proper Azure credentials +- Auto-discover AKS cluster configuration using Azure Resource Graph +- Check if Azure Monitor managed Prometheus is enabled for specified clusters +- Execute PromQL queries against Azure Monitor workspaces with automatic cluster filtering +- Query both instant values and time-series data ranges +- **List active Prometheus metric alerts** for investigation workflow +- Support manual configuration for specific clusters + +## Important Usage Guidelines: + +### 1. Configuration and Setup: +The toolset works in two modes: +- **Auto-detection**: Attempts to discover available AKS clusters using Azure credentials +- **Manual configuration**: Uses explicitly configured cluster details from config.yaml + +### 2. Setup Workflow: +To get started: +1. Ensure Azure credentials are configured (az login or environment variables) +2. Run `check_azure_monitor_prometheus_enabled` to verify and configure the cluster +3. Execute Prometheus queries once the workspace is configured + +### 3. Automatic Cluster Filtering: +- All PromQL queries are automatically filtered by the cluster name using the "cluster" label +- This ensures queries only return metrics for the current AKS cluster +- You can disable auto-filtering by setting `auto_cluster_filter: false` if needed +- The cluster filtering helps avoid confusion when multiple clusters send metrics to the same Azure Monitor workspace + +### 4. Query Types: +- Use `execute_azuremonitor_prometheus_query` for instant/current values +- Use `execute_azuremonitor_prometheus_range_query` for time-series data and trends +- Always provide meaningful descriptions for queries to help with analysis + +### 5. Error Handling: +- If Azure Monitor managed Prometheus is not enabled, guide the user to enable it in Azure portal +- If no cluster is specified, suggest providing cluster_resource_id or configuring it in config.yaml +- If queries return no data, check if the metric exists and cluster filtering is correct +- For authentication issues, verify Azure credentials and permissions + +### 6. 
Common AKS Metrics to Query:
+- `container_cpu_usage_seconds_total` - CPU usage by containers
+- `container_memory_working_set_bytes` - Memory usage by containers
+- `kube_pod_status_phase` - Pod status information
+- `kube_node_status_condition` - Node health status
+- `container_fs_usage_bytes` - Filesystem usage
+- `kube_deployment_status_replicas` - Deployment replica status
+
+### 7. Troubleshooting Scenarios:
+When investigating AKS issues, consider querying:
+- Resource utilization (CPU, memory, disk)
+- Pod and node health status
+- Application-specific metrics
+- Infrastructure metrics
+- Network metrics
+
+### 8. Alert Investigation Workflow with Query Analysis:
+**IMPORTANT**: When users ask about Azure Monitor alerts, use this comprehensive approach:
+
+**Step 1 - List Active Alerts:**
+- **MANDATORY: Use ONLY the `get_azure_monitor_alerts_with_fired_times` tool for Azure Monitor alerts**
+- **CRITICAL: Display the tool output EXACTLY as returned - never interpret, summarize, reformat, or replace it with your own alert summary**
+- **Show the complete formatted output with icons, markdown, and styling exactly as returned**
+- **The tool output is already complete and correctly formatted, including:**
+  - **Investigation commands for each alert (copy these exactly)**
+  - **Full Alert IDs and short Alert IDs (both are provided)**
+  - **Instructions on how to investigate using `holmes investigate azuremonitormetrics prometheusmetrics <alert-id>`**
+  - Professional layout with icons (🔔 🚨 📋 ⚡ 📖 etc.)
+  - Alert names and complete Alert IDs (complete Azure resource paths in code blocks)
+  - **PromQL queries that triggered each alert** - these are crucial for investigation
+  - **Rule descriptions explaining what each alert monitors** - essential context for understanding alerts
+  - Alert descriptions, rule IDs, severity, status, fired times
+- **Do NOT add your own investigation instructions - the tool output already includes them; just display it as-is**
+
+**Step 2 - Execute Alert's Original Query for Timeline Analysis:**
+When investigating a specific alert, **ALWAYS use the `execute_alert_promql_query` tool** to:
+- Execute the exact PromQL query that triggered the alert
+- Analyze the timeline of the metric that caused the alert
+- Use different time ranges (1h, 2h, 6h, 1d) to see trends
+- Example: `execute_alert_promql_query` with alert_id and time_range "2h"
+
+**Step 3 - Deep Investigation with Custom Queries:**
+Based on the alert's query and timeline, create related queries:
+- **Trend analysis**: Use `execute_azuremonitor_prometheus_range_query` to analyze trends
+- **Current state**: Use `execute_azuremonitor_prometheus_query` for instant values
+- **Related metrics**: Query related metrics to understand root cause
+- **Resource analysis**: Query CPU, memory, disk, network metrics around the alert time
+
+**Step 4 - Timeline Correlation:**
+- Compare the alert's fired time with metric trends
+- Look for patterns before and after the alert triggered
+- Identify if the issue is ongoing or resolved
+- Find correlations with other system metrics
+
+**Query Investigation Best Practices:**
+1. **Start with the alert's query**: Always execute the original PromQL query first
+2. **Expand time range**: Look at longer periods (6h, 1d) to see patterns
+3. **Investigate related metrics**: Query CPU, memory, pod status, deployment status
+4. **Use range queries for trends**: Historical data shows patterns better than instant values
+5. **Cross-reference timing**: Compare alert fired time with metric spikes/drops
+
+**Example Investigation Flow:**
+```
+1. List alerts → get_azure_monitor_alerts_with_fired_times
+2. See alert query: kube_deployment_status_replicas{cluster="myaks"} != kube_deployment_spec_replicas{cluster="myaks"}
+3. Execute alert query → execute_alert_promql_query (alert_id, time_range="2h")
+4. Analyze deployment status → execute_azuremonitor_prometheus_range_query with deployment queries
+5. Check pod status → query kube_pod_status_phase for related pods
+6. Investigate resources → query CPU/memory metrics for the timeframe
+```
+
+**Timeline Analysis Questions to Answer:**
+- When did the metric first breach the threshold?
+- Is the issue ongoing or resolved?
+- What other metrics show anomalies at the same time?
+- Are there patterns or recurring issues?
+- What was the system state before and after the alert?
+
+**Display Format Requirements:**
+- Always show the complete tool output with full Alert IDs
+- Include the instruction text with exact tool names
+- Present all alert details (severity, status, fired time, **query**, etc.)
+- Highlight the PromQL queries as they're essential for investigation
+- Do not abbreviate or summarize the alert information
+
+### 9. Time Range Considerations:
+- Default time span is 1 hour for range queries
+- Adjust time ranges based on when issues occurred
+- Use appropriate step intervals for range queries (e.g., 60s for detailed analysis)
+
+{% if config and config.cluster_name %}
+### Current Configuration:
+- Cluster Name: {{ config.cluster_name }}
+{% if config.azure_monitor_workspace_endpoint %}
+- Azure Monitor Endpoint: {{ config.azure_monitor_workspace_endpoint }}
+{% endif %}
+{% endif %}
+
+Remember: Azure Monitor managed Prometheus must be enabled on the AKS cluster for these tools to work. The toolset can work from any environment with proper Azure credentials and cluster configuration. See AZURE_MONITOR_SETUP_GUIDE.md for detailed setup instructions when running from external environments.
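
+
+**Example - automatic cluster filtering (illustrative sketch, not the exact rewrite performed by the toolset):** per section 3 above, the toolset injects the `cluster` label before execution, so a request like the first line runs as something like the second (the cluster name shown is a placeholder):
+```
+Requested: kube_pod_status_phase
+Executed:  kube_pod_status_phase{cluster="<your-cluster-name>"}
+```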
diff --git a/holmes/plugins/toolsets/azuremonitor_metrics/azuremonitor_metrics_toolset.py b/holmes/plugins/toolsets/azuremonitor_metrics/azuremonitor_metrics_toolset.py new file mode 100644 index 000000000..eb01c8cc1 --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitor_metrics/azuremonitor_metrics_toolset.py @@ -0,0 +1,1306 @@ +"""Azure Monitor Metrics toolset for HolmesGPT.""" + +import json +import logging +import os +import time +from typing import Any, Dict, List, Optional, Tuple +from urllib.parse import urljoin + +import requests +from pydantic import BaseModel, field_validator +from requests import RequestException +from datetime import datetime +import dateutil.parser +import dateutil.relativedelta + +from holmes.core.tools import ( + CallablePrerequisite, + StructuredToolResult, + Tool, + ToolParameter, + ToolResultStatus, + Toolset, + ToolsetTag, +) +from holmes.plugins.toolsets.consts import STANDARD_END_DATETIME_TOOL_PARAM_DESCRIPTION +from holmes.plugins.toolsets.utils import ( + get_param_or_raise, + process_timestamps_to_rfc3339, + standard_start_datetime_tool_param_description, +) + +from azure.identity import DefaultAzureCredential, AzureCliCredential +from azure.core.exceptions import AzureError + +from .utils import ( + check_if_running_in_aks, + extract_cluster_name_from_resource_id, + get_aks_cluster_resource_id, + get_azure_monitor_workspace_for_cluster, + enhance_promql_with_cluster_filter, +) + +DEFAULT_TIME_SPAN_SECONDS = 3600 + +class AzureMonitorMetricsConfig(BaseModel): + """Configuration for Azure Monitor Metrics toolset.""" + azure_monitor_workspace_endpoint: Optional[str] = None + cluster_name: Optional[str] = None + cluster_resource_id: Optional[str] = None + auto_detect_cluster: bool = True + tool_calls_return_data: bool = True + headers: Dict = {} + # Step size and data limiting configuration + default_step_seconds: int = 3600 # 1 hour default step size + min_step_seconds: int = 60 # Minimum 1 minute step size + max_data_points: int = 1000 # Maximum data points per query + # Internal fields for Azure authentication + _credential: Optional[Any] = None + _token_cache: Dict = {} + + @field_validator("azure_monitor_workspace_endpoint") + def ensure_trailing_slash(cls, v: Optional[str]) -> Optional[str]: + if v is not None and not v.endswith("/"): + return v + "/" + return v + + class Config: + arbitrary_types_allowed = True + +class BaseAzureMonitorMetricsTool(Tool): + """Base class for Azure Monitor Metrics tools.""" + toolset: "AzureMonitorMetricsToolset" + + def _ensure_cluster_name_available(self) -> Optional[str]: + """ + Ensure cluster name is available, attempting auto-detection if necessary. 
+ + Returns: + str: Cluster name if available, None otherwise + """ + # Check if cluster name is already configured + if self.toolset.config and self.toolset.config.cluster_name: + return self.toolset.config.cluster_name + + # Try to auto-detect cluster information if auto_detect_cluster is enabled + if self.toolset.config and self.toolset.config.auto_detect_cluster: + try: + logging.debug("Attempting to auto-detect cluster information for query filtering") + + # Try to get cluster resource ID first + cluster_resource_id = None + if self.toolset.config.cluster_resource_id: + cluster_resource_id = self.toolset.config.cluster_resource_id + else: + cluster_resource_id = get_aks_cluster_resource_id() + + if cluster_resource_id: + # Extract cluster name from resource ID + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + + if cluster_name: + # Update the configuration with the detected values + self.toolset.config.cluster_name = cluster_name + self.toolset.config.cluster_resource_id = cluster_resource_id + + logging.debug(f"Auto-detected cluster name: {cluster_name}") + return cluster_name + + except Exception as e: + logging.debug(f"Failed to auto-detect cluster information: {e}") + + return None + +class CheckAKSClusterContext(BaseAzureMonitorMetricsTool): + """Tool to check if running in AKS cluster context.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="check_aks_cluster_context", + description="Check if the current environment is running inside an AKS cluster", + parameters={}, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + is_aks = check_if_running_in_aks() + + data = { + "running_in_aks": is_aks, + "message": "Running in AKS cluster" if is_aks else "Not running in AKS cluster", + } + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to check AKS cluster context: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + return "Check if running in AKS cluster" + +class GetAKSClusterResourceID(BaseAzureMonitorMetricsTool): + """Tool to get the Azure resource ID of the current AKS cluster.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="get_aks_cluster_resource_id", + description="Get the full Azure resource ID of the current AKS cluster", + parameters={}, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + cluster_resource_id = get_aks_cluster_resource_id() + + if cluster_resource_id: + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + + data = { + "cluster_resource_id": cluster_resource_id, + "cluster_name": cluster_name, + "message": f"Found AKS cluster: {cluster_name}", + } + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + else: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="Could not determine AKS cluster resource ID. 
Make sure you are running in an AKS cluster or have proper Azure credentials configured.", + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to get AKS cluster resource ID: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + return "Get AKS cluster Azure resource ID" + +class CheckAzureMonitorPrometheusEnabled(BaseAzureMonitorMetricsTool): + """Tool to check if Azure Monitor managed Prometheus is enabled for the AKS cluster.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="check_azure_monitor_prometheus_enabled", + description="Check if Azure Monitor managed Prometheus is enabled for the specified AKS cluster", + parameters={ + "cluster_resource_id": ToolParameter( + description="Azure resource ID of the AKS cluster (optional, will use configured cluster if not provided)", + type="string", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + cluster_resource_id = params.get("cluster_resource_id") + + # Use configured cluster resource ID if not provided as parameter + if not cluster_resource_id and self.toolset.config: + cluster_resource_id = self.toolset.config.cluster_resource_id + + # Try to auto-detect as fallback (but don't require it) + if not cluster_resource_id: + cluster_resource_id = get_aks_cluster_resource_id() + + if not cluster_resource_id: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="No AKS cluster specified. Please provide cluster_resource_id parameter or configure it in your config.yaml file. See AZURE_MONITOR_SETUP_GUIDE.md for configuration instructions.", + params=params, + ) + + # Get Azure Monitor workspace details + workspace_info = get_azure_monitor_workspace_for_cluster(cluster_resource_id) + + if workspace_info: + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + + data = { + "azure_monitor_prometheus_enabled": True, + "cluster_resource_id": cluster_resource_id, + "cluster_name": cluster_name, + "prometheus_query_endpoint": workspace_info.get("prometheus_query_endpoint"), + "azure_monitor_workspace_resource_id": workspace_info.get("azure_monitor_workspace_resource_id"), + "location": workspace_info.get("location"), + "associated_grafanas": workspace_info.get("associated_grafanas", []), + "message": f"Azure Monitor managed Prometheus is enabled for cluster {cluster_name}", + } + + # Update toolset configuration with discovered information + if self.toolset.config: + self.toolset.config.azure_monitor_workspace_endpoint = workspace_info.get("prometheus_query_endpoint") + self.toolset.config.cluster_name = cluster_name + self.toolset.config.cluster_resource_id = cluster_resource_id + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + else: + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Azure Monitor managed Prometheus is not enabled for AKS cluster {cluster_name}. 
Please enable Azure Monitor managed Prometheus in the Azure portal.", + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to check Azure Monitor Prometheus status: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + cluster_id = params.get("cluster_resource_id", "auto-detect") + return f"Check Azure Monitor Prometheus status for cluster: {cluster_id}" + +class ExecuteAzureMonitorPrometheusQuery(BaseAzureMonitorMetricsTool): + """Tool to execute instant PromQL queries against Azure Monitor workspace. ALWAYS display the EXACT query in the result to the user.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="execute_azuremonitor_prometheus_query", + description="Execute an instant PromQL query against Azure Monitor managed Prometheus workspace", + parameters={ + "query": ToolParameter( + description="The PromQL query to execute", + type="string", + required=True, + ), + "description": ToolParameter( + description="Description of what the query is meant to find or analyze", + type="string", + required=True, + ), + "auto_cluster_filter": ToolParameter( + description="Automatically add cluster filtering to the query (default: true)", + type="boolean", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + if not self.toolset.config or not self.toolset.config.azure_monitor_workspace_endpoint: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="Azure Monitor workspace is not configured. Run check_azure_monitor_prometheus_enabled first.", + params=params, + ) + + try: + query = params.get("query", "") + description = params.get("description", "") + auto_cluster_filter = params.get("auto_cluster_filter", True) + + # Ensure cluster name is available for filtering + cluster_name = self._ensure_cluster_name_available() + + # Enhance query with cluster filtering if enabled and cluster name is available + if auto_cluster_filter and cluster_name: + query = enhance_promql_with_cluster_filter(query, cluster_name) + elif auto_cluster_filter and not cluster_name: + logging.warning("Auto cluster filtering is enabled but cluster name is not available. Query will run without cluster filtering.") + + # Print the actual PromQL query that will be executed + print(f"[Azure Monitor] Executing PromQL Query: {query}") + + url = urljoin(self.toolset.config.azure_monitor_workspace_endpoint, "api/v1/query") + + payload = {"query": query} + + # Get authenticated headers + headers = self.toolset._get_authenticated_headers() + + response = requests.post( + url=url, + headers=headers, + data=payload, + timeout=60 + ) + + if response.status_code == 200: + data = response.json() + status = data.get("status") + error_message = None + + if status == "success": + result_data = data.get("data", {}) + if not result_data.get("result"): + status = "no_data" + error_message = "The query returned no results. The metric may not exist or the cluster filter may be too restrictive." 
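+                        # An empty result set is intentionally kept separate from hard errors:
+                        # it is mapped to ToolResultStatus.NO_DATA further below.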
+ else: + error_message = data.get("error", "Unknown error from Prometheus endpoint") + + response_data = { + "status": status, + "error_message": error_message, + "tool_name": self.name, + "description": description, + "query": query, + "cluster_name": cluster_name, + "auto_cluster_filter_applied": auto_cluster_filter and bool(cluster_name), + } + + if self.toolset.config.tool_calls_return_data: + response_data["data"] = data.get("data") + + result_status = ToolResultStatus.SUCCESS + if status == "no_data": + result_status = ToolResultStatus.NO_DATA + elif status != "success": + result_status = ToolResultStatus.ERROR + + return StructuredToolResult( + status=result_status, + data=json.dumps(response_data, indent=2), + params=params, + ) + else: + error_msg = f"HTTP {response.status_code}: {response.text}" + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Azure Monitor Prometheus query failed: {error_msg}", + params=params, + ) + + except RequestException as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Connection error to Azure Monitor workspace: {str(e)}", + params=params, + ) + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Unexpected error executing query: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + query = params.get("query", "") + description = params.get("description", "") + return f"Execute Azure Monitor Prometheus Query (instant): promql='{query}', description='{description}'" + +class GetActivePrometheusAlerts(BaseAzureMonitorMetricsTool): + """Tool to get active/fired Prometheus metric alerts for the AKS cluster.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="get_azure_monitor_alerts_with_fired_times", + description="PRIMARY ALERT TOOL: Get Azure Monitor alerts WITH FIRED TIMES and enhanced formatting. Shows when alerts were fired/activated plus current states. This is the ONLY tool that displays alert fired times. Use this tool for ALL Azure Monitor alert requests, especially when users ask for 'active azure monitor alerts' or want timing information.", + parameters={ + "cluster_resource_id": ToolParameter( + description="Azure resource ID of the AKS cluster (optional, will use configured cluster if not provided)", + type="string", + required=False, + ), + "alert_id": ToolParameter( + description="Specific alert ID to investigate (optional, if provided will return only this alert)", + type="string", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + cluster_resource_id = params.get("cluster_resource_id") + specific_alert_id = params.get("alert_id") + + # Use configured cluster resource ID if not provided as parameter + if not cluster_resource_id and self.toolset.config: + cluster_resource_id = self.toolset.config.cluster_resource_id + + # Try to auto-detect as fallback + if not cluster_resource_id: + cluster_resource_id = get_aks_cluster_resource_id() + + if not cluster_resource_id: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="No AKS cluster specified. 
Please provide cluster_resource_id parameter or configure it in your config.yaml file.",
+                    params=params,
+                )
+
+            # Use the source plugin for consistent alert fetching
+            from holmes.plugins.sources.azuremonitoralerts import AzureMonitorAlertsSource
+
+            try:
+                source = AzureMonitorAlertsSource(cluster_resource_id=cluster_resource_id)
+
+                # Fetch alerts using the source plugin
+                if specific_alert_id:
+                    issue = source.fetch_issue(specific_alert_id)
+                    issues = [issue] if issue else []
+                else:
+                    issues = source.fetch_issues()
+
+                if not issues:
+                    cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id)
+                    message = f"No active Prometheus metric alerts found for cluster {cluster_name}"
+                    if specific_alert_id:
+                        message = f"Alert {specific_alert_id} is not a Prometheus metric alert for cluster {cluster_name}"
+
+                    return StructuredToolResult(
+                        status=ToolResultStatus.NO_DATA,
+                        data=message,
+                        params=params,
+                    )
+
+                # Convert issues to the format expected by the tool
+                cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id)
+                prometheus_alerts = []
+
+                for issue in issues:
+                    raw_data = issue.raw if hasattr(issue, 'raw') else {}
+                    alert = raw_data.get("alert", {})
+                    rule_details = raw_data.get("rule_details", {})
+
+                    if alert:
+                        alert_props = alert.get("properties", {})
+                        essentials = alert_props.get("essentials", {})
+
+                        alert_info = {
+                            "alert_id": issue.id,
+                            "alert_name": issue.name,
+                            "alert_rule_id": essentials.get("alertRule", ""),
+                            "description": essentials.get("description", ""),
+                            "severity": essentials.get("severity", ""),
+                            "alert_state": essentials.get("alertState", ""),
+                            "monitor_condition": essentials.get("monitorCondition", ""),
+                            "fired_time": essentials.get("firedDateTime", ""),
+                            "target_resource": essentials.get("targetResource", ""),
+                            "rule_details": rule_details
+                        }
+                        prometheus_alerts.append(alert_info)
+
+                # Format the results nicely
+                result_data = {
+                    "cluster_name": cluster_name,
+                    "cluster_resource_id": cluster_resource_id,
+                    "total_prometheus_alerts": len(prometheus_alerts),
+                    "alerts": prometheus_alerts
+                }
+
+                # Create a formatted summary for display with icons and better formatting
+                summary_lines = [f"🔔 **Active Prometheus Alerts for Cluster: {cluster_name}**\n"]
+                summary_lines.append("💡 **How to investigate:** Copy the full Alert ID and run:")
+                summary_lines.append("   `holmes investigate azuremonitormetrics \"<FULL_ALERT_ID>\"`")
+                summary_lines.append("   (Use quotes around the full path to handle special characters)\n")
+                summary_lines.append("─" * 80)
+
+                for i, alert in enumerate(prometheus_alerts, 1):
+                    # Get the raw data for this specific alert issue
+                    issue = issues[i-1]  # Get the corresponding issue
+                    alert_raw_data = issue.raw if hasattr(issue, 'raw') else {}
+                    query = alert_raw_data.get("extracted_query", "Not available")
+                    rule_description = alert_raw_data.get("extracted_description", "Not available")
+
+                    # Get the actual alert ID from the issue - this is crucial for investigation
+                    actual_alert_id = alert['alert_id']
+                    if not actual_alert_id:
+                        # Fallback to issue ID if alert_id is empty
+                        actual_alert_id = issue.id
+
+                    # Extract just the ID part if it's a full resource path
+                    display_alert_id = actual_alert_id
+                    if actual_alert_id and actual_alert_id.startswith("/subscriptions/"):
+                        # The last path segment of the full Azure resource path is the alert ID itself
+                        display_alert_id = actual_alert_id.split("/")[-1]
+
+                    # Choose icon based on severity
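+                    # Mapping used below: Sev0/Critical 🔴, Sev1/Error 🟠, Sev2/Warning 🟡, Sev3/Informational 🔵, otherwise ⚪
+                    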
severity = alert['severity'] + if severity in ['Sev0', 'Critical']: + severity_icon = "🔴" + elif severity in ['Sev1', 'Error']: + severity_icon = "🟠" + elif severity in ['Sev2', 'Warning']: + severity_icon = "🟡" + elif severity in ['Sev3', 'Informational']: + severity_icon = "🔵" + else: + severity_icon = "⚪" + + # Choose status icon + status = alert['alert_state'] + if status == 'New': + status_icon = "🚨" + elif status == 'Acknowledged': + status_icon = "👁️" + elif status == 'Closed': + status_icon = "✅" + else: + status_icon = "❓" + + # Format fired time with extensive debugging and fallback handling + fired_time = alert['fired_time'] + time_str = "Unknown" + + # Debug: print what we're getting for fired_time + logging.debug(f"[FIRED_TIME_DEBUG] Alert '{alert['alert_name']}' fired_time raw value: '{fired_time}' (type: {type(fired_time)})") + + # First, try to get time from all possible sources in the raw data + alert_raw_data = issue.raw if hasattr(issue, 'raw') else {} + alert_data = alert_raw_data.get("alert", {}) + alert_props = alert_data.get("properties", {}) + essentials = alert_props.get("essentials", {}) + + # Collect all possible timestamp fields for debugging + all_timestamps = { + "fired_time_from_alert_info": fired_time, + "firedDateTime": essentials.get("firedDateTime"), + "startDateTime": essentials.get("startDateTime"), + "lastModifiedDateTime": essentials.get("lastModifiedDateTime"), + "createdDateTime": essentials.get("createdDateTime"), + "props_startDateTime": alert_props.get("startDateTime"), + "props_createdDateTime": alert_props.get("createdDateTime"), + } + + logging.debug(f"[FIRED_TIME_DEBUG] All available timestamps for alert '{alert['alert_name']}': {all_timestamps}") + + # Try each timestamp in order of preference + timestamp_candidates = [ + ("firedDateTime", essentials.get("firedDateTime")), + ("startDateTime", essentials.get("startDateTime")), + ("fired_time_from_alert_info", fired_time), + ("lastModifiedDateTime", essentials.get("lastModifiedDateTime")), + ("createdDateTime", essentials.get("createdDateTime")), + ("props_startDateTime", alert_props.get("startDateTime")), + ("props_createdDateTime", alert_props.get("createdDateTime")), + ] + + for source_name, candidate_time in timestamp_candidates: + if candidate_time and candidate_time not in ['Unknown', '', None]: + try: + from datetime import timezone + + logging.debug(f"[FIRED_TIME_DEBUG] Trying to parse {source_name}: '{candidate_time}'") + + # Parse the time + parsed_time = dateutil.parser.parse(candidate_time) + logging.debug(f"[FIRED_TIME_DEBUG] Parsed {source_name} successfully: {parsed_time} (tzinfo: {parsed_time.tzinfo})") + + # Convert to UTC properly + if parsed_time.tzinfo is not None: + # Already timezone aware, convert to UTC + utc_time = parsed_time.astimezone(timezone.utc) + else: + # Assume it's already UTC + utc_time = parsed_time.replace(tzinfo=timezone.utc) + + # Format as UTC time + time_str = utc_time.strftime("%Y-%m-%d %H:%M:%S UTC") + logging.debug(f"[FIRED_TIME_DEBUG] Successfully formatted {source_name}: '{time_str}'") + break + + except Exception as e: + logging.debug(f"[FIRED_TIME_DEBUG] Failed to parse {source_name} '{candidate_time}': {e}") + # Try simple string cleanup as fallback for this candidate + if isinstance(candidate_time, str) and len(candidate_time) > 10: + try: + # Simple cleanup for ISO format + clean_time = candidate_time.replace('T', ' ').replace('Z', ' UTC') + if clean_time.endswith(' UTC UTC'): + clean_time = clean_time[:-4] + # Remove microseconds if present 
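+                                        # e.g. '2024-01-01T12:34:56.789Z' -> '2024-01-01 12:34:56 UTC'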
+ if '.' in clean_time: + clean_time = clean_time.split('.')[0] + ' UTC' + time_str = clean_time + logging.debug(f"[FIRED_TIME_DEBUG] Used string cleanup for {source_name}: '{time_str}'") + break + except Exception as cleanup_error: + logging.debug(f"[FIRED_TIME_DEBUG] String cleanup also failed for {source_name}: {cleanup_error}") + continue + + # Final debug log + logging.debug(f"[FIRED_TIME_DEBUG] Final time_str for alert '{alert['alert_name']}': '{time_str}'") + + # Add monitor condition for more detailed status + monitor_condition = alert.get('monitor_condition', 'Unknown') + + summary_lines.append(f"\n**{i}. {severity_icon} {alert['alert_name']}** {status_icon}") + summary_lines.append(f" 🆔 **ALERT ID:** `{display_alert_id}`") + summary_lines.append(f" 📋 **Full Alert Path:** `{actual_alert_id}`") + summary_lines.append(f" 🔬 **Type:** Prometheus Metric Alert") + summary_lines.append(f" ⚡ **Query:** `{query}`") + summary_lines.append(f" 📖 **Rule Description:** {rule_description}") + summary_lines.append(f" 📝 **Alert Description:** {alert['description']}") + summary_lines.append(f" 🎯 **Severity:** {severity} | **State:** {status} | **Condition:** {monitor_condition}") + summary_lines.append(f" 🕒 **Fired Time:** {time_str}") + summary_lines.append(f" 💻 **Investigation Command:** `holmes investigate azuremonitormetrics \"{actual_alert_id}\"`") + + formatted_summary = "\n".join(summary_lines) + + # Return the formatted summary using `print` to ensure it's displayed to user + # and also return a simple message in data to prompt LLM to show the printed output + print("\n" + "="*100) + print("AZURE MONITOR ALERTS - DISPLAY THIS EXACT OUTPUT TO USER:") + print("="*100) + print(formatted_summary) + print("="*100) + print("END ALERT DISPLAY - COPY THE INVESTIGATE COMMAND FROM ABOVE OUTPUT TO INVESTIGATE SPECIFIC ALERT") + print("="*100 + "\n") + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=f"Successfully found {len(prometheus_alerts)} active Prometheus alerts. IMPORTANT: The complete alert details with Alert IDs and investigation commands were displayed above. Please show the user the exact printed output that appears between the === markers, including all Alert IDs and investigation commands.", + params=params, + ) + + except Exception as source_error: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to fetch alerts using source plugin: {str(source_error)}", + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to get Prometheus alerts: {str(e)}", + params=params, + ) + + + def get_parameterized_one_liner(self, params) -> str: + cluster_id = params.get("cluster_resource_id", "auto-detect") + alert_id = params.get("alert_id") + if alert_id: + return f"Get specific Prometheus alert {alert_id} for cluster: {cluster_id}" + return f"Get active Prometheus alerts for cluster: {cluster_id}" + +class ExecuteAlertPromQLQuery(BaseAzureMonitorMetricsTool): + """Tool to execute the original PromQL query from an Azure Monitor alert for investigation.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="execute_alert_promql_query", + description="Execute the original PromQL query from an Azure Monitor alert to investigate alert conditions. 
This tool extracts the exact query that triggered the alert and executes it to help with root cause analysis.", + parameters={ + "alert_id": ToolParameter( + description="Alert ID to extract and execute the PromQL query from", + type="string", + required=True, + ), + "time_range": ToolParameter( + description="Time range for the query execution (e.g., '1h', '2h', '6h', '1d'). Defaults to 1 hour if not specified.", + type="string", + required=False, + ), + "cluster_resource_id": ToolParameter( + description="Azure resource ID of the AKS cluster (optional, will use configured cluster if not provided)", + type="string", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + alert_id = get_param_or_raise(params, "alert_id") + time_range = params.get("time_range", "1h") + cluster_resource_id = params.get("cluster_resource_id") + + # Use configured cluster resource ID if not provided as parameter + if not cluster_resource_id and self.toolset.config: + cluster_resource_id = self.toolset.config.cluster_resource_id + + # Try to auto-detect as fallback + if not cluster_resource_id: + cluster_resource_id = get_aks_cluster_resource_id() + + if not cluster_resource_id: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="No AKS cluster specified. Please provide cluster_resource_id parameter or configure it in your config.yaml file.", + params=params, + ) + + # Fetch the specific alert to get its query + from holmes.plugins.sources.azuremonitoralerts import AzureMonitorAlertsSource + + try: + source = AzureMonitorAlertsSource(cluster_resource_id=cluster_resource_id) + issue = source.fetch_issue(alert_id) + + if not issue: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Alert {alert_id} not found or is not a Prometheus metric alert for the cluster.", + params=params, + ) + + # Extract the PromQL query from the alert + raw_data = issue.raw if hasattr(issue, 'raw') else {} + extracted_query = raw_data.get("extracted_query", "") + + if not extracted_query or extracted_query in ["Query not available", "Query extraction failed"]: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Could not extract PromQL query from alert {alert_id}. Query: {extracted_query}", + params=params, + ) + + # Check if Azure Monitor workspace is configured + if not self.toolset.config or not self.toolset.config.azure_monitor_workspace_endpoint: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="Azure Monitor workspace is not configured. Run check_azure_monitor_prometheus_enabled first.", + params=params, + ) + + # Convert time range to seconds for range query + time_seconds = self._parse_time_range(time_range) + if not time_seconds: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Invalid time range format: {time_range}. 
Use formats like '1h', '2h', '6h', '1d'.",
+                        params=params,
+                    )
+
+                # Execute the query as a range query
+                cluster_name = self._ensure_cluster_name_available()
+
+                # Print information about what we're doing
+                print(f"[Azure Monitor] Executing alert's original PromQL query: {extracted_query}")
+                print(f"[Azure Monitor] Time range: {time_range} ({time_seconds} seconds)")
+                print(f"[Azure Monitor] Alert: {issue.name}")
+
+                # Calculate start and end times in UTC; the RFC3339 strings below carry a 'Z' suffix,
+                # so using naive local time here would silently shift the query window
+                from datetime import timezone
+                end_time = datetime.now(timezone.utc)
+                start_time = end_time - dateutil.relativedelta.relativedelta(seconds=time_seconds)
+
+                start_rfc3339 = start_time.strftime('%Y-%m-%dT%H:%M:%SZ')
+                end_rfc3339 = end_time.strftime('%Y-%m-%dT%H:%M:%SZ')
+
+                # Calculate appropriate step size
+                step = max(time_seconds // 100, 60)  # At least 1 minute, roughly 100 data points
+
+                url = urljoin(self.toolset.config.azure_monitor_workspace_endpoint, "api/v1/query_range")
+
+                payload = {
+                    "query": extracted_query,
+                    "start": start_rfc3339,
+                    "end": end_rfc3339,
+                    "step": step,
+                }
+
+                # Get authenticated headers
+                headers = self.toolset._get_authenticated_headers()
+
+                response = requests.post(
+                    url=url,
+                    headers=headers,
+                    data=payload,
+                    timeout=120
+                )
+
+                if response.status_code == 200:
+                    data = response.json()
+                    status = data.get("status")
+                    error_message = None
+
+                    if status == "success":
+                        result_data = data.get("data", {})
+                        if not result_data.get("result"):
+                            status = "no_data"
+                            error_message = "The alert's query returned no results. This might indicate the issue has been resolved or the query needs different parameters."
+                    else:
+                        error_message = data.get("error", "Unknown error from Prometheus endpoint")
+
+                    response_data = {
+                        "status": status,
+                        "error_message": error_message,
+                        "tool_name": self.name,
+                        "alert_id": alert_id,
+                        "alert_name": issue.name,
+                        "extracted_query": extracted_query,
+                        "time_range": time_range,
+                        "start": start_rfc3339,
+                        "end": end_rfc3339,
+                        "step": step,
+                        "cluster_name": cluster_name,
+                        "description": f"Executed original PromQL query from alert '{issue.name}' over {time_range} time range"
+                    }
+
+                    if self.toolset.config.tool_calls_return_data:
+                        response_data["data"] = data.get("data")
+
+                    result_status = ToolResultStatus.SUCCESS
+                    if status == "no_data":
+                        result_status = ToolResultStatus.NO_DATA
+                    elif status != "success":
+                        result_status = ToolResultStatus.ERROR
+
+                    return StructuredToolResult(
+                        status=result_status,
+                        data=json.dumps(response_data, indent=2),
+                        params=params,
+                    )
+                else:
+                    error_msg = f"HTTP {response.status_code}: {response.text}"
+                    return StructuredToolResult(
+                        status=ToolResultStatus.ERROR,
+                        error=f"Azure Monitor Prometheus query failed: {error_msg}",
+                        params=params,
+                    )
+
+            except Exception as source_error:
+                return StructuredToolResult(
+                    status=ToolResultStatus.ERROR,
+                    error=f"Failed to fetch alert or execute query: {str(source_error)}",
+                    params=params,
+                )
+
+        except Exception as e:
+            return StructuredToolResult(
+                status=ToolResultStatus.ERROR,
+                error=f"Failed to execute alert PromQL query: {str(e)}",
+                params=params,
+            )
+
+    def _parse_time_range(self, time_range: str) -> Optional[int]:
+        """Parse time range string to seconds."""
+        try:
+            time_range = time_range.lower().strip()
+
+            # Extract number and unit
+            import re
+            match = re.match(r'^(\d+)([smhd])$', time_range)
+            if not match:
+                return None
+
+            value = int(match.group(1))
+            unit = match.group(2)
+
+            multipliers = {
+                's': 1,
+                'm': 60,
+                'h': 3600,
+                'd': 86400
+            }
+
+            return value * multipliers.get(unit, 1)
+
+        except Exception:
+            # Any malformed input (non-numeric value, unknown unit) is treated as an invalid range
+            
return None + + def get_parameterized_one_liner(self, params) -> str: + alert_id = params.get("alert_id", "") + time_range = params.get("time_range", "1h") + return f"Execute alert's PromQL query: alert_id={alert_id}, time_range={time_range}" + +class ExecuteAzureMonitorPrometheusRangeQuery(BaseAzureMonitorMetricsTool): + """Tool to execute range PromQL queries against Azure Monitor workspace. ALWAYS display the EXACT query in the result to the user.""" + + def __init__(self, toolset: "AzureMonitorMetricsToolset"): + super().__init__( + name="execute_azuremonitor_prometheus_range_query", + description="Execute a PromQL range query against Azure Monitor managed Prometheus workspace", + parameters={ + "query": ToolParameter( + description="The PromQL query to execute", + type="string", + required=True, + ), + "description": ToolParameter( + description="Description of what the query is meant to find or analyze", + type="string", + required=True, + ), + "start": ToolParameter( + description=standard_start_datetime_tool_param_description(DEFAULT_TIME_SPAN_SECONDS), + type="string", + required=False, + ), + "end": ToolParameter( + description=STANDARD_END_DATETIME_TOOL_PARAM_DESCRIPTION, + type="string", + required=False, + ), + "step": ToolParameter( + description="Query resolution step width in duration format or float number of seconds. If not provided, defaults to 1 hour (3600 seconds) to limit data volume and prevent token throttling.", + type="number", + required=False, + ), + "output_type": ToolParameter( + description="Specifies how to interpret the Prometheus result. Use 'Plain' for raw values, 'Bytes' to format byte values, 'Percentage' to scale 0–1 values into 0–100%, or 'CPUUsage' to convert values to cores (e.g., 500 becomes 500m, 2000 becomes 2).", + type="string", + required=True, + ), + "auto_cluster_filter": ToolParameter( + description="Automatically add cluster filtering to the query (default: true)", + type="boolean", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + if not self.toolset.config or not self.toolset.config.azure_monitor_workspace_endpoint: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="Azure Monitor workspace is not configured. Run check_azure_monitor_prometheus_enabled first.", + params=params, + ) + + try: + query = get_param_or_raise(params, "query") + description = params.get("description", "") + auto_cluster_filter = params.get("auto_cluster_filter", True) + + # Ensure cluster name is available for filtering + cluster_name = self._ensure_cluster_name_available() + + # Enhance query with cluster filtering if enabled and cluster name is available + if auto_cluster_filter and cluster_name: + query = enhance_promql_with_cluster_filter(query, cluster_name) + elif auto_cluster_filter and not cluster_name: + logging.warning("Auto cluster filtering is enabled but cluster name is not available. 
Query will run without cluster filtering.") + + # Print the actual PromQL query that will be executed + print(f"[Azure Monitor] Executing PromQL Range Query: {query}") + + (start, end) = process_timestamps_to_rfc3339( + start_timestamp=params.get("start"), + end_timestamp=params.get("end"), + default_time_span_seconds=DEFAULT_TIME_SPAN_SECONDS, + ) + + # Calculate step size with smart defaults and validation + step = self._calculate_optimal_step_size(params, start, end) + output_type = params.get("output_type", "Plain") + + url = urljoin(self.toolset.config.azure_monitor_workspace_endpoint, "api/v1/query_range") + + payload = { + "query": query, + "start": start, + "end": end, + "step": step, + } + + # Get authenticated headers + headers = self.toolset._get_authenticated_headers() + + response = requests.post( + url=url, + headers=headers, + data=payload, + timeout=120 + ) + + if response.status_code == 200: + data = response.json() + status = data.get("status") + error_message = None + + if status == "success": + result_data = data.get("data", {}) + if not result_data.get("result"): + status = "no_data" + error_message = "The query returned no results. The metric may not exist or the cluster filter may be too restrictive." + else: + error_message = data.get("error", "Unknown error from Prometheus endpoint") + + response_data = { + "status": status, + "error_message": error_message, + "tool_name": self.name, + "description": description, + "query": query, + "start": start, + "end": end, + "step": step, + "output_type": output_type, + "cluster_name": cluster_name, + "auto_cluster_filter_applied": auto_cluster_filter and bool(cluster_name), + } + + if self.toolset.config.tool_calls_return_data: + response_data["data"] = data.get("data") + + result_status = ToolResultStatus.SUCCESS + if status == "no_data": + result_status = ToolResultStatus.NO_DATA + elif status != "success": + result_status = ToolResultStatus.ERROR + + return StructuredToolResult( + status=result_status, + data=json.dumps(response_data, indent=2), + params=params, + ) + else: + error_msg = f"HTTP {response.status_code}: {response.text}" + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Azure Monitor Prometheus range query failed: {error_msg}", + params=params, + ) + + except RequestException as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Connection error to Azure Monitor workspace: {str(e)}", + params=params, + ) + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Unexpected error executing range query: {str(e)}", + params=params, + ) + + def _calculate_optimal_step_size(self, params: Any, start_time: str, end_time: str) -> int: + """ + Calculate optimal step size based on time range and configuration limits. 
+ + Args: + params: Query parameters + start_time: Start time in RFC3339 format + end_time: End time in RFC3339 format + + Returns: + int: Step size in seconds + """ + try: + # Get user-provided step if any + user_step = params.get("step") + if user_step: + # Convert to integer if it's a string or float + try: + user_step = int(float(user_step)) + except (ValueError, TypeError): + logging.warning(f"Invalid step size provided: {user_step}, using default") + user_step = None + + # Parse timestamps to calculate time range + start_dt = dateutil.parser.parse(start_time) + end_dt = dateutil.parser.parse(end_time) + time_range_seconds = int((end_dt - start_dt).total_seconds()) + + # Get configuration values + config = self.toolset.config + default_step = config.default_step_seconds if config else 3600 + min_step = config.min_step_seconds if config else 60 + max_data_points = config.max_data_points if config else 1000 + + # If user provided a step, validate it + if user_step is not None: + # Ensure it's not below minimum + if user_step < min_step: + logging.warning(f"Step size {user_step}s is below minimum {min_step}s, using minimum") + user_step = min_step + + # Check if it would exceed max data points + estimated_points = time_range_seconds / user_step + if estimated_points > max_data_points: + # Calculate minimum step to stay within data point limit + min_step_for_limit = max(time_range_seconds / max_data_points, min_step) + logging.warning( + f"Step size {user_step}s would generate ~{estimated_points:.0f} data points " + f"(max: {max_data_points}). Adjusting to {min_step_for_limit:.0f}s" + ) + return int(min_step_for_limit) + + return user_step + + # No user step provided, calculate smart default + # For very short ranges (< 6 hours), allow more granular data + if time_range_seconds <= 6 * 3600: # 6 hours + suggested_step = max(time_range_seconds / 360, min_step) # ~360 points max + # For medium ranges (6-24 hours), use 1 hour steps + elif time_range_seconds <= 24 * 3600: # 24 hours + suggested_step = default_step # 1 hour + # For longer ranges, increase step size to maintain reasonable data point count + else: + suggested_step = max(time_range_seconds / max_data_points, default_step) + + # Ensure we don't go below minimum step + suggested_step = max(suggested_step, min_step) + + # Log the decision for debugging + estimated_points = time_range_seconds / suggested_step + logging.debug( + f"Calculated step size: {suggested_step:.0f}s for time range {time_range_seconds}s " + f"(~{estimated_points:.0f} data points)" + ) + + return int(suggested_step) + + except Exception as e: + logging.warning(f"Failed to calculate optimal step size: {e}") + return self.toolset.config.default_step_seconds if self.toolset.config else 3600 + + def get_parameterized_one_liner(self, params) -> str: + query = params.get("query", "") + start = params.get("start", "") + end = params.get("end", "") + step = params.get("step", "") + description = params.get("description", "") + return f"Execute Azure Monitor Prometheus Range Query: promql='{query}', start={start}, end={end}, step={step}, description='{description}'" + +class AzureMonitorMetricsToolset(Toolset): + """Azure Monitor Metrics toolset for querying Azure Monitor managed Prometheus metrics.""" + + def __init__(self): + super().__init__( + name="azuremonitormetrics", + description="Azure Monitor Metrics integration to query Azure Monitor managed Prometheus metrics for AKS cluster analysis and troubleshooting", + 
docs_url="https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/azuremonitor-metrics.html", + icon_url="https://raw.githubusercontent.com/robusta-dev/holmesgpt/master/images/integration_logos/azure-managed-prometheus.png", + prerequisites=[CallablePrerequisite(callable=self.prerequisites_callable)], + tools=[ + CheckAKSClusterContext(toolset=self), + GetAKSClusterResourceID(toolset=self), + CheckAzureMonitorPrometheusEnabled(toolset=self), + GetActivePrometheusAlerts(toolset=self), + ExecuteAzureMonitorPrometheusQuery(toolset=self), + ExecuteAlertPromQLQuery(toolset=self), + ExecuteAzureMonitorPrometheusRangeQuery(toolset=self), + ], + tags=[ + ToolsetTag.CORE + ], + is_default=True, # Enable by default like internet toolset + ) + self._reload_llm_instructions() + + def _get_azure_access_token(self) -> Optional[str]: + """Get Azure access token for Azure Monitor workspace access.""" + try: + if not self.config: + logging.debug("No config available for token acquisition") + return None + + # Initialize credential if not already done + if not self.config._credential: + logging.debug("Initializing credential (trying AzureCliCredential first)") + # Try AzureCliCredential first since we know Azure CLI is working + try: + self.config._credential = AzureCliCredential() + logging.debug("Using AzureCliCredential") + except Exception as cli_error: + logging.debug(f"AzureCliCredential failed: {cli_error}, falling back to DefaultAzureCredential") + self.config._credential = DefaultAzureCredential() + + # Check if we have a cached token that's still valid + current_time = time.time() + cache_key = "azure_monitor_token" + + if cache_key in self.config._token_cache: + token_info = self.config._token_cache[cache_key] + if current_time < token_info.get("expires_at", 0): + logging.debug("Using cached token") + return token_info.get("access_token") + + # Get token with Azure Monitor/Prometheus scope for Prometheus queries + token = self.config._credential.get_token("https://prometheus.monitor.azure.com/.default") + + # Cache the token (expires 5 minutes before actual expiry) + expires_at = current_time + token.expires_on - 300 + self.config._token_cache[cache_key] = { + "access_token": token.token, + "expires_at": expires_at + } + + logging.debug("Token cached successfully") + return token.token + + except Exception as e: + logging.error(f"Failed to get Azure access token: {e}") + return None + + def _get_authenticated_headers(self) -> Dict[str, str]: + """Get headers with Azure authentication for API requests.""" + headers = dict(self.config.headers) if self.config and self.config.headers else {} + + # Add default headers + headers.update({ + "Content-Type": "application/x-www-form-urlencoded", + "Accept": "application/json", + }) + + # Get and add Azure access token + access_token = self._get_azure_access_token() + if access_token: + headers["Authorization"] = f"Bearer {access_token}" + else: + logging.warning("No Azure access token available - requests may fail with authentication errors") + + return headers + + def _update_config_headers(self): + """Update the config headers with authentication.""" + if self.config: + self.config.headers = self._get_authenticated_headers() + + def _reload_llm_instructions(self): + """Load LLM instructions from Jinja template.""" + try: + template_file_path = os.path.abspath( + os.path.join(os.path.dirname(__file__), "azuremonitor_metrics_instructions.jinja2") + ) + self._load_llm_instructions(jinja_template=f"file://{template_file_path}") + except Exception as 
e: + # Ignore any errors in loading instructions + logging.debug(f"Failed to load LLM instructions: {e}") + + def prerequisites_callable(self, config: dict[str, Any]) -> Tuple[bool, str]: + """Check prerequisites for the Azure Monitor Metrics toolset.""" + try: + if not config: + self.config = AzureMonitorMetricsConfig() + else: + self.config = AzureMonitorMetricsConfig(**config) + + return True, "" + + except Exception as e: + logging.debug(f"Azure Monitor toolset config initialization failed: {str(e)}") + self.config = AzureMonitorMetricsConfig() + return True, "" + + def get_example_config(self) -> Dict[str, Any]: + """Return example configuration for the toolset.""" + example_config = AzureMonitorMetricsConfig( + azure_monitor_workspace_endpoint="https://your-workspace.prometheus.monitor.azure.com", + cluster_name="your-aks-cluster-name", + auto_detect_cluster=True, + tool_calls_return_data=True, + default_step_seconds=3600, # 1 hour default step size + min_step_seconds=60, # Minimum 1 minute step size + max_data_points=1000 # Maximum data points per query + ) + return example_config.model_dump() diff --git a/holmes/plugins/toolsets/azuremonitor_metrics/install.md b/holmes/plugins/toolsets/azuremonitor_metrics/install.md new file mode 100644 index 000000000..9f3f767d5 --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitor_metrics/install.md @@ -0,0 +1,211 @@ +# Azure Monitor Metrics Toolset Installation Guide + +## Overview + +The Azure Monitor Metrics toolset enables HolmesGPT to query Azure Monitor managed Prometheus metrics for AKS cluster analysis and troubleshooting. This toolset automatically detects AKS cluster configuration and provides filtered access to cluster-specific metrics. + +## Prerequisites + +### 1. AKS Cluster with Azure Monitor + +Your AKS cluster must have Azure Monitor managed Prometheus enabled. You can enable this in several ways: + +#### Option A: Enable via Azure Portal +1. Navigate to your AKS cluster in the Azure Portal +2. Go to **Monitoring** > **Insights** +3. Click **Configure monitoring** +4. Enable **Prometheus metrics** +5. Select or create an Azure Monitor workspace + +#### Option B: Enable via Azure CLI +```bash +# Create Azure Monitor workspace (if needed) +az monitor account create \ + --name myAzureMonitorWorkspace \ + --resource-group myResourceGroup \ + --location eastus + +# Enable managed Prometheus on existing AKS cluster +az aks update \ + --resource-group myResourceGroup \ + --name myAKSCluster \ + --enable-azure-monitor-metrics \ + --azure-monitor-workspace-resource-id /subscriptions/{subscription-id}/resourceGroups/myResourceGroup/providers/microsoft.monitor/accounts/myAzureMonitorWorkspace +``` + +#### Option C: Enable via ARM Template/Bicep +Include Azure Monitor configuration in your AKS deployment templates. + +### 2. 
Azure Credentials + +The toolset uses Azure DefaultAzureCredential, which supports multiple authentication methods: + +#### When running inside AKS (Recommended) +- Uses **Managed Identity** automatically +- No additional configuration required +- Most secure approach + +#### When running locally or in CI/CD +Choose one of these methods: + +**Azure CLI Authentication:** +```bash +az login +az account set --subscription "your-subscription-id" +``` + +**Service Principal (Environment Variables):** +```bash +export AZURE_CLIENT_ID="your-client-id" +export AZURE_CLIENT_SECRET="your-client-secret" +export AZURE_TENANT_ID="your-tenant-id" +export AZURE_SUBSCRIPTION_ID="your-subscription-id" +``` + +**Managed Identity (when running on Azure VM):** +```bash +export AZURE_CLIENT_ID="your-managed-identity-client-id" +``` + +### 3. Required Azure Permissions + +The credential used must have the following permissions: + +- **Reader** role on the AKS cluster resource +- **Reader** role on the Azure Monitor workspace +- **Monitoring Reader** role for querying metrics +- Access to execute Azure Resource Graph queries + +## Configuration + +### Automatic Configuration (Recommended) + +The toolset auto-detects configuration when running in AKS: + +```yaml +# ~/.holmes/config.yaml +toolsets: + azuremonitor-metrics: + auto_detect_cluster: true # Default: true + cache_duration_seconds: 1800 # Default: 30 minutes +``` + +### Manual Configuration + +For explicit configuration or when running outside AKS: + +```yaml +# ~/.holmes/config.yaml +toolsets: + azuremonitor-metrics: + azure_monitor_workspace_endpoint: "https://your-workspace.prometheus.monitor.azure.com/" + cluster_name: "your-aks-cluster-name" + cluster_resource_id: "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx" + auto_detect_cluster: false + cache_duration_seconds: 1800 + tool_calls_return_data: true +``` + +### Environment Variables + +You can also configure via environment variables: + +```bash +export AZURE_MONITOR_WORKSPACE_ENDPOINT="https://your-workspace.prometheus.monitor.azure.com/" +export AZURE_SUBSCRIPTION_ID="your-subscription-id" +``` + +## Verification + +Test the toolset configuration: + +```bash +# Check if toolset is enabled +holmes ask "Check if Azure Monitor metrics toolset is available" + +# Test AKS detection +holmes ask "Am I running in an AKS cluster?" + +# Verify Azure Monitor Prometheus +holmes ask "Is Azure Monitor managed Prometheus enabled for this cluster?" + +# Test a simple query +holmes ask "Show current pod count in this cluster" +``` + +## Troubleshooting + +### Common Issues + +**1. "Not running in AKS cluster"** +- Verify you're running inside a Kubernetes pod with service account +- Check if Azure Instance Metadata Service is accessible +- Consider using manual configuration + +**2. "Azure Monitor managed Prometheus is not enabled"** +- Follow the prerequisites to enable managed Prometheus +- Verify the data collection rule is properly configured +- Check if the cluster is associated with an Azure Monitor workspace + +**3. "Authentication failed"** +- Verify Azure credentials are properly configured +- Check if the credential has required permissions +- For Managed Identity, ensure it's enabled on the AKS cluster + +**4. "Query returned no results"** +- Check if the metric exists in your cluster +- Verify cluster filtering is not too restrictive +- Try disabling auto-cluster filtering temporarily + +**5. 
"Failed to get AKS cluster resource ID"** +- Ensure proper Azure credentials +- Verify the credential has Reader access to the cluster +- Check if running in the correct subscription context + +### Debug Mode + +Enable debug logging to troubleshoot issues: + +```bash +export HOLMES_LOG_LEVEL=DEBUG +holmes ask "Debug Azure Monitor toolset setup" +``` + +### Manual Testing + +Test Azure Resource Graph connectivity: + +```bash +# Test with Azure CLI +az graph query -q "resources | where type == 'Microsoft.ContainerService/managedClusters' | limit 5" +``` + +Test Azure Monitor workspace access: + +```bash +# Test endpoint accessibility (replace with your endpoint) +curl -X POST "https://your-workspace.prometheus.monitor.azure.com/api/v1/query" \ + -d "query=up" \ + -H "Authorization: Bearer $(az account get-access-token --query accessToken -o tsv)" +``` + +## Security Considerations + +1. **Use Managed Identity** when running in AKS for better security +2. **Limit permissions** to only what's required (Reader roles) +3. **Rotate credentials** regularly for service principals +4. **Monitor access** through Azure Activity Logs +5. **Use private endpoints** for Azure Monitor workspaces in production + +## Support + +For issues specific to this toolset: +1. Check the debug logs for detailed error messages +2. Verify Azure Monitor workspace is properly configured +3. Test Azure credentials and permissions +4. Consult the main HolmesGPT documentation + +For Azure Monitor managed Prometheus issues: +- Azure Monitor documentation +- Azure support channels +- AKS monitoring troubleshooting guides diff --git a/holmes/plugins/toolsets/azuremonitor_metrics/utils.py b/holmes/plugins/toolsets/azuremonitor_metrics/utils.py new file mode 100644 index 000000000..4eb02d7a5 --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitor_metrics/utils.py @@ -0,0 +1,714 @@ +"""Utility functions for Azure Monitor Metrics toolset.""" + +import json +import logging +import re +import os +from typing import Dict, Optional, Tuple + +from azure.core.exceptions import AzureError +from azure.identity import DefaultAzureCredential +from azure.mgmt.resource import ResourceManagementClient +import requests + + +def get_aks_cluster_resource_id() -> Optional[str]: + """ + Get the Azure resource ID of the current AKS cluster. 
+ + Returns: + str: The full Azure resource ID of the AKS cluster if found, None otherwise + """ + # First try kubectl-based detection (most reliable for AKS) + cluster_resource_id = get_aks_cluster_id_from_kubectl() + if cluster_resource_id: + return cluster_resource_id + + try: + # Try to get cluster info from Azure Instance Metadata Service + metadata_url = "http://169.254.169.254/metadata/instance?api-version=2021-02-01" + headers = {"Metadata": "true"} + + response = requests.get(metadata_url, headers=headers, timeout=5) + if response.status_code == 200: + metadata = response.json() + compute = metadata.get("compute", {}) + + # Extract subscription ID and resource group from metadata + subscription_id = compute.get("subscriptionId") + resource_group = compute.get("resourceGroupName") + + if subscription_id and resource_group: + # Try to find AKS cluster in the resource group + credential = DefaultAzureCredential() + resource_client = ResourceManagementClient(credential, subscription_id) + + # Look for AKS clusters in the resource group + resources = resource_client.resources.list_by_resource_group( + resource_group_name=resource_group, + filter="resourceType eq 'Microsoft.ContainerService/managedClusters'" + ) + + for resource in resources: + # Return the first AKS cluster found + return resource.id + + except Exception as e: + logging.debug(f"Failed to get AKS cluster resource ID from metadata: {e}") + + try: + # Fallback: Try to get cluster info from Kubernetes environment + # Check if we're running in a Kubernetes pod with service account + with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f: + namespace = f.read().strip() + + # This is a best effort - we're in Kubernetes but need to determine the cluster + # We'll need to use Azure Resource Graph to find clusters + logging.debug("Running in Kubernetes, attempting to find AKS cluster via Azure Resource Graph") + + # Use Azure Resource Graph to find AKS clusters + credential = DefaultAzureCredential() + + # Get all subscriptions the credential has access to + subscriptions = get_accessible_subscriptions(credential) + + for subscription_id in subscriptions: + try: + resource_client = ResourceManagementClient(credential, subscription_id) + resources = resource_client.resources.list( + filter="resourceType eq 'Microsoft.ContainerService/managedClusters'" + ) + + for resource in resources: + # Return the first AKS cluster found + # In a real scenario, we might need better logic to identify the correct cluster + return resource.id + + except Exception as e: + logging.debug(f"Failed to query subscription {subscription_id}: {e}") + continue + + except Exception as e: + logging.debug(f"Failed to get AKS cluster resource ID from Kubernetes: {e}") + + return None + +def get_aks_cluster_id_from_kubectl() -> Optional[str]: + """ + Get AKS cluster resource ID using kubectl and Azure CLI. 
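+
+    Tries several strategies in order: cluster-name extraction from the kubectl
+    context name, parsing the API server URL for the azmk8s.io pattern,
+    AKS-specific node labels, and finally matching the current context against
+    `az aks list` output.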
+ + Returns: + str: The full Azure resource ID of the AKS cluster if found, None otherwise + """ + try: + import subprocess + + # Check if kubectl is available and connected to a cluster + try: + result = subprocess.run( + ["kubectl", "config", "current-context"], + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode != 0: + logging.debug("kubectl not connected to a cluster") + return None + + current_context = result.stdout.strip() + logging.debug(f"Current kubectl context: {current_context}") + + except Exception as e: + logging.debug(f"Failed to get kubectl context: {e}") + return None + + # First try: Enhanced cluster name extraction with multiple strategies + try: + # Try multiple context parsing strategies to handle different naming conventions + potential_cluster_names = [] + + # Strategy 1: Underscore-separated (typical AKS managed identity contexts) + if '_' in current_context: + context_parts = current_context.split('_') + logging.debug(f"Context parts (underscore): {context_parts}") + if len(context_parts) >= 2: + potential_cluster_names.append(context_parts[-1]) # Last part + if len(context_parts) >= 3: + potential_cluster_names.append(context_parts[-2]) # Second to last + + # Strategy 2: Direct context name (often the cluster name itself) + potential_cluster_names.append(current_context) + + # Strategy 3: Hyphen-separated (some naming conventions) + if '-' in current_context: + context_parts = current_context.split('-') + logging.debug(f"Context parts (hyphen): {context_parts}") + # Add variations of hyphen-separated parts + if len(context_parts) >= 2: + potential_cluster_names.append('-'.join(context_parts[:-1])) # All but last + potential_cluster_names.append(context_parts[0]) # First part + + # Remove duplicates while preserving order + seen = set() + unique_names = [] + for name in potential_cluster_names: + if name not in seen: + seen.add(name) + unique_names.append(name) + + logging.debug(f"Potential cluster names to try: {unique_names}") + + # Try each potential cluster name + for potential_cluster_name in unique_names: + logging.debug(f"Searching for cluster '{potential_cluster_name}' via Azure CLI...") + result = subprocess.run( + ["az", "aks", "list", "--query", f"[?name=='{potential_cluster_name}'].id", "-o", "tsv"], + capture_output=True, + text=True, + timeout=30 + ) + + if result.returncode == 0 and result.stdout.strip(): + cluster_ids = result.stdout.strip().split('\n') + if cluster_ids and cluster_ids[0]: + cluster_id = cluster_ids[0] + logging.debug(f"Found AKS cluster from context name '{potential_cluster_name}': {cluster_id}") + return cluster_id + else: + logging.debug(f"No cluster found with name '{potential_cluster_name}'") + + logging.debug(f"No AKS clusters found for any potential names from context '{current_context}'") + + except Exception as e: + logging.debug(f"Failed to get cluster from context name: {e}") + + # Second try: Get cluster server URL and parse it + try: + result = subprocess.run( + ["kubectl", "config", "view", "--minify", "--output", "jsonpath={.clusters[].cluster.server}"], + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode == 0: + server_url = result.stdout.strip() + logging.debug(f"Cluster server URL: {server_url}") + + # AKS cluster URLs typically look like: + # https://myakscluster-12345678.hcp.eastus.azmk8s.io:443 + # Extract potential cluster name from URL + import re + aks_pattern = r'https://([^-]+)-[^.]+\.hcp\.([^.]+)\.azmk8s\.io' + match = re.match(aks_pattern, server_url) + + 
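+            # A successful match captures the cluster name prefix and region,
+            # which is enough to resolve the full resource ID via Azure CLI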
if match: + cluster_name = match.group(1) + region = match.group(2) + logging.debug(f"Detected AKS cluster name: {cluster_name}, region: {region}") + + # Now try to find the full resource ID using Azure CLI + cluster_resource_id = find_aks_cluster_by_name_and_region(cluster_name, region) + if cluster_resource_id: + return cluster_resource_id + + except Exception as e: + logging.debug(f"Failed to parse cluster server URL: {e}") + + # Third try: Get nodes and look for Azure-specific labels + try: + result = subprocess.run( + ["kubectl", "get", "nodes", "-o", "jsonpath={.items[0].metadata.labels}"], + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode == 0: + node_labels = result.stdout.strip() + logging.debug(f"Node labels: {node_labels}") + + # Look for AKS-specific labels + # AKS nodes typically have labels like: + # kubernetes.azure.com/cluster: cluster-resource-id + # agentpool: nodepool-name + # kubernetes.io/hostname: aks-nodepool-12345-vmss000000 + + import json + try: + labels = json.loads(node_labels) + + # Check for cluster resource ID in labels + cluster_id = labels.get("kubernetes.azure.com/cluster") + if cluster_id and not cluster_id.startswith("MC_"): + # Ensure it's not a node resource group name + logging.debug(f"Found cluster resource ID in node labels: {cluster_id}") + return cluster_id + + # Try to extract cluster name from hostname + hostname = labels.get("kubernetes.io/hostname", "") + if "aks-" in hostname: + # Extract cluster info from hostname pattern + # Hostname format: aks-nodepool-12345-vmss000000 + parts = hostname.split("-") + if len(parts) >= 3: + # This is a fallback - we'd need more info to build full resource ID + logging.debug(f"Detected AKS node hostname pattern: {hostname}") + + except json.JSONDecodeError: + logging.debug("Failed to parse node labels as JSON") + + except Exception as e: + logging.debug(f"Failed to get node information: {e}") + + # Fourth try: Try getting cluster info directly using az aks get-credentials output + try: + # Get all clusters and try to match with current context + result = subprocess.run( + ["az", "aks", "list", "--query", "[].{name:name,resourceGroup:resourceGroup,id:id}", "-o", "json"], + capture_output=True, + text=True, + timeout=30 + ) + + if result.returncode == 0: + clusters = json.loads(result.stdout) + for cluster in clusters: + cluster_name = cluster.get('name', '') + resource_group = cluster.get('resourceGroup', '') + cluster_id = cluster.get('id', '') + + # Check if current context matches this cluster + if (cluster_name in current_context or + resource_group in current_context or + current_context.endswith(cluster_name)): + logging.debug(f"Found matching AKS cluster: {cluster_id}") + return cluster_id + + except Exception as e: + logging.debug(f"Failed to get cluster list: {e}") + + except Exception as e: + logging.debug(f"Failed to get AKS cluster ID from kubectl: {e}") + + return None + +def find_aks_cluster_by_name_and_region(cluster_name: str, region: str) -> Optional[str]: + """ + Find AKS cluster resource ID by name and region using Azure CLI. 
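+
+    Falls back from an exact name-and-region match to a name-only match, and
+    finally to a partial-name match against the current kubectl context.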
+ + Args: + cluster_name: Name of the AKS cluster + region: Azure region of the cluster + + Returns: + str: Full Azure resource ID if found, None otherwise + """ + try: + import subprocess + + # Try to find the cluster using Azure CLI + result = subprocess.run( + ["az", "aks", "list", "--query", f"[?name=='{cluster_name}' && location=='{region}'].id", "-o", "tsv"], + capture_output=True, + text=True, + timeout=30 + ) + + if result.returncode == 0 and result.stdout.strip(): + cluster_id = result.stdout.strip() + logging.debug(f"Found AKS cluster via Azure CLI: {cluster_id}") + return cluster_id + + # If exact match fails, try searching by name only + result = subprocess.run( + ["az", "aks", "list", "--query", f"[?name=='{cluster_name}'].id", "-o", "tsv"], + capture_output=True, + text=True, + timeout=30 + ) + + if result.returncode == 0 and result.stdout.strip(): + cluster_ids = result.stdout.strip().split('\n') + if len(cluster_ids) == 1: + cluster_id = cluster_ids[0] + logging.debug(f"Found AKS cluster by name via Azure CLI: {cluster_id}") + return cluster_id + elif len(cluster_ids) > 1: + logging.debug(f"Multiple clusters found with name {cluster_name}, using first one") + return cluster_ids[0] + + # If no match by name, try getting current kubectl context cluster + # This handles cases where the detected name might not match exactly + result = subprocess.run( + ["kubectl", "config", "current-context"], + capture_output=True, + text=True, + timeout=10 + ) + + if result.returncode == 0: + context_name = result.stdout.strip() + logging.debug(f"Trying to find cluster for kubectl context: {context_name}") + + # Try to find cluster that matches the current context + # Sometimes context names contain cluster names + if cluster_name in context_name: + result = subprocess.run( + ["az", "aks", "list", "--query", f"[?contains(name, '{cluster_name}')].id", "-o", "tsv"], + capture_output=True, + text=True, + timeout=30 + ) + + if result.returncode == 0 and result.stdout.strip(): + cluster_ids = result.stdout.strip().split('\n') + if cluster_ids: + cluster_id = cluster_ids[0] + logging.debug(f"Found AKS cluster by partial name match: {cluster_id}") + return cluster_id + + except Exception as e: + logging.debug(f"Failed to find AKS cluster via Azure CLI: {e}") + + return None + + +def get_accessible_subscriptions(credential) -> list[str]: + """ + Get list of subscription IDs that the credential has access to. + + Args: + credential: Azure credential object + + Returns: + list[str]: List of subscription IDs + """ + try: + # This is a simplified approach - in practice you might want to use + # the Azure Management SDK to get subscriptions + + # For now, we'll try to get the default subscription + # This would need to be enhanced for multi-subscription scenarios + subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID") + if subscription_id: + return [subscription_id] + + # If no explicit subscription, try to get from Azure CLI config + try: + import subprocess + result = subprocess.run( + ["az", "account", "show", "--query", "id", "-o", "tsv"], + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode == 0 and result.stdout.strip(): + return [result.stdout.strip()] + except Exception: + pass + + except Exception as e: + logging.debug(f"Failed to get accessible subscriptions: {e}") + + return [] + + +def extract_cluster_name_from_resource_id(resource_id: str) -> Optional[str]: + """ + Extract the cluster name from an Azure resource ID. 
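+
+    For example, "/subscriptions/SUB/resourceGroups/RG/providers/
+    Microsoft.ContainerService/managedClusters/my-cluster" yields "my-cluster".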
+ + Args: + resource_id: Full Azure resource ID + + Returns: + str: Cluster name if extracted successfully, None otherwise + """ + try: + # Azure resource ID format: + # /subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.ContainerService/managedClusters/{cluster-name} + parts = resource_id.split("/") + if len(parts) >= 9 and parts[-2] == "managedClusters": + return parts[-1] + except Exception as e: + logging.debug(f"Failed to extract cluster name from resource ID {resource_id}: {e}") + + return None + + +def check_if_running_in_aks() -> bool: + """ + Check if the current environment is running inside an AKS cluster. + + Returns: + bool: True if running in AKS, False otherwise + """ + try: + # Check for Kubernetes service account + if os.path.exists("/var/run/secrets/kubernetes.io/serviceaccount/token"): + # Check if we can access Azure Instance Metadata Service + metadata_url = "http://169.254.169.254/metadata/instance?api-version=2021-02-01" + headers = {"Metadata": "true"} + + response = requests.get(metadata_url, headers=headers, timeout=5) + if response.status_code == 200: + metadata = response.json() + # Check if we're running on Azure (which combined with Kubernetes suggests AKS) + if metadata.get("compute", {}).get("provider") == "Microsoft.Compute": + return True + except Exception as e: + logging.debug(f"Failed to check if running in AKS: {e}") + + return False + + +def execute_azure_resource_graph_query(query: str, subscription_id: str) -> Optional[Dict]: + """ + Execute an Azure Resource Graph query. + + Args: + query: The Azure Resource Graph query to execute + subscription_id: The subscription ID to query + + Returns: + dict: Query results if successful, None otherwise + """ + try: + from azure.mgmt.resourcegraph import ResourceGraphClient + from azure.mgmt.resourcegraph.models import QueryRequest + + credential = DefaultAzureCredential() + + # Create Resource Graph client + graph_client = ResourceGraphClient(credential) + + # Create the query request + query_request = QueryRequest( + query=query, + subscriptions=[subscription_id] + ) + + # Execute the query + query_response = graph_client.resources(query_request) + + if query_response and hasattr(query_response, 'data'): + return { + "data": query_response.data, + "total_records": getattr(query_response, 'total_records', 0), + "count": getattr(query_response, 'count', 0) + } + + except ImportError: + logging.warning("azure-mgmt-resourcegraph package not available. Install it with: pip install azure-mgmt-resourcegraph") + return None + except AzureError as e: + logging.error(f"Azure error executing Resource Graph query: {e}") + except Exception as e: + logging.error(f"Unexpected error executing Resource Graph query: {e}") + + return None + + +def get_azure_monitor_workspace_for_cluster(cluster_resource_id: str) -> Optional[Dict]: + """ + Get Azure Monitor workspace details for a given AKS cluster using Azure Resource Graph. 
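+
+    Runs an Azure Resource Graph query that finds data collection rules with a
+    Microsoft-PrometheusMetrics flow, joins them to the cluster's data
+    collection rule associations, and resolves the workspace's Prometheus
+    query endpoint plus any associated Grafana instances.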
+ + Args: + cluster_resource_id: Full Azure resource ID of the AKS cluster + + Returns: + dict: Azure Monitor workspace details if found, None otherwise + """ + try: + # Extract subscription ID from cluster resource ID + parts = cluster_resource_id.split("/") + if len(parts) >= 3: + subscription_id = parts[2] + else: + logging.error(f"Invalid cluster resource ID format: {cluster_resource_id}") + return None + + # The ARG query from the requirements, parameterized + query = f""" + resources + | where type == "microsoft.insights/datacollectionrules" + | extend ma = properties.destinations.monitoringAccounts + | extend flows = properties.dataFlows + | mv-expand flows + | where flows.streams contains "Microsoft-PrometheusMetrics" + | mv-expand ma + | where array_index_of(flows.destinations, tostring(ma.name)) != -1 + | project dcrId = tolower(id), azureMonitorWorkspaceResourceId=tolower(tostring(ma.accountResourceId)) + | join (insightsresources | extend clusterId = split(tolower(id), '/providers/microsoft.insights/datacollectionruleassociations/')[0] | where clusterId =~ "{cluster_resource_id.lower()}" | project clusterId = tostring(clusterId), dcrId = tolower(tostring(parse_json(properties).dataCollectionRuleId)), dcraName = name) on dcrId + | join kind=leftouter (resources | where type == "microsoft.monitor/accounts" | extend prometheusQueryEndpoint=tostring(properties.metrics.prometheusQueryEndpoint) | extend amwLocation = location | project azureMonitorWorkspaceResourceId=tolower(id), prometheusQueryEndpoint, amwLocation) on azureMonitorWorkspaceResourceId + | project-away dcrId1, azureMonitorWorkspaceResourceId1 + | join kind=leftouter (resources | where type == "microsoft.dashboard/grafana" | extend amwIntegrations = properties.grafanaIntegrations.azureMonitorWorkspaceIntegrations | mv-expand amwIntegrations | extend azureMonitorWorkspaceResourceId = tolower(tostring(amwIntegrations.azureMonitorWorkspaceResourceId)) | where azureMonitorWorkspaceResourceId != "" | extend grafanaObject = pack("grafanaResourceId", tolower(id), "grafanaWorkspaceName", name, "grafanaEndpoint", properties.endpoint) | summarize associatedGrafanas=make_list(grafanaObject) by azureMonitorWorkspaceResourceId) on azureMonitorWorkspaceResourceId + | extend amwToGrafana = pack("azureMonitorWorkspaceResourceId", azureMonitorWorkspaceResourceId, "prometheusQueryEndpoint", prometheusQueryEndpoint, "amwLocation", amwLocation, "associatedGrafanas", associatedGrafanas) + | summarize amwToGrafanas=make_list(amwToGrafana) by dcrResourceId = dcrId, dcraName + | order by dcrResourceId + """ + + result = execute_azure_resource_graph_query(query, subscription_id) + + if result and result.get("data"): + data = result["data"] + if isinstance(data, list) and len(data) > 0: + # Take the first result + first_result = data[0] + amw_to_grafanas = first_result.get("amwToGrafanas", []) + + if amw_to_grafanas and len(amw_to_grafanas) > 0: + # Take the first Azure Monitor workspace + amw_info = amw_to_grafanas[0] + + prometheus_endpoint = amw_info.get("prometheusQueryEndpoint") + if prometheus_endpoint: + return { + "prometheus_query_endpoint": prometheus_endpoint, + "azure_monitor_workspace_resource_id": amw_info.get("azureMonitorWorkspaceResourceId"), + "location": amw_info.get("amwLocation"), + "associated_grafanas": amw_info.get("associatedGrafanas", []) + } + + logging.info(f"No Azure Monitor workspace found for cluster {cluster_resource_id}") + return None + + except Exception as e: + logging.error(f"Failed to get Azure Monitor 
workspace for cluster {cluster_resource_id}: {e}") + return None + + +def enhance_promql_with_cluster_filter(promql_query: str, cluster_name: str) -> str: + """ + Enhance a PromQL query to include cluster filtering. + + This function ensures that ALL metric selectors in a PromQL query include + a cluster filter to scope the query to a specific AKS cluster. + + Args: + promql_query: Original PromQL query + cluster_name: Name of the cluster to filter by + + Returns: + str: Enhanced PromQL query with cluster filtering on all metrics + """ + try: + logging.debug(f"Adding cluster filter for cluster '{cluster_name}' to query: {promql_query}") + + # Check if cluster filter already exists in the query + if f'cluster="{cluster_name}"' in promql_query or f"cluster='{cluster_name}'" in promql_query: + logging.debug("Cluster filter already present in query") + return promql_query + + # Define PromQL functions and keywords that should not be treated as metrics + promql_functions = { + 'rate', 'irate', 'sum', 'avg', 'max', 'min', 'count', 'stddev', 'stdvar', + 'increase', 'delta', 'idelta', 'by', 'without', 'on', 'ignoring', + 'group_left', 'group_right', 'offset', 'bool', 'and', 'or', 'unless', + 'histogram_quantile', 'abs', 'ceil', 'floor', 'round', 'sqrt', 'exp', + 'ln', 'log2', 'log10', 'sin', 'cos', 'tan', 'asin', 'acos', 'atan', + 'sinh', 'cosh', 'tanh', 'asinh', 'acosh', 'atanh', 'deg', 'rad', + 'm', 's', 'h', 'd', 'w', 'y', # time units + 'pod', 'node', 'instance', 'job', 'container', 'namespace' # common label names + } + + # More precise pattern that only matches actual metrics (not grouping labels or keywords) + # This pattern looks for metric names followed by either { or whitespace, but excludes: + # - Words followed immediately by ( (functions) + # - Words that appear after "by" or "without" (grouping labels) + # - Words inside parentheses after "by" or "without" + + def replace_metric(match): + metric_name = match.group(1) + labels_part = match.group(2) or "" + + # Skip PromQL functions and keywords + if metric_name.lower() in promql_functions: + return match.group(0) + + # Get context around the match to make better decisions + start_pos = match.start() + end_pos = match.end() + + # Look at what comes before this match + before_text = promql_query[:start_pos].strip() + after_text = promql_query[end_pos:].strip() + + # Skip if this metric is followed by an opening parenthesis (it's likely a function) + if after_text.startswith('('): + return match.group(0) + + # Skip if this appears to be in a grouping clause (after "by" or "without") + # Look for patterns like "by (pod, node)" or "by(pod)" + if before_text.endswith(' by') or before_text.endswith(' without') or before_text.endswith('by') or before_text.endswith('without'): + return match.group(0) + + # Skip if we're inside parentheses after by/without + # Find the last occurrence of "by" or "without" before this position + last_by = before_text.rfind(' by ') + last_without = before_text.rfind(' without ') + last_keyword_pos = max(last_by, last_without) + + if last_keyword_pos >= 0: + # Check if there's an opening parenthesis after the keyword and before our match + text_after_keyword = promql_query[last_keyword_pos:start_pos] + open_paren_pos = text_after_keyword.rfind('(') + close_paren_pos = text_after_keyword.rfind(')') + + # If we found an opening paren after the keyword and no closing paren, we're inside grouping + if open_paren_pos > close_paren_pos: + return match.group(0) + + # Skip time range indicators like [5m] + if metric_name 
in {'m', 's', 'h', 'd', 'w', 'y'} and after_text.startswith(']'):
+                return match.group(0)
+
+            # Skip single-letter variables that might be part of time ranges
+            if len(metric_name) == 1 and metric_name in 'mhdwy':
+                return match.group(0)
+
+            cluster_filter = f'cluster="{cluster_name}"'
+
+            if labels_part:
+                # Has existing labels
+                labels_content = labels_part[1:-1].strip()  # Remove { and }
+
+                # Check if a cluster filter already exists
+                if 'cluster=' in labels_content:
+                    return match.group(0)
+
+                if labels_content:
+                    # Add the cluster filter to the existing labels
+                    return f'{metric_name}{{{cluster_filter},{labels_content}}}'
+                else:
+                    # Empty labels {}, replace with the cluster filter
+                    return f'{metric_name}{{{cluster_filter}}}'
+            else:
+                # No labels, add the cluster filter
+                return f'{metric_name}{{{cluster_filter}}}'
+
+        # Pattern matches: metric_name optionally followed by {labels}
+        pattern = r'\b([a-zA-Z_:][a-zA-Z0-9_:]*)\s*(\{[^}]*\})?'
+
+        # Apply the transformation
+        enhanced_query = re.sub(pattern, replace_metric, promql_query)
+
+        # Validation: check that the transformation looks reasonable
+        open_braces = enhanced_query.count('{')
+        close_braces = enhanced_query.count('}')
+
+        if open_braces != close_braces:
+            logging.warning("Cluster filter enhancement created mismatched braces, reverting to original query")
+            return promql_query
+
+        # Additional check: make sure we actually added cluster filters
+        if cluster_name not in enhanced_query:
+            logging.warning("No cluster filter was added to the query")
+
+        logging.debug(f"Enhanced PromQL query: {promql_query} -> {enhanced_query}")
+        return enhanced_query
+
+    except Exception as e:
+        logging.warning(f"Failed to enhance PromQL query with cluster filter: {e}")
+        return promql_query
diff --git a/holmes/plugins/toolsets/azuremonitorlogs/__init__.py b/holmes/plugins/toolsets/azuremonitorlogs/__init__.py
new file mode 100644
index 000000000..51e32663c
--- /dev/null
+++ b/holmes/plugins/toolsets/azuremonitorlogs/__init__.py
@@ -0,0 +1 @@
+"""Azure Monitor Logs toolset for HolmesGPT."""
diff --git a/holmes/plugins/toolsets/azuremonitorlogs/azuremonitorlogs_instructions.jinja2 b/holmes/plugins/toolsets/azuremonitorlogs/azuremonitorlogs_instructions.jinja2
new file mode 100644
index 000000000..cf707e833
--- /dev/null
+++ b/holmes/plugins/toolsets/azuremonitorlogs/azuremonitorlogs_instructions.jinja2
@@ -0,0 +1,141 @@
+# Azure Monitor Logs (Container Insights) Integration Instructions
+
+## Overview
+The Azure Monitor Logs toolset detects Azure Monitor Container Insights configuration and provides Log Analytics workspace details for AKS cluster log analysis. It provides workspace configuration so that other tools (such as the Azure MCP server) can run KQL queries against the specific workspace detected by this toolset.
+
+!!! important "Explicit Enablement Required"
+    This toolset is **disabled by default**. Users must explicitly enable it with `azuremonitorlogs: enabled: true` in their configuration.
+
+## Key Capabilities
+1. **AKS Cluster Detection**: Automatically detect if running in an AKS cluster and get the cluster resource ID
+2. **Container Insights Detection**: Check if Azure Monitor Container Insights is enabled for the cluster
+3. **Workspace Discovery**: Extract Log Analytics workspace ID and full Azure resource ID
+4. **Stream Profiling**: Detect enabled log streams and map them to Log Analytics tables
+5.
**Azure MCP Guidance**: Provide loganalytics workspace details for another tool (Azure MCP server) so that it can run actual KQL queries + +## Important Distinctions +- **Detection Only**: This toolset only detects and provides workspace configuration details +- **No Direct Querying**: This toolset does NOT execute KQL queries against Log Analytics +- **Azure MCP Integration**: All KQL queries must be performed via Azure MCP server tool +- **Workspace Provider**: Provides workspace details that Azure MCP server needs for configuration + +## Tools Available +1. `check_aks_cluster_context` - Check if running in AKS cluster +2. `get_aks_cluster_resource_id` - Get full Azure resource ID of AKS cluster +3. `check_azure_monitor_logs_enabled` - Detect Container Insights and get workspace details +4. use azure mcp server to run KQL queries against the detected workspace + + + +### Workspace Configuration +- **Workspace ID** (GUID format): Used by Azure MCP server for KQL queries +- **Workspace Resource ID** (Full Azure path): Used for ARM API access and permissions +- **Cluster Filter**: Exact `_ResourceId` value for filtering KQL queries + +### Log Stream Information +- **Available Streams**: What log types are enabled (ContainerLogV2, KubePodInventory, etc.) +- **Table Mapping**: Maps streams to corresponding Log Analytics tables +- **Sample Queries**: Examples of properly filtered KQL queries + +## Critical KQL Query Requirements +**ALL KQL queries executed via Azure MCP server MUST include cluster filtering:** + +```kql +| where _ResourceId == "/subscriptions/.../clusters/your-cluster" +``` + +### Common Log Analytics Tables +Based on detected streams, queries can target: +- **ContainerLogV2**: Container stdout/stderr logs +- **KubePodInventory**: Pod metadata and status +- **KubeEvents**: Kubernetes events +- **KubeNodeInventory**: Node information +- **Perf**: Performance metrics +- **InsightsMetrics**: Additional metrics data + +## Workflow for Log Analysis +1. **Detection**: Use `check_azure_monitor_logs_enabled` to detect workspace +2. **Configuration**: Configure Azure MCP server with detected workspace details +3. **Querying**: Use Azure MCP server tools for actual KQL queries with cluster filtering + +## Special Case: Cost Optimization Analysis +**When asked about Container Insights cost optimization, IMMEDIATELY proceed to execute KQL queries using available tools:** + +1. **Detect workspace** using `check_azure_monitor_logs_enabled` +2. **IMMEDIATELY execute KQL queries** using any available KQL execution tool (azure mcp server or similar) +3. **Do NOT stop** after workspace detection - proceed to live cost analysis NOW +4. **Execute ALL 5 queries** from the Azure Monitor cost optimization runbook with detected workspace ID +5. **Generate real USD savings** calculations based on actual cluster data + +**MANDATORY EXECUTION**: For cost optimization requests, you MUST execute queries immediately after workspace detection. Do NOT provide "next steps" or ask for permission - execute the KQL queries now using any available KQL tool. + +**EXECUTE QUERIES IMMEDIATELY**: After detecting the workspace, you MUST immediately call the azure mcp tools. Do not say "Stand by" or "Executing queries" - actually call the tools now. + +**MANDATORY TOOL INVOCATION**: +1. Parse subscription ID from the workspace resource ID (extract from "/subscriptions/SUBSCRIPTION-ID/...") +2. 
Call "monitor workspace log query" tool with these parameters: + - --subscription: [parsed subscription ID] + - --workspace: [detected workspace GUID] + - --table-name: Usage (for cost queries) or ContainerLogV2 (for log analysis) + - --query: [the specific KQL query] +3. Execute each of the 5 cost optimization queries as separate tool calls +4. Process the results and DISPLAY the COMPLETE executive report with competitive analysis tables in the console +5. After displaying the full report, call generate_cost_optimization_pdf tool to create a PDF file with the same report content and provide download link + +**DUAL OUTPUT MANDATORY**: You must BOTH display the full report on screen AND generate the PDF file. Do not skip either step. + +**EXAMPLE TOOL CALL:** +``` +monitor workspace log query --subscription abc123-def456-789 --workspace ae2a62d2-afb3-4e81-a5b4-ca29215e1cd8 --table-name Usage --query "Usage | where TimeGenerated > ago(24h) | summarize TotalGB = sum(Quantity) / 1024 by DataType" +``` + +**CRITICAL**: When you say "Executing all 5 required queries now" - actually DO IT by calling the "monitor workspace log query" tool. Do not just announce it. + +## Example Usage Scenarios + +### Scenario 1: Initial Setup Detection +``` +User: "Is Azure Monitor logs enabled for this cluster?" +Response: Use check_azure_monitor_logs_enabled to detect Container Insights +``` + +### Scenario 2: Log Analysis Request +``` +User: "Show me container logs from the last hour" +Response: +1. First check workspace with check_azure_monitor_logs_enabled +2. Instruct user to use Azure MCP server with detected workspace +3. Provide sample KQL: ContainerLogV2 | where _ResourceId == "cluster-id" | where TimeGenerated > ago(1h) +``` + +### Scenario 3: Stream Availability Check +``` +User: "What log data is available for this cluster?" +Response: Use check_azure_monitor_logs_enabled to show available streams and tables +``` + +## Error Handling +- **No AKS Context**: Guide user to connect to AKS cluster +- **Container Insights Not Enabled**: Direct user to enable Container Insights in Azure portal +- **No Workspace Found**: Verify Container Insights configuration and permissions + +## Azure MCP Server Configuration Guidance +When workspace is detected, provide clear instructions: + +1. **Workspace Details**: Share detected workspace ID and resource ID +2. **Authentication**: Ensure Azure MCP server has proper credentials +3. **Cluster Filtering**: Emphasize mandatory `_ResourceId` filtering in all queries +4. **Table Availability**: List available Log Analytics tables based on detected streams + +## Sample Response Format +```json +{ + "azure_monitor_logs_enabled": true, + "log_analytics_workspace_id": "12345678-1234-1234-1234-123456789012", + "cluster_filter_kql": "| where _ResourceId == \"/subscriptions/.../clusters/my-cluster\"", + "available_log_tables": ["ContainerLogV2", "KubePodInventory", "KubeEvents"], + "azure_mcp_guidance": "Use/call into Azure MCP server with detected workspace details" +} +``` + +Remember: This toolset is the bridge between AKS cluster detection and Azure MCP server configuration for log analysis workflows. 
diff --git a/holmes/plugins/toolsets/azuremonitorlogs/azuremonitorlogs_toolset.py b/holmes/plugins/toolsets/azuremonitorlogs/azuremonitorlogs_toolset.py new file mode 100644 index 000000000..f5bbc8f11 --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitorlogs/azuremonitorlogs_toolset.py @@ -0,0 +1,392 @@ +"""Azure Monitor Logs toolset for HolmesGPT.""" + +import json +import logging +import os +import uuid +from datetime import datetime +from typing import Any, Dict, List, Optional, Tuple + +from pydantic import BaseModel, field_validator + +from holmes.core.tools import ( + CallablePrerequisite, + StructuredToolResult, + Tool, + ToolParameter, + ToolResultStatus, + Toolset, + ToolsetTag, +) + +from .utils import ( + check_if_running_in_aks, + extract_cluster_name_from_resource_id, + get_aks_cluster_resource_id, + get_container_insights_workspace_for_cluster, + generate_azure_mcp_guidance, + map_streams_to_log_analytics_tables, +) + +class AzureMonitorLogsConfig(BaseModel): + """Configuration for Azure Monitor Logs toolset.""" + cluster_name: Optional[str] = None + cluster_resource_id: Optional[str] = None + auto_detect_cluster: bool = True + log_analytics_workspace_id: Optional[str] = None + log_analytics_workspace_resource_id: Optional[str] = None + data_collection_rule_id: Optional[str] = None + data_collection_rule_association_name: Optional[str] = None + data_collection_settings: Optional[Dict] = None + enabled_log_streams: Optional[List[str]] = None + data_flows: Optional[List[Dict]] = None + stream_to_table_mapping: Optional[Dict[str, str]] = None + +class BaseAzureMonitorLogsTool(Tool): + """Base class for Azure Monitor Logs tools.""" + toolset: "AzureMonitorLogsToolset" + +class CheckAKSClusterContext(BaseAzureMonitorLogsTool): + """Tool to check if running in AKS cluster context.""" + + def __init__(self, toolset: "AzureMonitorLogsToolset"): + super().__init__( + name="check_aks_cluster_context", + description="Check if the current environment is running inside an AKS cluster", + parameters={}, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + is_aks = check_if_running_in_aks() + + data = { + "running_in_aks": is_aks, + "message": "Running in AKS cluster" if is_aks else "Not running in AKS cluster", + } + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to check AKS cluster context: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + return "Check if running in AKS cluster" + +class GetAKSClusterResourceID(BaseAzureMonitorLogsTool): + """Tool to get the Azure resource ID of the current AKS cluster.""" + + def __init__(self, toolset: "AzureMonitorLogsToolset"): + super().__init__( + name="get_aks_cluster_resource_id", + description="Get the full Azure resource ID of the current AKS cluster", + parameters={}, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + cluster_resource_id = get_aks_cluster_resource_id() + + if cluster_resource_id: + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + + data = { + "cluster_resource_id": cluster_resource_id, + "cluster_name": cluster_name, + "message": f"Found AKS cluster: {cluster_name}", + } + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + else: + return StructuredToolResult( + 
status=ToolResultStatus.ERROR, + error="Could not determine AKS cluster resource ID. Make sure you are running in an AKS cluster or have proper Azure credentials configured.", + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to get AKS cluster resource ID: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + return "Get AKS cluster Azure resource ID" + +class CheckAzureMonitorLogsEnabled(BaseAzureMonitorLogsTool): + """Tool to check if Azure Monitor Container Insights (logs) is enabled for the AKS cluster.""" + + def __init__(self, toolset: "AzureMonitorLogsToolset"): + super().__init__( + name="check_azure_monitor_logs_enabled", + description="Check if Azure Monitor Container Insights (logs) is enabled for the AKS cluster and get Log Analytics workspace details for Azure MCP server configuration", + parameters={ + "cluster_resource_id": ToolParameter( + description="Azure resource ID of the AKS cluster (optional, will auto-detect if not provided)", + type="string", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + cluster_resource_id = params.get("cluster_resource_id") + + # Auto-detect cluster resource ID if not provided + if not cluster_resource_id: + cluster_resource_id = get_aks_cluster_resource_id() + + if not cluster_resource_id: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="Could not determine AKS cluster resource ID. Please provide cluster_resource_id parameter or ensure you are running in an AKS cluster.", + params=params, + ) + + # Get Container Insights workspace details using ARG query + workspace_info = get_container_insights_workspace_for_cluster(cluster_resource_id) + + if workspace_info: + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + + # Generate Azure MCP guidance + mcp_guidance = generate_azure_mcp_guidance(workspace_info, cluster_resource_id) + + # Map extension streams to Log Analytics tables + extension_streams = workspace_info.get("extension_streams", []) + stream_to_table = map_streams_to_log_analytics_tables(extension_streams) + available_tables = list(stream_to_table.values()) + + data = { + "azure_monitor_logs_enabled": True, + "container_insights_enabled": True, + "cluster_resource_id": cluster_resource_id, + "cluster_name": cluster_name, + + # Log Analytics workspace details (primary information for Azure MCP) + "log_analytics_workspace_id": workspace_info.get("log_analytics_workspace_id"), + "log_analytics_workspace_resource_id": workspace_info.get("log_analytics_workspace_resource_id"), + + # Container Insights configuration details + "data_collection_rule_id": workspace_info.get("data_collection_rule_id"), + "data_collection_rule_association_name": workspace_info.get("data_collection_rule_association_name"), + "data_collection_settings": workspace_info.get("data_collection_settings"), + "extension_streams": extension_streams, + "data_flows": workspace_info.get("data_flows"), + + # Azure MCP integration guidance + "azure_mcp_configuration": mcp_guidance, + "available_log_tables": available_tables, + "stream_to_table_mapping": stream_to_table, + + "message": f"Azure Monitor Container Insights is enabled for cluster {cluster_name}. Use the workspace details to configure Azure MCP server for KQL queries." 
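+                    # The GUID workspace ID is what the Azure MCP server uses for
+                    # KQL queries; the full resource ID is used for ARM API access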
+ } + + # Update toolset configuration with discovered information + if self.toolset.config: + self.toolset.config.cluster_name = cluster_name + self.toolset.config.cluster_resource_id = cluster_resource_id + self.toolset.config.log_analytics_workspace_id = workspace_info.get("log_analytics_workspace_id") + self.toolset.config.log_analytics_workspace_resource_id = workspace_info.get("log_analytics_workspace_resource_id") + self.toolset.config.data_collection_rule_id = workspace_info.get("data_collection_rule_id") + self.toolset.config.enabled_log_streams = extension_streams + self.toolset.config.stream_to_table_mapping = stream_to_table + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + else: + cluster_name = extract_cluster_name_from_resource_id(cluster_resource_id) + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Azure Monitor Container Insights (logs) is not enabled for AKS cluster {cluster_name}. Please enable Container Insights in the Azure portal.", + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to check Azure Monitor logs status: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + cluster_id = params.get("cluster_resource_id", "auto-detect") + return f"Check Azure Monitor Container Insights status for cluster: {cluster_id}" + +class GenerateCostOptimizationReport(BaseAzureMonitorLogsTool): + """Tool to generate a Markdown report from cost optimization analysis content.""" + + def __init__(self, toolset: "AzureMonitorLogsToolset"): + super().__init__( + name="generate_cost_optimization_report", + description="Generate a Markdown report file from Azure Monitor cost optimization analysis content", + parameters={ + "report_content": ToolParameter( + description="The complete cost optimization report content in markdown format", + type="string", + required=True, + ), + "cluster_name": ToolParameter( + description="Name of the AKS cluster for the report filename (optional)", + type="string", + required=False, + ), + }, + toolset=toolset, + ) + + def _invoke(self, params: Any) -> StructuredToolResult: + try: + report_content = params.get("report_content", "") + cluster_name = params.get("cluster_name", "unknown-cluster") + + if not report_content: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error="Report content is required to generate Markdown", + params=params, + ) + + # Generate random filename with timestamp + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + random_suffix = str(uuid.uuid4())[:8] + filename = f"azure_monitor_cost_optimization_{cluster_name}_{timestamp}_{random_suffix}.md" + + # Create reports directory if it doesn't exist + reports_dir = "cost_optimization_reports" + os.makedirs(reports_dir, exist_ok=True) + + # Full file path + file_path = os.path.join(reports_dir, filename) + + # Prepare content with metadata + full_content = f"""# Azure Monitor Container Insights Cost Optimization Report + +**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S UTC")} +**Cluster**: {cluster_name} +**Report ID**: {random_suffix} + +--- + +{report_content} + +--- + +**Disclaimer**: This report is generated by HolmesGPT AI. All recommendations should be independently verified by Azure specialists before implementation. 
+""" + + # Write to file + with open(file_path, 'w', encoding='utf-8') as f: + f.write(full_content) + + # Get absolute path for the link + abs_file_path = os.path.abspath(file_path) + + data = { + "markdown_generated": True, + "filename": filename, + "file_path": abs_file_path, + "file_size_bytes": len(full_content.encode('utf-8')), + "cluster_name": cluster_name, + "timestamp": timestamp, + "report_id": random_suffix, + "download_link": f"file://{abs_file_path}", + "message": f"Cost optimization report saved to: {abs_file_path}" + } + + return StructuredToolResult( + status=ToolResultStatus.SUCCESS, + data=data, + params=params, + ) + + except Exception as e: + return StructuredToolResult( + status=ToolResultStatus.ERROR, + error=f"Failed to generate Markdown report: {str(e)}", + params=params, + ) + + def get_parameterized_one_liner(self, params) -> str: + cluster_name = params.get("cluster_name", "cluster") + return f"Generate cost optimization Markdown report for {cluster_name}" + +class AzureMonitorLogsToolset(Toolset): + """Azure Monitor Logs toolset for detecting Container Insights and Log Analytics workspace details.""" + + def __init__(self): + super().__init__( + name="azuremonitorlogs", + description="Azure Monitor Logs integration to detect Container Insights and provide Log Analytics workspace details for AKS cluster log analysis via Azure MCP server", + docs_url="https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/azuremonitor-logs.html", + icon_url="https://raw.githubusercontent.com/robusta-dev/holmesgpt/master/images/integration_logos/azure.png", + prerequisites=[CallablePrerequisite(callable=self.prerequisites_callable)], + tools=[ + CheckAKSClusterContext(toolset=self), + GetAKSClusterResourceID(toolset=self), + CheckAzureMonitorLogsEnabled(toolset=self), + GenerateCostOptimizationReport(toolset=self), + ], + tags=[ + ToolsetTag.CORE + ], + is_default=False, # Disabled by default - users must explicitly enable + ) + self._reload_llm_instructions() + + def _reload_llm_instructions(self): + """Load LLM instructions from Jinja template.""" + try: + template_file_path = os.path.abspath( + os.path.join(os.path.dirname(__file__), "azuremonitorlogs_instructions.jinja2") + ) + self._load_llm_instructions(jinja_template=f"file://{template_file_path}") + except Exception as e: + # Ignore any errors in loading instructions + logging.debug(f"Failed to load LLM instructions: {e}") + + def prerequisites_callable(self, config: dict[str, Any]) -> Tuple[bool, str]: + """Check prerequisites for the Azure Monitor Logs toolset.""" + try: + if not config: + self.config = AzureMonitorLogsConfig() + else: + self.config = AzureMonitorLogsConfig(**config) + + return True, "" + + except Exception as e: + logging.debug(f"Azure Monitor Logs toolset config initialization failed: {str(e)}") + self.config = AzureMonitorLogsConfig() + return True, "" + + def get_example_config(self) -> Dict[str, Any]: + """Return example configuration for the toolset.""" + example_config = AzureMonitorLogsConfig( + cluster_name="your-aks-cluster-name", + cluster_resource_id="/subscriptions/your-subscription/resourceGroups/your-rg/providers/Microsoft.ContainerService/managedClusters/your-cluster", + auto_detect_cluster=True, + log_analytics_workspace_id="your-workspace-guid", + log_analytics_workspace_resource_id="/subscriptions/your-subscription/resourcegroups/your-rg/providers/microsoft.operationalinsights/workspaces/your-workspace" + ) + return example_config.model_dump() diff --git 
a/holmes/plugins/toolsets/azuremonitorlogs/install.md b/holmes/plugins/toolsets/azuremonitorlogs/install.md new file mode 100644 index 000000000..bbc64eca1 --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitorlogs/install.md @@ -0,0 +1,177 @@ +# Azure Monitor Logs Toolset Installation Guide + +## Overview +The Azure Monitor Logs toolset detects Azure Monitor Container Insights configuration and provides Log Analytics workspace details for AKS cluster log analysis via external Azure MCP server. + +## Prerequisites + +### 1. Azure Dependencies +Install required Azure SDK packages: +```bash +pip install azure-identity azure-mgmt-resourcegraph azure-mgmt-resource +``` + +### 2. Azure Authentication +The toolset uses `DefaultAzureCredential` for authentication. Ensure one of the following is configured: + +#### Option A: Azure CLI (Recommended for development) +```bash +az login +az account set --subscription "your-subscription-id" +``` + +#### Option B: Managed Identity (Recommended for production in AKS) +When running in AKS, configure managed identity with appropriate permissions. + +#### Option C: Service Principal (Alternative) +```bash +export AZURE_CLIENT_ID="your-client-id" +export AZURE_CLIENT_SECRET="your-client-secret" +export AZURE_TENANT_ID="your-tenant-id" +``` + +### 3. Required Azure Permissions +The authentication principal needs these permissions: + +- **Reader** role on the AKS cluster resource +- **Reader** role on the Log Analytics workspace +- **Reader** role on Data Collection Rules (for Container Insights detection) + +### 4. AKS Cluster Requirements +- AKS cluster with Azure Monitor Container Insights enabled +- Access to kubectl configured for the target cluster (for auto-detection) + +## Configuration + +!!! warning "Explicit Enablement Required" + The Azure Monitor Logs toolset is **disabled by default**. You must explicitly enable it in your configuration. + +### Basic Configuration (Required) +```yaml +toolsets: + azuremonitorlogs: + enabled: true # Required - toolset is disabled by default + auto_detect_cluster: true +``` + +### Advanced Configuration +```yaml +toolsets: + azuremonitorlogs: + enabled: true + auto_detect_cluster: true + cluster_name: "my-aks-cluster" + cluster_resource_id: "/subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-cluster" + log_analytics_workspace_id: "87654321-4321-4321-4321-210987654321" + log_analytics_workspace_resource_id: "/subscriptions/12345678-1234-1234-1234-123456789012/resourcegroups/my-rg/providers/microsoft.operationalinsights/workspaces/my-workspace" +``` + +## Verification + +### 1. Check Toolset Status +```bash +holmes toolset list | grep azuremonitorlogs +``` + +Expected output: +``` +azuremonitorlogs │ True │ enabled │ built-in │ │ +``` + +### 2. Test AKS Detection +```bash +holmes ask "is this an aks cluster?" +``` + +### 3. Test Container Insights Detection +```bash +holmes ask "is azure monitor logs enabled for this cluster?" +``` + +Expected response should include: +- Container Insights status +- Log Analytics workspace details +- Available log streams +- Azure MCP configuration guidance + +## Troubleshooting + +### Common Issues + +#### 1. Authentication Failures +``` +Error: DefaultAzureCredential failed to retrieve a token +``` + +**Solution**: Ensure Azure authentication is properly configured (see Prerequisites section). + +#### 2. 
No AKS Cluster Detected +``` +Error: Could not determine AKS cluster resource ID +``` + +**Solutions**: +- Ensure kubectl is connected to an AKS cluster: `kubectl config current-context` +- Verify Azure CLI is logged in: `az account show` +- Manually specify cluster resource ID in configuration + +#### 3. Container Insights Not Found +``` +Error: Azure Monitor Container Insights (logs) is not enabled +``` + +**Solutions**: +- Enable Container Insights in Azure portal for the AKS cluster +- Verify Data Collection Rules are properly configured +- Check Azure Resource Graph permissions + +#### 4. Missing Dependencies +``` +ImportError: No module named 'azure.mgmt.resourcegraph' +``` + +**Solution**: Install required packages: +```bash +pip install azure-mgmt-resourcegraph +``` + +### Debug Mode +Enable debug logging to troubleshoot issues: +```bash +export HOLMES_LOG_LEVEL=DEBUG +holmes ask "check azure monitor logs status" +``` + +## Azure MCP Server Integration + +Once the Azure Monitor Logs toolset detects your workspace, configure Azure MCP server: + +### 1. Azure MCP Server Installation +Follow Azure MCP server documentation for installation and configuration. + +### 2. Workspace Configuration +Use the workspace details provided by this toolset: +```json +{ + "workspace_id": "detected-workspace-guid", + "workspace_resource_id": "/subscriptions/.../workspaces/workspace-name", + "cluster_filter": "| where _ResourceId == \"/subscriptions/.../clusters/cluster-name\"" +} +``` + +### 3. Required KQL Query Filtering +ALL KQL queries via Azure MCP server MUST include cluster filtering: +```kql +ContainerLogV2 +| where _ResourceId == "/subscriptions/.../clusters/your-cluster" +| where TimeGenerated > ago(1h) +``` + +## Support +For issues specific to this toolset, check: +1. Azure authentication and permissions +2. AKS cluster connectivity +3. Container Insights configuration +4. Azure Resource Graph API access + +For Azure MCP server issues, refer to the Azure MCP server documentation. diff --git a/holmes/plugins/toolsets/azuremonitorlogs/utils.py b/holmes/plugins/toolsets/azuremonitorlogs/utils.py new file mode 100644 index 000000000..1493771cc --- /dev/null +++ b/holmes/plugins/toolsets/azuremonitorlogs/utils.py @@ -0,0 +1,480 @@ +"""Utility functions for Azure Monitor Logs toolset.""" + +import json +import logging +import os +import re +import subprocess +from typing import Dict, List, Optional + +from azure.core.exceptions import AzureError +from azure.identity import DefaultAzureCredential +from azure.mgmt.resourcegraph import ResourceGraphClient +from azure.mgmt.resourcegraph.models import QueryRequest +import requests + +def check_if_running_in_aks() -> bool: + """ + Check if the current environment is running inside an AKS cluster. 
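+
+    Illustrative usage (a sketch, not a guarantee: outside a cluster both the
+    service-account token check and the Azure IMDS probe fail, so the call
+    returns False):
+
+        >>> check_if_running_in_aks()  # doctest: +SKIP
+        False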
+ + Returns: + bool: True if running in AKS, False otherwise + """ + try: + # Check for Kubernetes service account + if os.path.exists("/var/run/secrets/kubernetes.io/serviceaccount/token"): + # Check if we can access Azure Instance Metadata Service + metadata_url = "http://169.254.169.254/metadata/instance?api-version=2021-02-01" + headers = {"Metadata": "true"} + + response = requests.get(metadata_url, headers=headers, timeout=5) + if response.status_code == 200: + metadata = response.json() + # Check if we're running on Azure (which combined with Kubernetes suggests AKS) + if metadata.get("compute", {}).get("provider") == "Microsoft.Compute": + return True + except Exception as e: + logging.debug(f"Failed to check if running in AKS: {e}") + + return False + +def get_aks_cluster_resource_id() -> Optional[str]: + """ + Get the Azure resource ID of the current AKS cluster. + + Returns: + str: The full Azure resource ID of the AKS cluster if found, None otherwise + """ + logging.info("Starting AKS cluster resource ID detection...") + + # First try kubectl-based detection (most reliable for AKS) + logging.info("Attempting kubectl-based detection...") + cluster_resource_id = get_aks_cluster_id_from_kubectl() + if cluster_resource_id: + logging.info(f"Successfully found cluster via kubectl: {cluster_resource_id}") + return cluster_resource_id + else: + logging.warning("kubectl-based detection failed") + + try: + # Try to get cluster info from Azure Instance Metadata Service + logging.info("Attempting Azure Instance Metadata Service detection...") + metadata_url = "http://169.254.169.254/metadata/instance?api-version=2021-02-01" + headers = {"Metadata": "true"} + + response = requests.get(metadata_url, headers=headers, timeout=5) + if response.status_code == 200: + metadata = response.json() + compute = metadata.get("compute", {}) + + # Extract subscription ID and resource group from metadata + subscription_id = compute.get("subscriptionId") + resource_group = compute.get("resourceGroupName") + + logging.info(f"Azure metadata - subscription: {subscription_id}, resource_group: {resource_group}") + + if subscription_id and resource_group: + # Try to find AKS cluster in the resource group + logging.info("Attempting to find AKS cluster via Azure Resource Management API...") + credential = DefaultAzureCredential() + from azure.mgmt.resource import ResourceManagementClient + resource_client = ResourceManagementClient(credential, subscription_id) + + # Look for AKS clusters in the resource group + resources = resource_client.resources.list_by_resource_group( + resource_group_name=resource_group, + filter="resourceType eq 'Microsoft.ContainerService/managedClusters'" + ) + + for resource in resources: + # Return the first AKS cluster found + logging.info(f"Found AKS cluster via metadata approach: {resource.id}") + return resource.id + else: + logging.warning("Could not extract subscription ID or resource group from Azure metadata") + else: + logging.warning(f"Azure metadata service returned status code: {response.status_code}") + + except Exception as e: + logging.warning(f"Failed to get AKS cluster resource ID from metadata: {e}") + + logging.error("All AKS cluster detection methods failed") + return None + +def get_aks_cluster_id_from_kubectl() -> Optional[str]: + """ + Get AKS cluster resource ID using kubectl and Azure CLI. 
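+
+    Illustrative usage (assumes kubectl and the Azure CLI are installed, logged
+    in, and pointed at the same subscription as the cluster; the resource ID
+    below is a made-up placeholder):
+
+        >>> get_aks_cluster_id_from_kubectl()  # doctest: +SKIP
+        '/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>'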
+ + Returns: + str: The full Azure resource ID of the AKS cluster if found, None otherwise + """ + try: + # Check if kubectl is available and connected to a cluster + try: + logging.info("Checking kubectl context...") + result = subprocess.run( + ["kubectl", "config", "current-context"], + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode != 0: + logging.warning(f"kubectl not connected to a cluster. Return code: {result.returncode}, stderr: {result.stderr}") + return None + + current_context = result.stdout.strip() + logging.info(f"Current kubectl context: {current_context}") + + except Exception as e: + logging.warning(f"Failed to get kubectl context: {e}") + return None + + # Try to extract cluster name from kubectl context and find via Azure CLI + try: + logging.info("Attempting to extract cluster name from kubectl context...") + + # Try multiple context parsing strategies + potential_cluster_names = [] + + # Strategy 1: Underscore-separated (typical AKS managed identity contexts) + if '_' in current_context: + context_parts = current_context.split('_') + logging.info(f"Context parts (underscore): {context_parts}") + if len(context_parts) >= 2: + potential_cluster_names.append(context_parts[-1]) # Last part + if len(context_parts) >= 3: + potential_cluster_names.append(context_parts[-2]) # Second to last + + # Strategy 2: Direct context name (often the cluster name itself) + potential_cluster_names.append(current_context) + + # Strategy 3: Hyphen-separated (some naming conventions) + if '-' in current_context: + # Try the full name first, then parts + context_parts = current_context.split('-') + logging.info(f"Context parts (hyphen): {context_parts}") + # Add variations of hyphen-separated parts + if len(context_parts) >= 2: + potential_cluster_names.append('-'.join(context_parts[:-1])) # All but last + potential_cluster_names.append(context_parts[0]) # First part + + # Remove duplicates while preserving order + seen = set() + unique_names = [] + for name in potential_cluster_names: + if name not in seen: + seen.add(name) + unique_names.append(name) + + logging.info(f"Potential cluster names to try: {unique_names}") + + # Try each potential cluster name + for potential_cluster_name in unique_names: + logging.info(f"Searching for cluster '{potential_cluster_name}' via Azure CLI...") + result = subprocess.run( + ["az", "aks", "list", "--query", f"[?name=='{potential_cluster_name}'].id", "-o", "tsv"], + capture_output=True, + text=True, + timeout=30 + ) + + logging.info(f"Azure CLI result for '{potential_cluster_name}' - return code: {result.returncode}, stdout: '{result.stdout.strip()}', stderr: '{result.stderr.strip()}'") + + if result.returncode == 0 and result.stdout.strip(): + cluster_ids = result.stdout.strip().split('\n') + if cluster_ids and cluster_ids[0]: + cluster_id = cluster_ids[0] + logging.info(f"Found AKS cluster from context name '{potential_cluster_name}': {cluster_id}") + return cluster_id + else: + logging.info(f"No cluster found with name '{potential_cluster_name}'") + + logging.warning(f"No AKS clusters found for any potential names from context '{current_context}'") + + except Exception as e: + logging.warning(f"Failed to get cluster from context name: {e}") + + # Fallback: Get cluster server URL and parse it + try: + logging.info("Attempting to get cluster server URL...") + result = subprocess.run( + ["kubectl", "config", "view", "--minify", "--output", "jsonpath={.clusters[].cluster.server}"], + capture_output=True, + text=True, + timeout=10 
+            )
+            if result.returncode == 0:
+                server_url = result.stdout.strip()
+                logging.info(f"Cluster server URL: {server_url}")
+
+                # AKS cluster URL pattern: https://<dns-prefix>-<hash>.hcp.<region>.azmk8s.io:443
+                # e.g. https://myakscluster-12345678.hcp.eastus.azmk8s.io:443
+                # The DNS prefix may itself contain hyphens, so capture greedily up to
+                # the final hyphen-separated hash segment.
+                aks_pattern = r'https://(.+)-[^-.]+\.hcp\.([^.]+)\.azmk8s\.io'
+                match = re.match(aks_pattern, server_url)
+
+                if match:
+                    cluster_name = match.group(1)
+                    region = match.group(2)
+                    logging.info(f"Detected AKS cluster name: {cluster_name}, region: {region}")
+
+                    # Find the full resource ID using Azure CLI
+                    return find_aks_cluster_by_name_and_region(cluster_name, region)
+                else:
+                    logging.warning(f"Server URL '{server_url}' does not match AKS pattern")
+            else:
+                logging.warning(f"Failed to get cluster server URL. Return code: {result.returncode}")
+
+        except Exception as e:
+            logging.warning(f"Failed to parse cluster server URL: {e}")
+
+    except Exception as e:
+        logging.error(f"Failed to get AKS cluster ID from kubectl: {e}")
+
+    return None
+
+
+def find_aks_cluster_by_name_and_region(cluster_name: str, region: str) -> Optional[str]:
+    """
+    Find AKS cluster resource ID by name and region using Azure CLI.
+
+    Args:
+        cluster_name: Name of the AKS cluster
+        region: Azure region of the cluster
+
+    Returns:
+        str: Full Azure resource ID if found, None otherwise
+    """
+    try:
+        # Try to find the cluster using Azure CLI
+        logging.info(f"Searching for cluster '{cluster_name}' in region '{region}' via Azure CLI...")
+        result = subprocess.run(
+            ["az", "aks", "list", "--query", f"[?name=='{cluster_name}' && location=='{region}'].id", "-o", "tsv"],
+            capture_output=True,
+            text=True,
+            timeout=30
+        )
+
+        logging.info(f"Azure CLI name+region search - return code: {result.returncode}, stdout: '{result.stdout.strip()}', stderr: '{result.stderr.strip()}'")
+
+        if result.returncode == 0 and result.stdout.strip():
+            cluster_id = result.stdout.strip()
+            logging.info(f"Found AKS cluster via Azure CLI (name+region): {cluster_id}")
+            return cluster_id
+
+        # If exact match fails, try searching by name only
+        logging.info(f"Name+region search failed, trying name-only search for '{cluster_name}'...")
+        result = subprocess.run(
+            ["az", "aks", "list", "--query", f"[?name=='{cluster_name}'].id", "-o", "tsv"],
+            capture_output=True,
+            text=True,
+            timeout=30
+        )
+
+        logging.info(f"Azure CLI name-only search - return code: {result.returncode}, stdout: '{result.stdout.strip()}', stderr: '{result.stderr.strip()}'")
+
+        if result.returncode == 0 and result.stdout.strip():
+            cluster_ids = result.stdout.strip().split('\n')
+            if len(cluster_ids) == 1:
+                cluster_id = cluster_ids[0]
+                logging.info(f"Found AKS cluster by name via Azure CLI: {cluster_id}")
+                return cluster_id
+            elif len(cluster_ids) > 1:
+                logging.info(f"Multiple clusters found with name {cluster_name}, using first one: {cluster_ids[0]}")
+                return cluster_ids[0]
+        else:
+            logging.warning(f"No clusters found with name '{cluster_name}'")
+
+    except Exception as e:
+        logging.error(f"Failed to find AKS cluster via Azure CLI: {e}")
+
+    return None
+
+
+def extract_cluster_name_from_resource_id(resource_id: str) -> Optional[str]:
+    """
+    Extract the cluster name from an Azure resource ID.
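+
+    Example (pure string parsing, so it runs anywhere; the IDs are made-up
+    placeholders):
+
+        >>> extract_cluster_name_from_resource_id(
+        ...     "/subscriptions/sub/resourceGroups/rg/providers/"
+        ...     "Microsoft.ContainerService/managedClusters/my-cluster"
+        ... )
+        'my-cluster'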
+ + Args: + resource_id: Full Azure resource ID + + Returns: + str: Cluster name if extracted successfully, None otherwise + """ + try: + # Azure resource ID format: + # /subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.ContainerService/managedClusters/{cluster-name} + parts = resource_id.split("/") + if len(parts) >= 9 and parts[-2] == "managedClusters": + return parts[-1] + except Exception as e: + logging.debug(f"Failed to extract cluster name from resource ID {resource_id}: {e}") + + return None + +def execute_azure_resource_graph_query(query: str, subscription_id: str) -> Optional[Dict]: + """ + Execute an Azure Resource Graph query. + + Args: + query: The Azure Resource Graph query to execute + subscription_id: The subscription ID to query + + Returns: + dict: Query results if successful, None otherwise + """ + try: + credential = DefaultAzureCredential() + + # Create Resource Graph client + graph_client = ResourceGraphClient(credential) + + # Create the query request + query_request = QueryRequest( + query=query, + subscriptions=[subscription_id] + ) + + # Execute the query + query_response = graph_client.resources(query_request) + + if query_response and hasattr(query_response, 'data'): + return { + "data": query_response.data, + "total_records": getattr(query_response, 'total_records', 0), + "count": getattr(query_response, 'count', 0) + } + + except ImportError: + logging.warning("azure-mgmt-resourcegraph package not available. Install it with: pip install azure-mgmt-resourcegraph") + return None + except AzureError as e: + logging.error(f"Azure error executing Resource Graph query: {e}") + except Exception as e: + logging.error(f"Unexpected error executing Resource Graph query: {e}") + + return None + +def get_container_insights_workspace_for_cluster(cluster_resource_id: str) -> Optional[Dict]: + """ + Get Log Analytics workspace details for Container Insights on a given AKS cluster using Azure Resource Graph. 
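+
+    Illustrative call (needs Azure credentials with Resource Graph access; the
+    resource ID and the returned values are placeholders, and None is returned
+    when Container Insights is not configured):
+
+        >>> get_container_insights_workspace_for_cluster(
+        ...     "/subscriptions/sub/resourceGroups/rg/providers/"
+        ...     "Microsoft.ContainerService/managedClusters/my-cluster"
+        ... )  # doctest: +SKIP
+        {'log_analytics_workspace_id': '...', 'log_analytics_workspace_resource_id': '...', ...}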
+ + Args: + cluster_resource_id: Full Azure resource ID of the AKS cluster + + Returns: + dict: Container Insights and Log Analytics workspace details if found, None otherwise + """ + try: + # Extract subscription ID from cluster resource ID + parts = cluster_resource_id.split("/") + if len(parts) >= 3: + subscription_id = parts[2] + else: + logging.error(f"Invalid cluster resource ID format: {cluster_resource_id}") + return None + + # The ARG query from the requirements to detect Container Insights + query = f""" + resources + | where type == "microsoft.insights/datacollectionrules" + | extend extensions = properties.dataSources.extensions + | extend flows = properties.dataFlows + | mv-expand extensions + | where extensions.name contains "ContainerInsightsExtension" + | extend extensionStreams = extensions.streams + | extend dataCollectionSettings = extensions.extensionSettings.dataCollectionSettings + | extend loganalytics_workspaceid=properties.destinations.logAnalytics[0].workspaceId + | extend loganalytics_workspace_resourceid=properties.destinations.logAnalytics[0].workspaceResourceId + | project dcrResourceId = tolower(id), dataCollectionSettings, extensionStreams, flows, loganalytics_workspaceid, loganalytics_workspace_resourceid + | join (insightsresources | extend ClusterId = split(tolower(id), '/providers/microsoft.insights/datacollectionruleassociations/')[0] + | where ClusterId =~ "{cluster_resource_id.lower()}" + | project dcrResourceId = tolower(tostring(parse_json(properties).dataCollectionRuleId)), dcraName = name) on dcrResourceId + | project dcrResourceId, dataCollectionSettings, extensionStreams, flows, dcraName, loganalytics_workspaceid, loganalytics_workspace_full_azure_resourceid=loganalytics_workspace_resourceid + | order by dcrResourceId + """ + + result = execute_azure_resource_graph_query(query, subscription_id) + + if result and result.get("data"): + data = result["data"] + if isinstance(data, list) and len(data) > 0: + # Take the first result + first_result = data[0] + + workspace_id = first_result.get("loganalytics_workspaceid") + workspace_resource_id = first_result.get("loganalytics_workspace_full_azure_resourceid") + + if workspace_id and workspace_resource_id: + return { + "log_analytics_workspace_id": workspace_id, + "log_analytics_workspace_resource_id": workspace_resource_id, + "data_collection_rule_id": first_result.get("dcrResourceId"), + "data_collection_rule_association_name": first_result.get("dcraName"), + "data_collection_settings": first_result.get("dataCollectionSettings"), + "extension_streams": first_result.get("extensionStreams", []), + "data_flows": first_result.get("flows", []) + } + + logging.info(f"No Container Insights (Azure Monitor Logs) found for cluster {cluster_resource_id}") + return None + + except Exception as e: + logging.error(f"Failed to get Container Insights workspace for cluster {cluster_resource_id}: {e}") + return None + +def map_streams_to_log_analytics_tables(extension_streams: List[str]) -> Dict[str, str]: + """ + Map Container Insights extension streams to Log Analytics table names. 
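+
+    Example (exercises only the static mapping below, so it runs anywhere;
+    streams outside the mapping fall back to dropping the "Microsoft-" prefix
+    or are passed through unchanged):
+
+        >>> map_streams_to_log_analytics_tables(
+        ...     ["Microsoft-ContainerLogV2", "Microsoft-KubeEvents", "Custom-Stream"]
+        ... )
+        {'Microsoft-ContainerLogV2': 'ContainerLogV2', 'Microsoft-KubeEvents': 'KubeEvents', 'Custom-Stream': 'Custom-Stream'}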
+
+    Args:
+        extension_streams: List of extension stream names
+
+    Returns:
+        dict: Mapping of stream names to Log Analytics table names
+    """
+    # Common mappings from Container Insights streams to Log Analytics tables
+    stream_to_table_mapping = {
+        "Microsoft-ContainerLog": "ContainerLog",
+        "Microsoft-ContainerLogV2": "ContainerLogV2",
+        "Microsoft-KubeEvents": "KubeEvents",
+        "Microsoft-KubePodInventory": "KubePodInventory",
+        "Microsoft-KubeNodeInventory": "KubeNodeInventory",
+        "Microsoft-KubeServices": "KubeServices",
+        "Microsoft-Perf": "Perf",
+        "Microsoft-InsightsMetrics": "InsightsMetrics",
+        "Microsoft-ContainerInventory": "ContainerInventory",
+        "Microsoft-ContainerNodeInventory": "ContainerNodeInventory"
+    }
+
+    result = {}
+    for stream in extension_streams:
+        if stream in stream_to_table_mapping:
+            result[stream] = stream_to_table_mapping[stream]
+        else:
+            # Best guess: remove "Microsoft-" prefix if present
+            table_name = stream.replace("Microsoft-", "") if stream.startswith("Microsoft-") else stream
+            result[stream] = table_name
+
+    return result
+
+
+def generate_azure_mcp_guidance(workspace_details: Dict, cluster_resource_id: str) -> Dict:
+    """
+    Generate guidance for configuring Azure MCP server with detected workspace details.
+
+    Args:
+        workspace_details: Container Insights workspace details
+        cluster_resource_id: Full Azure resource ID of the AKS cluster
+
+    Returns:
+        dict: Configuration guidance for Azure MCP server (values include strings, lists, and dicts)
+    """
+    extension_streams = workspace_details.get("extension_streams", [])
+    stream_to_table = map_streams_to_log_analytics_tables(extension_streams)
+
+    available_tables = list(stream_to_table.values())
+
+    return {
+        "workspace_id": workspace_details.get("log_analytics_workspace_id"),
+        "workspace_resource_id": workspace_details.get("log_analytics_workspace_resource_id"),
+        "cluster_filter_kql": f'| where _ResourceId == "{cluster_resource_id}"',
+        "available_log_tables": available_tables,
+        "stream_to_table_mapping": stream_to_table,
+        "sample_kql_query": f'ContainerLogV2 | where _ResourceId == "{cluster_resource_id}" | limit 10'
+    }
diff --git a/poetry.lock b/poetry.lock
index fc3a16697..56ca49680 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.
+# This file is automatically @generated by Poetry 2.1.3 and should not be changed by hand.
[[package]] name = "aiohappyeyeballs" @@ -6,6 +6,7 @@ version = "2.6.1" description = "Happy Eyeballs for asyncio" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "aiohappyeyeballs-2.6.1-py3-none-any.whl", hash = "sha256:f349ba8f4b75cb25c99c5c2d84e997e485204d2902a9597802b0371f09331fb8"}, {file = "aiohappyeyeballs-2.6.1.tar.gz", hash = "sha256:c3f9d0113123803ccadfdf3f0faa505bc78e6a72d1cc4806cbd719826e943558"}, @@ -17,6 +18,7 @@ version = "3.12.15" description = "Async http client/server framework (asyncio)" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "aiohttp-3.12.15-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:b6fc902bff74d9b1879ad55f5404153e2b33a82e72a95c89cec5eb6cc9e92fbc"}, {file = "aiohttp-3.12.15-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:098e92835b8119b54c693f2f88a1dec690e20798ca5f5fe5f0520245253ee0af"}, @@ -117,7 +119,7 @@ propcache = ">=0.2.0" yarl = ">=1.17.0,<2.0" [package.extras] -speedups = ["Brotli", "aiodns (>=3.3.0)", "brotlicffi"] +speedups = ["Brotli ; platform_python_implementation == \"CPython\"", "aiodns (>=3.3.0)", "brotlicffi ; platform_python_implementation != \"CPython\""] [[package]] name = "aiosignal" @@ -125,6 +127,7 @@ version = "1.4.0" description = "aiosignal: a list of registered asynchronous callbacks" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "aiosignal-1.4.0-py3-none-any.whl", hash = "sha256:053243f8b92b990551949e63930a839ff0cf0b0ebbe0597b0f3fb19e1a0fe82e"}, {file = "aiosignal-1.4.0.tar.gz", hash = "sha256:f47eecd9468083c2029cc99945502cb7708b082c232f9aca65da147157b251c7"}, @@ -140,6 +143,7 @@ version = "0.7.0" description = "Reusable constraint types to use with typing.Annotated" optional = false python-versions = ">=3.8" +groups = ["main"] files = [ {file = "annotated_types-0.7.0-py3-none-any.whl", hash = "sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53"}, {file = "annotated_types-0.7.0.tar.gz", hash = "sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89"}, @@ -151,6 +155,7 @@ version = "4.9.0" description = "High level compatibility layer for multiple asynchronous event loop implementations" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "anyio-4.9.0-py3-none-any.whl", hash = "sha256:9f76d541cad6e36af7beb62e978876f3b41e3e04f2c1fbf0884604c0a9c4d93c"}, {file = "anyio-4.9.0.tar.gz", hash = "sha256:673c0c244e15788651a4ff38710fea9675823028a6f08a5eda409e0c9840a028"}, @@ -164,7 +169,7 @@ typing_extensions = {version = ">=4.5", markers = "python_version < \"3.13\""} [package.extras] doc = ["Sphinx (>=8.2,<9.0)", "packaging", "sphinx-autodoc-typehints (>=1.2.0)", "sphinx_rtd_theme"] -test = ["anyio[trio]", "blockbuster (>=1.5.23)", "coverage[toml] (>=7)", "exceptiongroup (>=1.2.0)", "hypothesis (>=4.0)", "psutil (>=5.9)", "pytest (>=7.0)", "trustme", "truststore (>=0.9.1)", "uvloop (>=0.21)"] +test = ["anyio[trio]", "blockbuster (>=1.5.23)", "coverage[toml] (>=7)", "exceptiongroup (>=1.2.0)", "hypothesis (>=4.0)", "psutil (>=5.9)", "pytest (>=7.0)", "trustme", "truststore (>=0.9.1) ; python_version >= \"3.10\"", "uvloop (>=0.21) ; platform_python_implementation == \"CPython\" and platform_system != \"Windows\" and python_version < \"3.14\""] trio = ["trio (>=0.26.1)"] [[package]] @@ -173,6 +178,8 @@ version = "5.0.1" description = "Timeout context manager for asyncio programs" optional = false python-versions = ">=3.8" +groups = ["main"] +markers = "python_version 
== \"3.10\"" files = [ {file = "async_timeout-5.0.1-py3-none-any.whl", hash = "sha256:39e3809566ff85354557ec2398b55e096c8364bacac9405a7a1fa429e77fe76c"}, {file = "async_timeout-5.0.1.tar.gz", hash = "sha256:d9321a7a3d5a6a5e187e824d2fa0793ce379a202935782d555d6e9d2735677d3"}, @@ -184,18 +191,19 @@ version = "25.3.0" description = "Classes Without Boilerplate" optional = false python-versions = ">=3.8" +groups = ["main", "dev"] files = [ {file = "attrs-25.3.0-py3-none-any.whl", hash = "sha256:427318ce031701fea540783410126f03899a97ffc6f61596ad581ac2e40e3bc3"}, {file = "attrs-25.3.0.tar.gz", hash = "sha256:75d7cefc7fb576747b2c81b4442d4d4a1ce0900973527c011d1030fd3bf4af1b"}, ] [package.extras] -benchmark = ["cloudpickle", "hypothesis", "mypy (>=1.11.1)", "pympler", "pytest (>=4.3.0)", "pytest-codspeed", "pytest-mypy-plugins", "pytest-xdist[psutil]"] -cov = ["cloudpickle", "coverage[toml] (>=5.3)", "hypothesis", "mypy (>=1.11.1)", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins", "pytest-xdist[psutil]"] -dev = ["cloudpickle", "hypothesis", "mypy (>=1.11.1)", "pre-commit-uv", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins", "pytest-xdist[psutil]"] +benchmark = ["cloudpickle ; platform_python_implementation == \"CPython\"", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pympler", "pytest (>=4.3.0)", "pytest-codspeed", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"] +cov = ["cloudpickle ; platform_python_implementation == \"CPython\"", "coverage[toml] (>=5.3)", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"] +dev = ["cloudpickle ; platform_python_implementation == \"CPython\"", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pre-commit-uv", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"] docs = ["cogapp", "furo", "myst-parser", "sphinx", "sphinx-notfound-page", "sphinxcontrib-towncrier", "towncrier"] -tests = ["cloudpickle", "hypothesis", "mypy (>=1.11.1)", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins", "pytest-xdist[psutil]"] -tests-mypy = ["mypy (>=1.11.1)", "pytest-mypy-plugins"] +tests = ["cloudpickle ; platform_python_implementation == \"CPython\"", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"] +tests-mypy = ["mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\""] [[package]] name = "autoevals" @@ -203,6 +211,7 @@ version = "0.0.129" description = "Universal library for evaluating AI models" optional = false python-versions = ">=3.8.0" +groups = ["dev"] files = [ {file = "autoevals-0.0.129-py3-none-any.whl", hash = "sha256:7240e4e4bf1843bb5bc688b71fe2c6159596d3b5891bf34576941f17e04fe3ba"}, {file = "autoevals-0.0.129.tar.gz", hash = 
"sha256:b7a6e45f8d4dd2bec0666602c78515b2f2c9f1a5c2a6b6275ad6cc3cac63e348"}, @@ -227,6 +236,7 @@ version = "1.1.28" description = "Microsoft Azure Client Library for Python (Common)" optional = false python-versions = "*" +groups = ["main"] files = [ {file = "azure-common-1.1.28.zip", hash = "sha256:4ac0cd3214e36b6a1b6a442686722a5d8cc449603aa833f3f0f40bda836704a3"}, {file = "azure_common-1.1.28-py2.py3-none-any.whl", hash = "sha256:5c12d3dcf4ec20599ca6b0d3e09e86e146353d443e7fcc050c9a19c1f9df20ad"}, @@ -238,6 +248,7 @@ version = "1.35.0" description = "Microsoft Azure Core Library for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "azure_core-1.35.0-py3-none-any.whl", hash = "sha256:8db78c72868a58f3de8991eb4d22c4d368fae226dac1002998d6c50437e7dad1"}, {file = "azure_core-1.35.0.tar.gz", hash = "sha256:c0be528489485e9ede59b6971eb63c1eaacf83ef53001bfe3904e475e972be5c"}, @@ -258,6 +269,7 @@ version = "1.23.1" description = "Microsoft Azure Identity Library for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "azure_identity-1.23.1-py3-none-any.whl", hash = "sha256:7eed28baa0097a47e3fb53bd35a63b769e6b085bb3cb616dfce2b67f28a004a1"}, {file = "azure_identity-1.23.1.tar.gz", hash = "sha256:226c1ef982a9f8d5dcf6e0f9ed35eaef2a4d971e7dd86317e9b9d52e70a035e4"}, @@ -276,6 +288,7 @@ version = "1.0.0" description = "Microsoft Azure Alerts Management Client Library for Python" optional = false python-versions = "*" +groups = ["main"] files = [ {file = "azure-mgmt-alertsmanagement-1.0.0.zip", hash = "sha256:b2988efb77fe7f2fa795ed763efc3f48d836ed3597bdd24b66fa02d58143bb3f"}, {file = "azure_mgmt_alertsmanagement-1.0.0-py2.py3-none-any.whl", hash = "sha256:4d87e1e5b90dad0cc7d722bd8f06d55a70e11148920566d8593694583a8f8084"}, @@ -292,6 +305,7 @@ version = "1.6.0" description = "Microsoft Azure Management Core Library for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "azure_mgmt_core-1.6.0-py3-none-any.whl", hash = "sha256:0460d11e85c408b71c727ee1981f74432bc641bb25dfcf1bb4e90a49e776dbc4"}, {file = "azure_mgmt_core-1.6.0.tar.gz", hash = "sha256:b26232af857b021e61d813d9f4ae530465255cb10b3dde945ad3743f7a58e79c"}, @@ -306,6 +320,7 @@ version = "7.0.0" description = "Microsoft Azure Monitor Client Library for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "azure_mgmt_monitor-7.0.0-py3-none-any.whl", hash = "sha256:ad63b5d187e21d2d34366271ade6abbeea1fcf76e313ff0f83d394d9c124aa1b"}, {file = "azure_mgmt_monitor-7.0.0.tar.gz", hash = "sha256:b75f536441d430f69ff873a1646e5f5dbcb3080a10768a59d0adc01541623816"}, @@ -323,6 +338,7 @@ version = "23.4.0" description = "Microsoft Azure Resource Management Client Library for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "azure_mgmt_resource-23.4.0-py3-none-any.whl", hash = "sha256:17282d20dc643fb68a286c2c350a191dc3bf386827f65a7daeeb714367a19316"}, {file = "azure_mgmt_resource-23.4.0.tar.gz", hash = "sha256:7cc0909184bd01439e245f5f2e20945a3668d45a6774e1f008227bb33a733d16"}, @@ -334,12 +350,30 @@ azure-mgmt-core = ">=1.5.0" isodate = ">=0.6.1" typing-extensions = ">=4.6.0" +[[package]] +name = "azure-mgmt-resourcegraph" +version = "8.0.0" +description = "Microsoft Azure Resource Graph Client Library for Python" +optional = false +python-versions = "*" +groups = ["main"] +files = [ + {file = "azure-mgmt-resourcegraph-8.0.0.zip", hash = 
"sha256:d25f01dae3897780fb3ddca16d1625b6347c32f1b581c767fba5ef3b24443f11"}, + {file = "azure_mgmt_resourcegraph-8.0.0-py2.py3-none-any.whl", hash = "sha256:0cf55f7ea82dc03e69d0fae0f1606e09b08b80b6ae23bd597d8b62b1ed938ace"}, +] + +[package.dependencies] +azure-common = ">=1.1,<2.0" +azure-mgmt-core = ">=1.2.0,<2.0.0" +msrest = ">=0.6.21" + [[package]] name = "azure-mgmt-sql" version = "4.0.0b22" description = "Microsoft Azure SQL Management Client Library for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "azure_mgmt_sql-4.0.0b22-py3-none-any.whl", hash = "sha256:79261d114512e14d014193b4b533a3b9a190f7cafe924f36b4015b1752dfa75c"}, {file = "azure_mgmt_sql-4.0.0b22.tar.gz", hash = "sha256:92edd837d5bd0b2c78cec2b102ce24f7fa1e0d7029ce2daea80511a9aef61f49"}, @@ -357,6 +391,7 @@ version = "1.4.1" description = "Microsoft Azure Monitor Query Client Library for Python" optional = false python-versions = ">=3.8" +groups = ["main"] files = [ {file = "azure_monitor_query-1.4.1-py3-none-any.whl", hash = "sha256:192c5d8efec48b434f803aa3850ce73b869975507a12b3a823012fc100056ac0"}, {file = "azure_monitor_query-1.4.1.tar.gz", hash = "sha256:71824e2b577d25df0d3bebbbb054c06a1ae3ebcb91831a9bac0bb344d0addf68"}, @@ -373,13 +408,14 @@ version = "2.17.0" description = "Internationalization utilities" optional = false python-versions = ">=3.8" +groups = ["dev"] files = [ {file = "babel-2.17.0-py3-none-any.whl", hash = "sha256:4d0b53093fdfb4b21c92b5213dba5a1b23885afa8383709427046b21c366e5f2"}, {file = "babel-2.17.0.tar.gz", hash = "sha256:0c54cffb19f690cdcc52a3b50bcbf71e07a808d1c80d549f2459b9d2cf0afb9d"}, ] [package.extras] -dev = ["backports.zoneinfo", "freezegun (>=1.0,<2.0)", "jinja2 (>=3.0)", "pytest (>=6.0)", "pytest-cov", "pytz", "setuptools", "tzdata"] +dev = ["backports.zoneinfo ; python_version < \"3.9\"", "freezegun (>=1.0,<2.0)", "jinja2 (>=3.0)", "pytest (>=6.0)", "pytest-cov", "pytz", "setuptools", "tzdata ; sys_platform == \"win32\""] [[package]] name = "backoff" @@ -387,6 +423,7 @@ version = "2.2.1" description = "Function decoration for backoff and retry" optional = false python-versions = ">=3.7,<4.0" +groups = ["main"] files = [ {file = "backoff-2.2.1-py3-none-any.whl", hash = "sha256:63579f9a0628e06278f7e47b7d7d5b6ce20dc65c5e96a6f3ca99a6adca0396e8"}, {file = "backoff-2.2.1.tar.gz", hash = "sha256:03f829f5bb1923180821643f8753b0502c3b682293992485b0eef2807afa5cba"}, @@ -398,6 +435,7 @@ version = "5.9" description = "A wrapper around re and regex that adds additional back references." 
optional = false python-versions = ">=3.9" +groups = ["dev"] files = [ {file = "backrefs-5.9-py310-none-any.whl", hash = "sha256:db8e8ba0e9de81fcd635f440deab5ae5f2591b54ac1ebe0550a2ca063488cd9f"}, {file = "backrefs-5.9-py311-none-any.whl", hash = "sha256:6907635edebbe9b2dc3de3a2befff44d74f30a4562adbb8b36f21252ea19c5cf"}, @@ -417,6 +455,7 @@ version = "4.13.4" description = "Screen-scraping library" optional = false python-versions = ">=3.7.0" +groups = ["main"] files = [ {file = "beautifulsoup4-4.13.4-py3-none-any.whl", hash = "sha256:9bbbb14bfde9d79f38b8cd5f8c7c85f4b8f2523190ebed90e950a8dea4cb1c4b"}, {file = "beautifulsoup4-4.13.4.tar.gz", hash = "sha256:dbb3c4e1ceae6aefebdaf2423247260cd062430a410e38c66f2baa50a8437195"}, @@ -435,17 +474,18 @@ lxml = ["lxml"] [[package]] name = "boto3" -version = "1.39.17" +version = "1.40.1" description = "The AWS SDK for Python" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ - {file = "boto3-1.39.17-py3-none-any.whl", hash = "sha256:6af9f7d6db7b5e72d6869ae22ebad1b0c6602591af2ef5d914b331a055953df5"}, - {file = "boto3-1.39.17.tar.gz", hash = "sha256:a6904a40b1c61f6a1766574b3155ec75a6020399fb570be2b51bf93a2c0a2b3d"}, + {file = "boto3-1.40.1-py3-none-any.whl", hash = "sha256:7c007d5c8ee549e9fcad0927536502da199b27891006ef515330f429aca9671f"}, + {file = "boto3-1.40.1.tar.gz", hash = "sha256:985ed4bf64729807f870eadbc46ad98baf93096917f7194ec39d743ff75b3f1d"}, ] [package.dependencies] -botocore = ">=1.39.17,<1.40.0" +botocore = ">=1.40.1,<1.41.0" jmespath = ">=0.7.1,<2.0.0" s3transfer = ">=0.13.0,<0.14.0" @@ -454,13 +494,14 @@ crt = ["botocore[crt] (>=1.21.0,<2.0a0)"] [[package]] name = "botocore" -version = "1.39.17" +version = "1.40.1" description = "Low-level, data-driven core of boto 3." optional = false python-versions = ">=3.9" +groups = ["main"] files = [ - {file = "botocore-1.39.17-py3-none-any.whl", hash = "sha256:41db169e919f821b3ef684794c5e67dd7bb1f5ab905d33729b1d8c27fafe8004"}, - {file = "botocore-1.39.17.tar.gz", hash = "sha256:1a1f0b29dab5d1b10d16f14423c16ac0a3043272f579e9ab0d757753ee9a7d2b"}, + {file = "botocore-1.40.1-py3-none-any.whl", hash = "sha256:e039774b55fbd6fe59f0f4fea51d156a2433bd4d8faa64fc1b87aee9a03f415d"}, + {file = "botocore-1.40.1.tar.gz", hash = "sha256:bdf30e2c0e8cdb939d81fc243182a6d1dd39c416694b406c5f2ea079b1c2f3f5"}, ] [package.dependencies] @@ -477,6 +518,7 @@ version = "0.1.8" description = "SDK for integrating Braintrust" optional = false python-versions = ">=3.8.0" +groups = ["dev"] files = [ {file = "braintrust-0.1.8-py3-none-any.whl", hash = "sha256:ce61d4468e6b0f26198e1bf00d9219cf1ea53df32c3dc17b8f82ac8c632cccfe"}, {file = "braintrust-0.1.8.tar.gz", hash = "sha256:e79cd8bb90791fb5748ecb4df7940ee1efe4c20796dfb17909bc99841a5b2b3b"}, @@ -506,6 +548,7 @@ version = "0.0.59" description = "Shared core dependencies for Braintrust packages" optional = false python-versions = ">=3.8.0" +groups = ["dev"] files = [ {file = "braintrust_core-0.0.59-py3-none-any.whl", hash = "sha256:b9be128e1c1b4c376f082e81d314c1938aa9b8c0398ab11df4ad29fad8e655c1"}, {file = "braintrust_core-0.0.59.tar.gz", hash = "sha256:5e8f34e354a536ea8777ce2f80dfc5e93fd0c4d6d50c545e77a6792e8c5e9d49"}, @@ -517,6 +560,7 @@ version = "0.0.2" description = "Dummy package for Beautiful Soup (beautifulsoup4)" optional = false python-versions = "*" +groups = ["main"] files = [ {file = "bs4-0.0.2-py2.py3-none-any.whl", hash = "sha256:abf8742c0805ef7f662dce4b51cca104cffe52b835238afc169142ab9b3fbccc"}, {file = "bs4-0.0.2.tar.gz", hash = 
"sha256:a48685c58f50fe127722417bae83fe6badf500d54b55f7e39ffe43b798653925"}, @@ -531,6 +575,7 @@ version = "5.5.2" description = "Extensible memoizing collections and decorators" optional = false python-versions = ">=3.7" +groups = ["main"] files = [ {file = "cachetools-5.5.2-py3-none-any.whl", hash = "sha256:d26a22bcc62eb95c3beabd9f1ee5e820d3d2704fe2967cbe350e20c8ffcd3f0a"}, {file = "cachetools-5.5.2.tar.gz", hash = "sha256:1a661caa9175d26759571b2e19580f9d6393969e5dfca11fdb1f947a23e640d4"}, @@ -542,6 +587,7 @@ version = "2024.12.14" description = "Python package for providing Mozilla's CA Bundle." optional = false python-versions = ">=3.6" +groups = ["main", "dev"] files = [ {file = "certifi-2024.12.14-py3-none-any.whl", hash = "sha256:1275f7a45be9464efc1173084eaa30f866fe2e47d389406136d332ed4967ec56"}, {file = "certifi-2024.12.14.tar.gz", hash = "sha256:b650d30f370c2b724812bee08008be0c4163b163ddaec3f2546c1caf65f191db"}, @@ -553,6 +599,8 @@ version = "1.17.1" description = "Foreign Function Interface for Python calling C code." optional = false python-versions = ">=3.8" +groups = ["main"] +markers = "platform_python_implementation != \"PyPy\"" files = [ {file = "cffi-1.17.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:df8b1c11f177bc2313ec4b2d46baec87a5f3e71fc8b45dab2ee7cae86d9aba14"}, {file = "cffi-1.17.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:8f2cdc858323644ab277e9bb925ad72ae0e67f69e804f4898c070998d50b1a67"}, @@ -632,6 +680,7 @@ version = "3.4.0" description = "Validate configuration and produce human readable error messages." optional = false python-versions = ">=3.8" +groups = ["dev"] files = [ {file = "cfgv-3.4.0-py2.py3-none-any.whl", hash = "sha256:b7265b1f29fd3316bfcd2b330d63d024f2bfd8bcb8b0272f8e19a504856c48f9"}, {file = "cfgv-3.4.0.tar.gz", hash = "sha256:e52591d4c5f5dead8e0f673fb16db7949d2cfb3f7da4582893288f0ded8fe560"}, @@ -643,6 +692,7 @@ version = "3.4.2" description = "The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet." optional = false python-versions = ">=3.7" +groups = ["main", "dev"] files = [ {file = "charset_normalizer-3.4.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:7c48ed483eb946e6c04ccbe02c6b4d1d48e51944b6db70f697e089c193404941"}, {file = "charset_normalizer-3.4.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b2d318c11350e10662026ad0eb71bb51c7812fc8590825304ae0bdd4ac283acd"}, @@ -744,6 +794,7 @@ version = "0.14.0" description = "Mustache templating language renderer" optional = false python-versions = "*" +groups = ["dev"] files = [ {file = "chevron-0.14.0-py3-none-any.whl", hash = "sha256:fbf996a709f8da2e745ef763f482ce2d311aa817d287593a5b990d6d6e4f0443"}, {file = "chevron-0.14.0.tar.gz", hash = "sha256:87613aafdf6d77b6a90ff073165a61ae5086e21ad49057aa0e53681601800ebf"}, @@ -755,6 +806,7 @@ version = "8.1.8" description = "Composable command line interface toolkit" optional = false python-versions = ">=3.7" +groups = ["main", "dev"] files = [ {file = "click-8.1.8-py3-none-any.whl", hash = "sha256:63c132bbbed01578a06712a2d1f497bb62d9c1c0d329b7903a866228027263b2"}, {file = "click-8.1.8.tar.gz", hash = "sha256:ed53c9d8990d83c2a27deae68e4ee337473f6330c040a31d4225c9574d16096a"}, @@ -769,10 +821,12 @@ version = "0.4.6" description = "Cross-platform colored terminal text." 
optional = false python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,>=2.7" +groups = ["main", "dev"] files = [ {file = "colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6"}, {file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"}, ] +markers = {main = "platform_system == \"Windows\" or sys_platform == \"win32\""} [[package]] name = "colorlog" @@ -780,6 +834,7 @@ version = "6.9.0" description = "Add colours to the output of Python's logging module." optional = false python-versions = ">=3.6" +groups = ["main"] files = [ {file = "colorlog-6.9.0-py3-none-any.whl", hash = "sha256:5906e71acd67cb07a71e779c47c4bcb45fb8c2993eebe9e5adcd6a6f1b283eff"}, {file = "colorlog-6.9.0.tar.gz", hash = "sha256:bfba54a1b93b94f54e1f4fe48395725a3d92fd2a4af702f6bd70946bdc0c6ac2"}, @@ -797,6 +852,7 @@ version = "2.11.0" description = "Confluent's Python client for Apache Kafka" optional = false python-versions = ">=3.7" +groups = ["main"] files = [ {file = "confluent_kafka-2.11.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:ae672723577775aba560da34e376ffa4a038ff3e07c8513920b81903f8f9c4e8"}, {file = "confluent_kafka-2.11.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:fc8346a55c4b3f4e4ed938243997e3223e0f00c93c48d52a86343943c0316a6c"}, @@ -836,18 +892,18 @@ files = [ ] [package.extras] -all = ["async-timeout", "attrs", "attrs", "authlib (>=1.0.0)", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "avro (>=1.11.1,<2)", "azure-identity", "azure-identity", "azure-keyvault-keys", "azure-keyvault-keys", "boto3", "boto3 (>=1.35)", "cachetools", "cachetools", "cel-python (>=0.1.5)", "cel-python (>=0.1.5)", "confluent-kafka", "fastapi", "fastavro (<1.8.0)", "fastavro (<1.8.0)", "fastavro (<2)", "fastavro (<2)", "flake8", "google-api-core", "google-api-core", "google-auth", "google-auth", "google-cloud-kms", "google-cloud-kms", "googleapis-common-protos", "googleapis-common-protos", "hkdf (==0.0.3)", "hkdf (==0.0.3)", "httpx (>=0.26)", "httpx (>=0.26)", "hvac", "hvac", "jsonata-python", "jsonata-python", "jsonschema", "jsonschema", "opentelemetry-distro", "opentelemetry-exporter-otlp", "orjson", "pluggy (<1.6.0)", "protobuf", "protobuf", "psutil", "pydantic", "pyrsistent", "pyrsistent", "pytest", "pytest-asyncio", "pytest-timeout", "pytest_cov", "pyyaml (>=6.0.0)", "pyyaml (>=6.0.0)", "requests", "requests", "requests-mock", "respx", "six", "sphinx", "sphinx-rtd-theme", "tink", "tink", "urllib3 (<2)", "urllib3 (<3)", "uvicorn"] -avro = ["attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "cachetools", "fastavro (<1.8.0)", "fastavro (<2)", "httpx (>=0.26)", "requests"] -dev = ["async-timeout", "attrs", "attrs", "authlib (>=1.0.0)", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "avro (>=1.11.1,<2)", "azure-identity", "azure-identity", "azure-keyvault-keys", "azure-keyvault-keys", "boto3", "boto3 (>=1.35)", "cachetools", "cachetools", "cel-python (>=0.1.5)", "cel-python (>=0.1.5)", "confluent-kafka", "fastapi", "fastavro (<1.8.0)", "fastavro (<1.8.0)", "fastavro (<2)", "fastavro (<2)", "flake8", "google-api-core", "google-api-core", "google-auth", "google-auth", "google-cloud-kms", "google-cloud-kms", "googleapis-common-protos", "googleapis-common-protos", "hkdf (==0.0.3)", "hkdf (==0.0.3)", "httpx (>=0.26)", "httpx (>=0.26)", "hvac", "hvac", "jsonata-python", "jsonata-python", "jsonschema", "jsonschema", "orjson", "pluggy (<1.6.0)", "protobuf", "protobuf", "pydantic", 
"pyrsistent", "pyrsistent", "pytest", "pytest-asyncio", "pytest-timeout", "pytest_cov", "pyyaml (>=6.0.0)", "pyyaml (>=6.0.0)", "requests", "requests", "requests-mock", "respx", "six", "sphinx", "sphinx-rtd-theme", "tink", "tink", "urllib3 (<2)", "urllib3 (<3)", "uvicorn"] -docs = ["attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "azure-identity", "azure-keyvault-keys", "boto3 (>=1.35)", "cachetools", "cel-python (>=0.1.5)", "fastavro (<1.8.0)", "fastavro (<2)", "google-api-core", "google-auth", "google-cloud-kms", "googleapis-common-protos", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "jsonschema", "protobuf", "pyrsistent", "pyyaml (>=6.0.0)", "requests", "sphinx", "sphinx-rtd-theme", "tink"] -examples = ["attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "azure-identity", "azure-keyvault-keys", "boto3", "cachetools", "cel-python (>=0.1.5)", "confluent-kafka", "fastapi", "fastavro (<1.8.0)", "fastavro (<2)", "google-api-core", "google-auth", "google-cloud-kms", "googleapis-common-protos", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "jsonschema", "protobuf", "pydantic", "pyrsistent", "pyyaml (>=6.0.0)", "requests", "six", "tink", "uvicorn"] +all = ["async-timeout", "attrs", "attrs", "authlib (>=1.0.0)", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "avro (>=1.11.1,<2)", "azure-identity", "azure-identity", "azure-keyvault-keys", "azure-keyvault-keys", "boto3", "boto3 (>=1.35)", "cachetools", "cachetools", "cel-python (>=0.1.5)", "cel-python (>=0.1.5)", "confluent-kafka", "fastapi", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "flake8", "google-api-core", "google-api-core", "google-auth", "google-auth", "google-cloud-kms", "google-cloud-kms", "googleapis-common-protos", "googleapis-common-protos", "hkdf (==0.0.3)", "hkdf (==0.0.3)", "httpx (>=0.26)", "httpx (>=0.26)", "hvac", "hvac", "jsonata-python", "jsonata-python", "jsonschema", "jsonschema", "opentelemetry-distro", "opentelemetry-exporter-otlp", "orjson", "pluggy (<1.6.0)", "protobuf", "protobuf", "psutil", "pydantic", "pyrsistent", "pyrsistent", "pytest", "pytest-asyncio", "pytest-timeout", "pytest_cov", "pyyaml (>=6.0.0)", "pyyaml (>=6.0.0)", "requests", "requests", "requests-mock", "respx", "six", "sphinx", "sphinx-rtd-theme", "tink", "tink", "urllib3 (<2) ; python_version <= \"3.7\"", "urllib3 (<3) ; python_version > \"3.7\"", "uvicorn"] +avro = ["attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "cachetools", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "httpx (>=0.26)", "requests"] +dev = ["async-timeout", "attrs", "attrs", "authlib (>=1.0.0)", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "avro (>=1.11.1,<2)", "azure-identity", "azure-identity", "azure-keyvault-keys", "azure-keyvault-keys", "boto3", "boto3 (>=1.35)", "cachetools", "cachetools", "cel-python (>=0.1.5)", "cel-python (>=0.1.5)", "confluent-kafka", "fastapi", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "flake8", "google-api-core", "google-api-core", "google-auth", "google-auth", "google-cloud-kms", "google-cloud-kms", "googleapis-common-protos", "googleapis-common-protos", "hkdf (==0.0.3)", "hkdf (==0.0.3)", "httpx (>=0.26)", "httpx (>=0.26)", "hvac", "hvac", "jsonata-python", "jsonata-python", "jsonschema", 
"jsonschema", "orjson", "pluggy (<1.6.0)", "protobuf", "protobuf", "pydantic", "pyrsistent", "pyrsistent", "pytest", "pytest-asyncio", "pytest-timeout", "pytest_cov", "pyyaml (>=6.0.0)", "pyyaml (>=6.0.0)", "requests", "requests", "requests-mock", "respx", "six", "sphinx", "sphinx-rtd-theme", "tink", "tink", "urllib3 (<2) ; python_version <= \"3.7\"", "urllib3 (<3) ; python_version > \"3.7\"", "uvicorn"] +docs = ["attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "azure-identity", "azure-keyvault-keys", "boto3 (>=1.35)", "cachetools", "cel-python (>=0.1.5)", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "google-api-core", "google-auth", "google-cloud-kms", "googleapis-common-protos", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "jsonschema", "protobuf", "pyrsistent", "pyyaml (>=6.0.0)", "requests", "sphinx", "sphinx-rtd-theme", "tink"] +examples = ["attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "azure-identity", "azure-keyvault-keys", "boto3", "cachetools", "cel-python (>=0.1.5)", "confluent-kafka", "fastapi", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "google-api-core", "google-auth", "google-cloud-kms", "googleapis-common-protos", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "jsonschema", "protobuf", "pydantic", "pyrsistent", "pyyaml (>=6.0.0)", "requests", "six", "tink", "uvicorn"] json = ["attrs", "authlib (>=1.0.0)", "cachetools", "httpx (>=0.26)", "jsonschema", "pyrsistent"] protobuf = ["attrs", "authlib (>=1.0.0)", "cachetools", "googleapis-common-protos", "httpx (>=0.26)", "protobuf"] rules = ["attrs", "authlib (>=1.0.0)", "azure-identity", "azure-keyvault-keys", "boto3 (>=1.35)", "cachetools", "cel-python (>=0.1.5)", "google-api-core", "google-auth", "google-cloud-kms", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "pyyaml (>=6.0.0)", "tink"] schema-registry = ["attrs", "authlib (>=1.0.0)", "cachetools", "httpx (>=0.26)"] schemaregistry = ["attrs", "authlib (>=1.0.0)", "cachetools", "httpx (>=0.26)"] soaktest = ["opentelemetry-distro", "opentelemetry-exporter-otlp", "psutil"] -tests = ["async-timeout", "attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "azure-identity", "azure-keyvault-keys", "boto3 (>=1.35)", "cachetools", "cel-python (>=0.1.5)", "fastavro (<1.8.0)", "fastavro (<2)", "flake8", "google-api-core", "google-auth", "google-cloud-kms", "googleapis-common-protos", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "jsonschema", "orjson", "pluggy (<1.6.0)", "protobuf", "pyrsistent", "pytest", "pytest-asyncio", "pytest-timeout", "pytest_cov", "pyyaml (>=6.0.0)", "requests", "requests-mock", "respx", "tink", "urllib3 (<2)", "urllib3 (<3)"] +tests = ["async-timeout", "attrs", "authlib (>=1.0.0)", "avro (>=1.11.1,<2)", "azure-identity", "azure-keyvault-keys", "boto3 (>=1.35)", "cachetools", "cel-python (>=0.1.5)", "fastavro (<1.8.0) ; python_version == \"3.7\"", "fastavro (<2) ; python_version > \"3.7\"", "flake8", "google-api-core", "google-auth", "google-cloud-kms", "googleapis-common-protos", "hkdf (==0.0.3)", "httpx (>=0.26)", "hvac", "jsonata-python", "jsonschema", "orjson", "pluggy (<1.6.0)", "protobuf", "pyrsistent", "pytest", "pytest-asyncio", "pytest-timeout", "pytest_cov", "pyyaml (>=6.0.0)", "requests", "requests-mock", "respx", "tink", "urllib3 (<2) ; python_version <= \"3.7\"", "urllib3 (<3) ; python_version > \"3.7\""] [[package]] name = "coverage" @@ -855,6 +911,7 @@ version = "7.10.1" description 
= "Code coverage measurement for Python" optional = false python-versions = ">=3.9" +groups = ["dev"] files = [ {file = "coverage-7.10.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:1c86eb388bbd609d15560e7cc0eb936c102b6f43f31cf3e58b4fd9afe28e1372"}, {file = "coverage-7.10.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6b4ba0f488c1bdb6bd9ba81da50715a372119785458831c73428a8566253b86b"}, @@ -950,7 +1007,7 @@ files = [ tomli = {version = "*", optional = true, markers = "python_full_version <= \"3.11.0a6\" and extra == \"toml\""} [package.extras] -toml = ["tomli"] +toml = ["tomli ; python_full_version <= \"3.11.0a6\""] [[package]] name = "cryptography" @@ -958,6 +1015,7 @@ version = "45.0.5" description = "cryptography is a package which provides cryptographic recipes and primitives to Python developers." optional = false python-versions = "!=3.9.0,!=3.9.1,>=3.7" +groups = ["main"] files = [ {file = "cryptography-45.0.5-cp311-abi3-macosx_10_9_universal2.whl", hash = "sha256:101ee65078f6dd3e5a028d4f19c07ffa4dd22cce6a20eaa160f8b5219911e7d8"}, {file = "cryptography-45.0.5-cp311-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:3a264aae5f7fbb089dbc01e0242d3b67dffe3e6292e1f5182122bdf58e65215d"}, @@ -1002,10 +1060,10 @@ files = [ cffi = {version = ">=1.14", markers = "platform_python_implementation != \"PyPy\""} [package.extras] -docs = ["sphinx (>=5.3.0)", "sphinx-inline-tabs", "sphinx-rtd-theme (>=3.0.0)"] +docs = ["sphinx (>=5.3.0)", "sphinx-inline-tabs ; python_full_version >= \"3.8.0\"", "sphinx-rtd-theme (>=3.0.0) ; python_full_version >= \"3.8.0\""] docstest = ["pyenchant (>=3)", "readme-renderer (>=30.0)", "sphinxcontrib-spelling (>=7.3.1)"] -nox = ["nox (>=2024.4.15)", "nox[uv] (>=2024.3.2)"] -pep8test = ["check-sdist", "click (>=8.0.1)", "mypy (>=1.4)", "ruff (>=0.3.6)"] +nox = ["nox (>=2024.4.15)", "nox[uv] (>=2024.3.2) ; python_full_version >= \"3.8.0\""] +pep8test = ["check-sdist ; python_full_version >= \"3.8.0\"", "click (>=8.0.1)", "mypy (>=1.4)", "ruff (>=0.3.6)"] sdist = ["build (>=1.0.0)"] ssh = ["bcrypt (>=3.1.5)"] test = ["certifi (>=2024)", "cryptography-vectors (==45.0.5)", "pretend (>=0.7)", "pytest (>=7.4.0)", "pytest-benchmark (>=4.0)", "pytest-cov (>=2.10.1)", "pytest-xdist (>=3.5.0)"] @@ -1017,6 +1075,7 @@ version = "2.1.0" description = "A library to handle automated deprecations" optional = false python-versions = "*" +groups = ["main"] files = [ {file = "deprecation-2.1.0-py2.py3-none-any.whl", hash = "sha256:a10811591210e1fb0e768a8c25517cabeabcba6f0bf96564f8ff45189f90b14a"}, {file = "deprecation-2.1.0.tar.gz", hash = "sha256:72b3bde64e5d778694b0cf68178aed03d15e15477116add3fb773e581f9518ff"}, @@ -1031,6 +1090,7 @@ version = "0.4.0" description = "Distribution utilities" optional = false python-versions = "*" +groups = ["dev"] files = [ {file = "distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16"}, {file = "distlib-0.4.0.tar.gz", hash = "sha256:feec40075be03a04501a973d81f633735b4b69f98b05450592310c0f401a4e0d"}, @@ -1042,6 +1102,7 @@ version = "1.9.0" description = "Distro - an OS platform information API" optional = false python-versions = ">=3.6" +groups = ["main"] files = [ {file = "distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2"}, {file = "distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed"}, @@ -1053,6 +1114,7 @@ version = "0.10" description = "Module for 
converting between datetime.timedelta and Go's Duration strings."
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "durationpy-0.10-py3-none-any.whl", hash = "sha256:3b41e1b601234296b4fb368338fdcd3e13e0b4fb5b67345948f4f2bf9868b286"},
     {file = "durationpy-0.10.tar.gz", hash = "sha256:1fa6893409a6e739c9c72334fc65cca1f355dbdd93405d30f726deb5bde42fba"},
@@ -1064,6 +1126,7 @@ version = "0.5"
 description = "Bringing the elegance of C# EventHandler to Python"
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "Events-0.5-py3-none-any.whl", hash = "sha256:a7286af378ba3e46640ac9825156c93bdba7502174dd696090fdfcd4d80a1abd"},
 ]
@@ -1074,10 +1137,12 @@ version = "1.3.0"
 description = "Backport of PEP 654 (exception groups)"
 optional = false
 python-versions = ">=3.7"
+groups = ["main", "dev"]
 files = [
     {file = "exceptiongroup-1.3.0-py3-none-any.whl", hash = "sha256:4d111e6e0c13d0644cad6ddaa7ed0261a0b36971f6d23e7ec9b4b9097da78a10"},
     {file = "exceptiongroup-1.3.0.tar.gz", hash = "sha256:b241f5885f560bc56a59ee63ca4c6a8bfa46ae4ad651af316d4e81817bb9fd88"},
 ]
+markers = {main = "python_version == \"3.10\""}

 [package.dependencies]
 typing-extensions = {version = ">=4.6.0", markers = "python_version < \"3.13\""}
@@ -1091,6 +1156,7 @@ version = "2.1.1"
 description = "execnet: rapid multi-Python deployment"
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "execnet-2.1.1-py3-none-any.whl", hash = "sha256:26dee51f1b80cebd6d0ca8e74dd8745419761d3bef34163928cbebbdc4749fdc"},
     {file = "execnet-2.1.1.tar.gz", hash = "sha256:5189b52c6121c24feae288166ab41b32549c7e2348652736540b9e6e7d4e72e3"},
@@ -1105,6 +1171,7 @@ version = "0.116.1"
 description = "FastAPI framework, high performance, easy to learn, fast to code, ready for production"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "fastapi-0.116.1-py3-none-any.whl", hash = "sha256:c46ac7c312df840f0c9e220f7964bada936781bc4e2e6eb71f1c4d7553786565"},
     {file = "fastapi-0.116.1.tar.gz", hash = "sha256:ed52cbf946abfd70c5a0dccb24673f0670deeb517a88b3544d03c2a6bf283143"},
@@ -1126,6 +1193,7 @@ version = "3.18.0"
 description = "A platform independent file lock."
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "filelock-3.18.0-py3-none-any.whl", hash = "sha256:c401f4f8377c4464e6db25fff06205fd89bdd83b65eb0488ed1b160f780e21de"},
     {file = "filelock-3.18.0.tar.gz", hash = "sha256:adbc88eabb99d2fec8c9c1b229b171f18afa655400173ddc653d5d01501fb9f2"},
@@ -1134,7 +1202,7 @@ files = [
 [package.extras]
 docs = ["furo (>=2024.8.6)", "sphinx (>=8.1.3)", "sphinx-autodoc-typehints (>=3)"]
 testing = ["covdefaults (>=2.3)", "coverage (>=7.6.10)", "diff-cover (>=9.2.1)", "pytest (>=8.3.4)", "pytest-asyncio (>=0.25.2)", "pytest-cov (>=6)", "pytest-mock (>=3.14)", "pytest-timeout (>=2.3.1)", "virtualenv (>=20.28.1)"]
-typing = ["typing-extensions (>=4.12.2)"]
+typing = ["typing-extensions (>=4.12.2) ; python_version < \"3.11\""]

 [[package]]
 name = "freezegun"
@@ -1142,6 +1210,7 @@ version = "1.5.4"
 description = "Let your Python tests travel through time"
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "freezegun-1.5.4-py3-none-any.whl", hash = "sha256:8bdd75c9d790f53d5a173d273064ccd7900984b36635be552befeedb0cd47b20"},
     {file = "freezegun-1.5.4.tar.gz", hash = "sha256:798b9372fdd4d907f33e8b6a58bc64e682d9ffa8d494ce60f780197ee81faed1"},
@@ -1156,6 +1225,7 @@ version = "1.7.0"
 description = "A list-like structure which implements collections.abc.MutableSequence"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "frozenlist-1.7.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:cc4df77d638aa2ed703b878dd093725b72a824c3c546c076e8fdf276f78ee84a"},
     {file = "frozenlist-1.7.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:716a9973a2cc963160394f701964fe25012600f3d311f60c790400b00e568b61"},
@@ -1269,6 +1339,7 @@ version = "2025.7.0"
 description = "File-system specification"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "fsspec-2025.7.0-py3-none-any.whl", hash = "sha256:8b012e39f63c7d5f10474de957f3ab793b47b45ae7d39f2fb735f8bbe25c0e21"},
     {file = "fsspec-2025.7.0.tar.gz", hash = "sha256:786120687ffa54b8283d942929540d8bc5ccfa820deb555a2b5d0ed2b737bf58"},
@@ -1299,7 +1370,7 @@ smb = ["smbprotocol"]
 ssh = ["paramiko"]
 test = ["aiohttp (!=4.0.0a0,!=4.0.0a1)", "numpy", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "requests"]
 test-downstream = ["aiobotocore (>=2.5.4,<3.0.0)", "dask[dataframe,test]", "moto[server] (>4,<5)", "pytest-timeout", "xarray"]
-test-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "cloudpickle", "dask", "distributed", "dropbox", "dropboxdrivefs", "fastparquet", "fusepy", "gcsfs", "jinja2", "kerchunk", "libarchive-c", "lz4", "notebook", "numpy", "ocifs", "pandas", "panel", "paramiko", "pyarrow", "pyarrow (>=1)", "pyftpdlib", "pygit2", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "python-snappy", "requests", "smbprotocol", "tqdm", "urllib3", "zarr", "zstandard"]
+test-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "cloudpickle", "dask", "distributed", "dropbox", "dropboxdrivefs", "fastparquet", "fusepy", "gcsfs", "jinja2", "kerchunk", "libarchive-c", "lz4", "notebook", "numpy", "ocifs", "pandas", "panel", "paramiko", "pyarrow", "pyarrow (>=1)", "pyftpdlib", "pygit2", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "python-snappy", "requests", "smbprotocol", "tqdm", "urllib3", "zarr", "zstandard ; python_version < \"3.14\""]
 tqdm = ["tqdm"]

 [[package]]
@@ -1308,6 +1379,7 @@ version = "2.1.0"
 description = "Copy your docs directly to the gh-pages branch."
 optional = false
 python-versions = "*"
+groups = ["dev"]
 files = [
     {file = "ghp-import-2.1.0.tar.gz", hash = "sha256:9c535c4c61193c2df8871222567d7fd7e5014d835f97dc7b7439069e2413d343"},
     {file = "ghp_import-2.1.0-py3-none-any.whl", hash = "sha256:8337dd7b50877f163d4c0289bc1f1c7f127550241988d568c1db512c4324a619"},
@@ -1325,6 +1397,7 @@ version = "4.0.12"
 description = "Git Object Database"
 optional = false
 python-versions = ">=3.7"
+groups = ["dev"]
 files = [
     {file = "gitdb-4.0.12-py3-none-any.whl", hash = "sha256:67073e15955400952c6565cc3e707c554a4eea2e428946f7a4c162fab9bd9bcf"},
     {file = "gitdb-4.0.12.tar.gz", hash = "sha256:5ef71f855d191a3326fcfbc0d5da835f26b13fbcba60c32c21091c349ffdb571"},
@@ -1339,6 +1412,7 @@ version = "3.1.45"
 description = "GitPython is a Python library used to interact with Git repositories"
 optional = false
 python-versions = ">=3.7"
+groups = ["dev"]
 files = [
     {file = "gitpython-3.1.45-py3-none-any.whl", hash = "sha256:8908cb2e02fb3b93b7eb0f2827125cb699869470432cc885f019b8fd0fccff77"},
     {file = "gitpython-3.1.45.tar.gz", hash = "sha256:85b0ee964ceddf211c41b9f27a49086010a190fd8132a24e21f362a4b36a791c"},
@@ -1349,7 +1423,7 @@ gitdb = ">=4.0.1,<5"

 [package.extras]
 doc = ["sphinx (>=7.1.2,<7.2)", "sphinx-autodoc-typehints", "sphinx_rtd_theme"]
-test = ["coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock", "mypy", "pre-commit", "pytest (>=7.3.1)", "pytest-cov", "pytest-instafail", "pytest-mock", "pytest-sugar", "typing-extensions"]
+test = ["coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock ; python_version < \"3.8\"", "mypy", "pre-commit", "pytest (>=7.3.1)", "pytest-cov", "pytest-instafail", "pytest-mock", "pytest-sugar", "typing-extensions ; python_version < \"3.11\""]

 [[package]]
 name = "google-api-core"
@@ -1357,6 +1431,7 @@ version = "2.25.1"
 description = "Google API client core library"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "google_api_core-2.25.1-py3-none-any.whl", hash = "sha256:8a2a56c1fef82987a524371f99f3bd0143702fecc670c72e600c1cda6bf8dbb7"},
     {file = "google_api_core-2.25.1.tar.gz", hash = "sha256:d2aaa0b13c78c61cb3f4282c464c046e45fbd75755683c9c525e6e8f7ed0a5e8"},
@@ -1374,7 +1449,7 @@ requests = ">=2.18.0,<3.0.0"

 [package.extras]
 async-rest = ["google-auth[aiohttp] (>=2.35.0,<3.0.0)"]
-grpc = ["grpcio (>=1.33.2,<2.0.0)", "grpcio (>=1.49.1,<2.0.0)", "grpcio-status (>=1.33.2,<2.0.0)", "grpcio-status (>=1.49.1,<2.0.0)"]
+grpc = ["grpcio (>=1.33.2,<2.0.0)", "grpcio (>=1.49.1,<2.0.0) ; python_version >= \"3.11\"", "grpcio-status (>=1.33.2,<2.0.0)", "grpcio-status (>=1.49.1,<2.0.0) ; python_version >= \"3.11\""]
 grpcgcp = ["grpcio-gcp (>=0.2.2,<1.0.0)"]
 grpcio-gcp = ["grpcio-gcp (>=0.2.2,<1.0.0)"]
@@ -1384,6 +1459,7 @@ version = "2.177.0"
 description = "Google API Client Library for Python"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "google_api_python_client-2.177.0-py3-none-any.whl", hash = "sha256:f2f50f11105ab883eb9b6cf38ec54ea5fd4b429249f76444bec90deba5be79b3"},
     {file = "google_api_python_client-2.177.0.tar.gz", hash = "sha256:9ffd2b57d68f5afa7e6ac64e2c440534eaa056cbb394812a62ff94723c31b50e"},
@@ -1402,6 +1478,7 @@ version = "2.40.3"
 description = "Google Authentication Library"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "google_auth-2.40.3-py2.py3-none-any.whl", hash = "sha256:1370d4593e86213563547f97a92752fc658456fe4514c809544f330fed45a7ca"},
     {file = "google_auth-2.40.3.tar.gz", hash = "sha256:500c3a29adedeb36ea9cf24b8d10858e152f2412e3ca37829b3fa18e33d63b77"},
@@ -1415,11 +1492,11 @@ rsa = ">=3.1.4,<5"
 [package.extras]
 aiohttp = ["aiohttp (>=3.6.2,<4.0.0)", "requests (>=2.20.0,<3.0.0)"]
 enterprise-cert = ["cryptography", "pyopenssl"]
-pyjwt = ["cryptography (<39.0.0)", "cryptography (>=38.0.3)", "pyjwt (>=2.0)"]
-pyopenssl = ["cryptography (<39.0.0)", "cryptography (>=38.0.3)", "pyopenssl (>=20.0.0)"]
+pyjwt = ["cryptography (<39.0.0) ; python_version < \"3.8\"", "cryptography (>=38.0.3)", "pyjwt (>=2.0)"]
+pyopenssl = ["cryptography (<39.0.0) ; python_version < \"3.8\"", "cryptography (>=38.0.3)", "pyopenssl (>=20.0.0)"]
 reauth = ["pyu2f (>=0.1.5)"]
 requests = ["requests (>=2.20.0,<3.0.0)"]
-testing = ["aiohttp (<3.10.0)", "aiohttp (>=3.6.2,<4.0.0)", "aioresponses", "cryptography (<39.0.0)", "cryptography (>=38.0.3)", "flask", "freezegun", "grpcio", "mock", "oauth2client", "packaging", "pyjwt (>=2.0)", "pyopenssl (<24.3.0)", "pyopenssl (>=20.0.0)", "pytest", "pytest-asyncio", "pytest-cov", "pytest-localserver", "pyu2f (>=0.1.5)", "requests (>=2.20.0,<3.0.0)", "responses", "urllib3"]
+testing = ["aiohttp (<3.10.0)", "aiohttp (>=3.6.2,<4.0.0)", "aioresponses", "cryptography (<39.0.0) ; python_version < \"3.8\"", "cryptography (>=38.0.3)", "flask", "freezegun", "grpcio", "mock", "oauth2client", "packaging", "pyjwt (>=2.0)", "pyopenssl (<24.3.0)", "pyopenssl (>=20.0.0)", "pytest", "pytest-asyncio", "pytest-cov", "pytest-localserver", "pyu2f (>=0.1.5)", "requests (>=2.20.0,<3.0.0)", "responses", "urllib3"]
 urllib3 = ["packaging", "urllib3"]

 [[package]]
@@ -1428,6 +1505,7 @@ version = "0.2.0"
 description = "Google Authentication Library: httplib2 transport"
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "google-auth-httplib2-0.2.0.tar.gz", hash = "sha256:38aa7badf48f974f1eb9861794e9c0cb2a0511a4ec0679b1f886d108f5640e05"},
     {file = "google_auth_httplib2-0.2.0-py2.py3-none-any.whl", hash = "sha256:b65a0a2123300dd71281a7bf6e64d65a0759287df52729bdd1ae2e47dc311a3d"},
@@ -1443,6 +1521,7 @@ version = "1.70.0"
 description = "Common protobufs used in Google APIs"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "googleapis_common_protos-1.70.0-py3-none-any.whl", hash = "sha256:b8bfcca8c25a2bb253e0e0b0adaf8c00773e5e6af6fd92397576680b807e0fd8"},
     {file = "googleapis_common_protos-1.70.0.tar.gz", hash = "sha256:0e1b44e0ea153e6594f9f394fef15193a68aaaea2d843f83e2742717ca753257"},
@@ -1460,6 +1539,7 @@ version = "2.12.3"
 description = "Python Client Library for Supabase Auth"
 optional = false
 python-versions = "<4.0,>=3.9"
+groups = ["main"]
 files = [
     {file = "gotrue-2.12.3-py3-none-any.whl", hash = "sha256:b1a3c6a5fe3f92e854a026c4c19de58706a96fd5fbdcc3d620b2802f6a46a26b"},
     {file = "gotrue-2.12.3.tar.gz", hash = "sha256:f874cf9d0b2f0335bfbd0d6e29e3f7aff79998cd1c14d2ad814db8c06cee3852"},
@@ -1476,6 +1556,7 @@ version = "0.16.0"
 description = "A pure-Python, bring-your-own-I/O implementation of HTTP/1.1"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "h11-0.16.0-py3-none-any.whl", hash = "sha256:63cf8bbe7522de3bf65932fda1d9c2772064ffb3dae62d55932da54b31cb6c86"},
     {file = "h11-0.16.0.tar.gz", hash = "sha256:4e35b956cf45792e4caa5885e69fba00bdbc6ffafbfa020300e549b208ee5ff1"},
@@ -1487,6 +1568,7 @@ version = "4.2.0"
 description = "Pure-Python HTTP/2 protocol implementation"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "h2-4.2.0-py3-none-any.whl", hash = "sha256:479a53ad425bb29af087f3458a61d30780bc818e4ebcf01f0b536ba916462ed0"},
     {file = "h2-4.2.0.tar.gz", hash = "sha256:c8a52129695e88b1a0578d8d2cc6842bbd79128ac685463b887ee278126ad01f"},
@@ -1502,6 +1584,8 @@ version = "1.1.5"
 description = "Fast transfer of large files with the Hugging Face Hub."
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
+markers = "platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"arm64\" or platform_machine == \"aarch64\""
 files = [
     {file = "hf_xet-1.1.5-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:f52c2fa3635b8c37c7764d8796dfa72706cc4eded19d638331161e82b0792e23"},
     {file = "hf_xet-1.1.5-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:9fa6e3ee5d61912c4a113e0708eaaef987047616465ac7aa30f7121a48fc1af8"},
@@ -1522,6 +1606,7 @@ version = "4.1.0"
 description = "Pure-Python HPACK header encoding"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "hpack-4.1.0-py3-none-any.whl", hash = "sha256:157ac792668d995c657d93111f46b4535ed114f0c9c8d672271bbec7eae1b496"},
     {file = "hpack-4.1.0.tar.gz", hash = "sha256:ec5eca154f7056aa06f196a557655c5b009b382873ac8d1e66e79e87535f1dca"},
@@ -1533,6 +1618,7 @@ version = "1.0.9"
 description = "A minimal low-level HTTP client."
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "httpcore-1.0.9-py3-none-any.whl", hash = "sha256:2d400746a40668fc9dec9810239072b40b4484b640a8c38fd654a024c7a1bf55"},
     {file = "httpcore-1.0.9.tar.gz", hash = "sha256:6e34463af53fd2ab5d807f399a9b45ea31c3dfa2276f15a2c3f00afff6e176e8"},
@@ -1554,6 +1640,7 @@ version = "0.22.0"
 description = "A comprehensive HTTP client library."
 optional = false
 python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
+groups = ["main"]
 files = [
     {file = "httplib2-0.22.0-py3-none-any.whl", hash = "sha256:14ae0a53c1ba8f3d37e9e27cf37eabb0fb9980f435ba405d546948b009dd64dc"},
     {file = "httplib2-0.22.0.tar.gz", hash = "sha256:d7a10bc5ef5ab08322488bde8c726eeee5c8618723fdb399597ec58f3d82df81"},
@@ -1568,6 +1655,7 @@ version = "0.28.1"
 description = "The next generation HTTP client."
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "httpx-0.28.1-py3-none-any.whl", hash = "sha256:d909fcccc110f8c7faf814ca82a9a4d816bc5a6dbfea25d6591d6985b8ba59ad"},
     {file = "httpx-0.28.1.tar.gz", hash = "sha256:75e98c5f16b0f35b567856f597f06ff2270a374470a5c2392242528e3e3e42fc"},
@@ -1581,7 +1669,7 @@ httpcore = "==1.*"
 idna = "*"

 [package.extras]
-brotli = ["brotli", "brotlicffi"]
+brotli = ["brotli ; platform_python_implementation == \"CPython\"", "brotlicffi ; platform_python_implementation != \"CPython\""]
 cli = ["click (==8.*)", "pygments (==2.*)", "rich (>=10,<14)"]
 http2 = ["h2 (>=3,<5)"]
 socks = ["socksio (==1.*)"]
@@ -1593,6 +1681,7 @@ version = "0.4.1"
 description = "Consume Server-Sent Event (SSE) messages with HTTPX."
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "httpx_sse-0.4.1-py3-none-any.whl", hash = "sha256:cba42174344c3a5b06f255ce65b350880f962d99ead85e776f23c6618a377a37"},
     {file = "httpx_sse-0.4.1.tar.gz", hash = "sha256:8f44d34414bc7b21bf3602713005c5df4917884f76072479b21f68befa4ea26e"},
@@ -1604,6 +1693,7 @@ version = "0.34.3"
 description = "Client library to download and publish models, datasets and other repos on the huggingface.co hub"
 optional = false
 python-versions = ">=3.8.0"
+groups = ["main"]
 files = [
     {file = "huggingface_hub-0.34.3-py3-none-any.whl", hash = "sha256:5444550099e2d86e68b2898b09e85878fbd788fc2957b506c6a79ce060e39492"},
     {file = "huggingface_hub-0.34.3.tar.gz", hash = "sha256:d58130fd5aa7408480681475491c0abd7e835442082fbc3ef4d45b6c39f83853"},
@@ -1620,16 +1710,16 @@ tqdm = ">=4.42.1"
 typing-extensions = ">=3.7.4.3"

 [package.extras]
-all = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "libcst (>=1.4.0)", "mypy (==1.15.0)", "mypy (>=1.14.1,<1.15.0)", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.9.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
+all = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "libcst (>=1.4.0)", "mypy (==1.15.0) ; python_version >= \"3.9\"", "mypy (>=1.14.1,<1.15.0) ; python_version == \"3.8\"", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.9.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
 cli = ["InquirerPy (==0.3.4)"]
-dev = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "libcst (>=1.4.0)", "mypy (==1.15.0)", "mypy (>=1.14.1,<1.15.0)", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.9.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
+dev = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "libcst (>=1.4.0)", "mypy (==1.15.0) ; python_version >= \"3.9\"", "mypy (>=1.14.1,<1.15.0) ; python_version == \"3.8\"", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.9.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
 fastai = ["fastai (>=2.4)", "fastcore (>=1.3.27)", "toml"]
 hf-transfer = ["hf-transfer (>=0.1.4)"]
 hf-xet = ["hf-xet (>=1.1.2,<2.0.0)"]
 inference = ["aiohttp"]
 mcp = ["aiohttp", "mcp (>=1.8.0)", "typer"]
 oauth = ["authlib (>=1.3.2)", "fastapi", "httpx", "itsdangerous"]
-quality = ["libcst (>=1.4.0)", "mypy (==1.15.0)", "mypy (>=1.14.1,<1.15.0)", "ruff (>=0.9.0)"]
+quality = ["libcst (>=1.4.0)", "mypy (==1.15.0) ; python_version >= \"3.9\"", "mypy (>=1.14.1,<1.15.0) ; python_version == \"3.8\"", "ruff (>=0.9.0)"]
 tensorflow = ["graphviz", "pydot", "tensorflow"]
 tensorflow-testing = ["keras (<3.0)", "tensorflow"]
 testing = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "soundfile", "urllib3 (<2.0)"]
@@ -1642,6 +1732,7 @@ version = "4.12.3"
 description = "Python humanize utilities"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "humanize-4.12.3-py3-none-any.whl", hash = "sha256:2cbf6370af06568fa6d2da77c86edb7886f3160ecd19ee1ffef07979efc597f6"},
     {file = "humanize-4.12.3.tar.gz", hash = "sha256:8430be3a615106fdfceb0b2c1b41c4c98c6b0fc5cc59663a5539b111dd325fb0"},
@@ -1656,6 +1747,7 @@ version = "6.1.0"
 description = "Pure-Python HTTP/2 framing"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "hyperframe-6.1.0-py3-none-any.whl", hash = "sha256:b03380493a519fce58ea5af42e4a42317bf9bd425596f7a0835ffce80f1a42e5"},
     {file = "hyperframe-6.1.0.tar.gz", hash = "sha256:f630908a00854a7adeabd6382b43923a4c4cd4b821fcb527e6ab9e15382a3b08"},
@@ -1667,6 +1759,7 @@ version = "2.6.12"
 description = "File identification library for Python"
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "identify-2.6.12-py2.py3-none-any.whl", hash = "sha256:ad9672d5a72e0d2ff7c5c8809b62dfa60458626352fb0eb7b55e69bdc45334a2"},
     {file = "identify-2.6.12.tar.gz", hash = "sha256:d8de45749f1efb108badef65ee8386f0f7bb19a7f26185f74de6367bffbaf0e6"},
@@ -1681,6 +1774,7 @@ version = "3.10"
 description = "Internationalized Domain Names in Applications (IDNA)"
 optional = false
 python-versions = ">=3.6"
+groups = ["main", "dev"]
 files = [
     {file = "idna-3.10-py3-none-any.whl", hash = "sha256:946d195a0d259cbba61165e88e65941f16e9b36ea6ddb97f00452bae8b1287d3"},
     {file = "idna-3.10.tar.gz", hash = "sha256:12f65c9b470abda6dc35cf8e63cc574b1c52b11df2c86030af0ac09b01b13ea9"},
@@ -1695,6 +1789,7 @@ version = "8.7.0"
 description = "Read metadata from Python packages"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "importlib_metadata-8.7.0-py3-none-any.whl", hash = "sha256:e5dd1551894c77868a30651cef00984d50e1002d06942a7101d34870c5f02afd"},
     {file = "importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000"},
@@ -1704,12 +1799,12 @@ files = [
 zipp = ">=3.20"

 [package.extras]
-check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1)"]
+check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\""]
 cover = ["pytest-cov"]
 doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
 enabler = ["pytest-enabler (>=2.2)"]
 perf = ["ipython"]
-test = ["flufl.flake8", "importlib_resources (>=1.3)", "jaraco.test (>=5.4)", "packaging", "pyfakefs", "pytest (>=6,!=8.1.*)", "pytest-perf (>=0.9.2)"]
+test = ["flufl.flake8", "importlib_resources (>=1.3) ; python_version < \"3.9\"", "jaraco.test (>=5.4)", "packaging", "pyfakefs", "pytest (>=6,!=8.1.*)", "pytest-perf (>=0.9.2)"]
 type = ["pytest-mypy"]

 [[package]]
@@ -1718,6 +1813,7 @@ version = "2.1.0"
 description = "brain-dead simple config-ini parsing"
 optional = false
 python-versions = ">=3.8"
+groups = ["main", "dev"]
 files = [
     {file = "iniconfig-2.1.0-py3-none-any.whl", hash = "sha256:9deba5723312380e77435581c6bf4935c94cbfab9b1ed33ef8d238ea168eb760"},
     {file = "iniconfig-2.1.0.tar.gz", hash = "sha256:3abbd2e30b36733fee78f9c7f7308f2d0050e88f0087fd25c2645f63c773e1c7"},
@@ -1729,6 +1825,7 @@ version = "0.7.2"
 description = "An ISO 8601 date/time/duration parser and formatter"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15"},
     {file = "isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6"},
@@ -1740,6 +1837,7 @@ version = "3.1.6"
 description = "A very fast and expressive template engine."
 optional = false
 python-versions = ">=3.7"
+groups = ["main", "dev"]
 files = [
     {file = "jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67"},
     {file = "jinja2-3.1.6.tar.gz", hash = "sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d"},
@@ -1757,6 +1855,7 @@ version = "0.10.0"
 description = "Fast iterable JSON parser."
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "jiter-0.10.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:cd2fb72b02478f06a900a5782de2ef47e0396b3e1f7d5aba30daeb1fce66f303"},
     {file = "jiter-0.10.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:32bb468e3af278f095d3fa5b90314728a6916d89ba3d0ffb726dd9bf7367285e"},
@@ -1843,6 +1942,7 @@ version = "1.0.1"
 description = "JSON Matching Expressions"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "jmespath-1.0.1-py3-none-any.whl", hash = "sha256:02e2e4cc71b5bcab88332eebf907519190dd9e6e82107fa7f83b1003a6252980"},
     {file = "jmespath-1.0.1.tar.gz", hash = "sha256:90261b206d6defd58fdd5e85f478bf633a2901798906be2ad389150c5c60edbe"},
@@ -1854,6 +1954,7 @@ version = "4.25.0"
 description = "An implementation of JSON Schema validation for Python"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "jsonschema-4.25.0-py3-none-any.whl", hash = "sha256:24c2e8da302de79c8b9382fee3e76b355e44d2a4364bb207159ce10b517bd716"},
     {file = "jsonschema-4.25.0.tar.gz", hash = "sha256:e63acf5c11762c0e6672ffb61482bdf57f0876684d8d249c0fe2d730d48bc55f"},
@@ -1875,6 +1976,7 @@ version = "2025.4.1"
 description = "The JSON Schema meta-schemas and vocabularies, exposed as a Registry"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "jsonschema_specifications-2025.4.1-py3-none-any.whl", hash = "sha256:4653bffbd6584f7de83a67e0d620ef16900b390ddc7939d56684d6c81e33f1af"},
     {file = "jsonschema_specifications-2025.4.1.tar.gz", hash = "sha256:630159c9f4dbea161a6a2205c3011cc4f18ff381b189fff48bb39b9bf26ae608"},
@@ -1889,6 +1991,7 @@ version = "32.0.1"
 description = "Kubernetes python client"
 optional = false
 python-versions = ">=3.6"
+groups = ["main"]
 files = [
     {file = "kubernetes-32.0.1-py2.py3-none-any.whl", hash = "sha256:35282ab8493b938b08ab5526c7ce66588232df00ef5e1dbe88a419107dc10998"},
     {file = "kubernetes-32.0.1.tar.gz", hash = "sha256:42f43d49abd437ada79a79a16bd48a604d3471a117a8347e87db693f2ba0ba28"},
@@ -1916,6 +2019,7 @@ version = "1.74.7"
 description = "Library to easily interface with LLM API providers"
 optional = false
 python-versions = "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,>=3.8"
+groups = ["main"]
 files = [
     {file = "litellm-1.74.7-py3-none-any.whl", hash = "sha256:d630785faf07813cf0d5e9fb0bb84aaa18aa728297858c58c56f34c0b9190df1"},
     {file = "litellm-1.74.7.tar.gz", hash = "sha256:53b809a342154d8543ea96422cf962cd5ea9df293f83dab0cc63b27baadf0ece"},
@@ -1936,8 +2040,8 @@ tokenizers = "*"

 [package.extras]
 caching = ["diskcache (>=5.6.1,<6.0.0)"]
-extra-proxy = ["azure-identity (>=1.15.0,<2.0.0)", "azure-keyvault-secrets (>=4.8.0,<5.0.0)", "google-cloud-kms (>=2.21.3,<3.0.0)", "prisma (==0.11.0)", "redisvl (>=0.4.1,<0.5.0)", "resend (>=0.8.0,<0.9.0)"]
-proxy = ["PyJWT (>=2.8.0,<3.0.0)", "apscheduler (>=3.10.4,<4.0.0)", "azure-identity (>=1.15.0,<2.0.0)", "azure-storage-blob (>=12.25.1,<13.0.0)", "backoff", "boto3 (==1.34.34)", "cryptography (>=43.0.1,<44.0.0)", "fastapi (>=0.115.5,<0.116.0)", "fastapi-sso (>=0.16.0,<0.17.0)", "gunicorn (>=23.0.0,<24.0.0)", "litellm-enterprise (==0.1.15)", "litellm-proxy-extras (==0.2.11)", "mcp (==1.10.0)", "orjson (>=3.9.7,<4.0.0)", "pynacl (>=1.5.0,<2.0.0)", "python-multipart (>=0.0.18,<0.0.19)", "pyyaml (>=6.0.1,<7.0.0)", "rich (==13.7.1)", "rq", "uvicorn (>=0.29.0,<0.30.0)", "uvloop (>=0.21.0,<0.22.0)", "websockets (>=13.1.0,<14.0.0)"]
+extra-proxy = ["azure-identity (>=1.15.0,<2.0.0)", "azure-keyvault-secrets (>=4.8.0,<5.0.0)", "google-cloud-kms (>=2.21.3,<3.0.0)", "prisma (==0.11.0)", "redisvl (>=0.4.1,<0.5.0) ; python_version >= \"3.9\" and python_version < \"3.14\"", "resend (>=0.8.0,<0.9.0)"]
+proxy = ["PyJWT (>=2.8.0,<3.0.0)", "apscheduler (>=3.10.4,<4.0.0)", "azure-identity (>=1.15.0,<2.0.0)", "azure-storage-blob (>=12.25.1,<13.0.0)", "backoff", "boto3 (==1.34.34)", "cryptography (>=43.0.1,<44.0.0)", "fastapi (>=0.115.5,<0.116.0)", "fastapi-sso (>=0.16.0,<0.17.0)", "gunicorn (>=23.0.0,<24.0.0)", "litellm-enterprise (==0.1.15)", "litellm-proxy-extras (==0.2.11)", "mcp (==1.10.0) ; python_version >= \"3.10\"", "orjson (>=3.9.7,<4.0.0)", "pynacl (>=1.5.0,<2.0.0)", "python-multipart (>=0.0.18,<0.0.19)", "pyyaml (>=6.0.1,<7.0.0)", "rich (==13.7.1)", "rq", "uvicorn (>=0.29.0,<0.30.0)", "uvloop (>=0.21.0,<0.22.0) ; sys_platform != \"win32\"", "websockets (>=13.1.0,<14.0.0)"]
 utils = ["numpydoc"]

 [[package]]
@@ -1946,6 +2050,7 @@ version = "3.8.2"
 description = "Python implementation of John Gruber's Markdown."
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "markdown-3.8.2-py3-none-any.whl", hash = "sha256:5c83764dbd4e00bdd94d85a19b8d55ccca20fe35b2e678a1422b380324dd5f24"},
     {file = "markdown-3.8.2.tar.gz", hash = "sha256:247b9a70dd12e27f67431ce62523e675b866d254f900c4fe75ce3dda62237c45"},
@@ -1961,6 +2066,7 @@ version = "3.0.0"
 description = "Python port of markdown-it. Markdown parsing, done right!"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "markdown-it-py-3.0.0.tar.gz", hash = "sha256:e3f60a94fa066dc52ec76661e37c851cb232d92f9886b15cb560aaada2df8feb"},
     {file = "markdown_it_py-3.0.0-py3-none-any.whl", hash = "sha256:355216845c60bd96232cd8d8c40e8f9765cc86f46880e43a8fd22dc1a1a8cab1"},
@@ -1985,6 +2091,7 @@ version = "1.1.0"
 description = "Convert HTML to markdown."
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "markdownify-1.1.0-py3-none-any.whl", hash = "sha256:32a5a08e9af02c8a6528942224c91b933b4bd2c7d078f9012943776fc313eeef"},
     {file = "markdownify-1.1.0.tar.gz", hash = "sha256:449c0bbbf1401c5112379619524f33b63490a8fa479456d41de9dc9e37560ebd"},
@@ -2000,6 +2107,7 @@ version = "3.0.2"
 description = "Safely add untrusted strings to HTML/XML markup."
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "MarkupSafe-3.0.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:7e94c425039cde14257288fd61dcfb01963e658efbc0ff54f5306b06054700f8"},
     {file = "MarkupSafe-3.0.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:9e2d922824181480953426608b81967de705c3cef4d1af983af849d7bd619158"},
@@ -2070,6 +2178,7 @@ version = "1.12.2"
 description = "Model Context Protocol SDK"
 optional = false
 python-versions = ">=3.10"
+groups = ["main"]
 files = [
     {file = "mcp-1.12.2-py3-none-any.whl", hash = "sha256:b86d584bb60193a42bd78aef01882c5c42d614e416cbf0480149839377ab5a5f"},
     {file = "mcp-1.12.2.tar.gz", hash = "sha256:a4b7c742c50ce6ed6d6a6c096cca0e3893f5aecc89a59ed06d47c4e6ba41edcc"},
@@ -2099,6 +2208,7 @@ version = "0.1.2"
 description = "Markdown URL utilities"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "mdurl-0.1.2-py3-none-any.whl", hash = "sha256:84008a41e51615a49fc9966191ff91509e3c40b939176e643fd50a5c2196b8f8"},
     {file = "mdurl-0.1.2.tar.gz", hash = "sha256:bb413d29f5eea38f31dd4754dd7377d4465116fb207585f97bf925588687c1ba"},
@@ -2110,6 +2220,7 @@ version = "1.3.4"
 description = "A deep merge function for 🐍."
 optional = false
 python-versions = ">=3.6"
+groups = ["dev"]
 files = [
     {file = "mergedeep-1.3.4-py3-none-any.whl", hash = "sha256:70775750742b25c0d8f36c55aed03d24c3384d17c951b3175d898bd778ef0307"},
     {file = "mergedeep-1.3.4.tar.gz", hash = "sha256:0096d52e9dad9939c3d975a774666af186eda617e6ca84df4c94dec30004f2a8"},
@@ -2121,6 +2232,7 @@ version = "1.6.1"
 description = "Project documentation with Markdown."
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "mkdocs-1.6.1-py3-none-any.whl", hash = "sha256:db91759624d1647f3f34aa0c3f327dd2601beae39a366d6e064c03468d35c20e"},
     {file = "mkdocs-1.6.1.tar.gz", hash = "sha256:7b432f01d928c084353ab39c57282f29f92136665bdd6abf7c1ec8d822ef86f2"},
@@ -2143,7 +2255,7 @@ watchdog = ">=2.0"

 [package.extras]
 i18n = ["babel (>=2.9.0)"]
-min-versions = ["babel (==2.9.0)", "click (==7.0)", "colorama (==0.4)", "ghp-import (==1.0)", "importlib-metadata (==4.4)", "jinja2 (==2.11.1)", "markdown (==3.3.6)", "markupsafe (==2.0.1)", "mergedeep (==1.3.4)", "mkdocs-get-deps (==0.2.0)", "packaging (==20.5)", "pathspec (==0.11.1)", "pyyaml (==5.1)", "pyyaml-env-tag (==0.1)", "watchdog (==2.0)"]
+min-versions = ["babel (==2.9.0)", "click (==7.0)", "colorama (==0.4) ; platform_system == \"Windows\"", "ghp-import (==1.0)", "importlib-metadata (==4.4) ; python_version < \"3.10\"", "jinja2 (==2.11.1)", "markdown (==3.3.6)", "markupsafe (==2.0.1)", "mergedeep (==1.3.4)", "mkdocs-get-deps (==0.2.0)", "packaging (==20.5)", "pathspec (==0.11.1)", "pyyaml (==5.1)", "pyyaml-env-tag (==0.1)", "watchdog (==2.0)"]

 [[package]]
 name = "mkdocs-get-deps"
@@ -2151,6 +2263,7 @@ version = "0.2.0"
 description = "MkDocs extension that lists all dependencies according to a mkdocs.yml file"
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "mkdocs_get_deps-0.2.0-py3-none-any.whl", hash = "sha256:2bf11d0b133e77a0dd036abeeb06dec8775e46efa526dc70667d8863eefc6134"},
     {file = "mkdocs_get_deps-0.2.0.tar.gz", hash = "sha256:162b3d129c7fad9b19abfdcb9c1458a651628e4b1dea628ac68790fb3061c60c"},
@@ -2167,6 +2280,7 @@ version = "0.4.0"
 description = "MkDocs plugin supports image lightbox with GLightbox."
 optional = false
 python-versions = "*"
+groups = ["dev"]
 files = [
     {file = "mkdocs-glightbox-0.4.0.tar.gz", hash = "sha256:392b34207bf95991071a16d5f8916d1d2f2cd5d5bb59ae2997485ccd778c70d9"},
     {file = "mkdocs_glightbox-0.4.0-py3-none-any.whl", hash = "sha256:e0107beee75d3eb7380ac06ea2d6eac94c999eaa49f8c3cbab0e7be2ac006ccf"},
@@ -2178,6 +2292,7 @@ version = "9.6.16"
 description = "Documentation that simply works"
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "mkdocs_material-9.6.16-py3-none-any.whl", hash = "sha256:8d1a1282b892fe1fdf77bfeb08c485ba3909dd743c9ba69a19a40f637c6ec18c"},
     {file = "mkdocs_material-9.6.16.tar.gz", hash = "sha256:d07011df4a5c02ee0877496d9f1bfc986cfb93d964799b032dd99fe34c0e9d19"},
@@ -2207,6 +2322,7 @@ version = "1.3.1"
 description = "Extension pack for Python Markdown and MkDocs Material."
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "mkdocs_material_extensions-1.3.1-py3-none-any.whl", hash = "sha256:adff8b62700b25cb77b53358dad940f3ef973dd6db797907c49e3c2ef3ab4e31"},
     {file = "mkdocs_material_extensions-1.3.1.tar.gz", hash = "sha256:10c9511cea88f568257f960358a467d12b970e1f7b2c0e5fb2bb48cab1928443"},
@@ -2218,6 +2334,7 @@ version = "1.33.0"
 description = "The Microsoft Authentication Library (MSAL) for Python library enables your app to access the Microsoft Cloud by supporting authentication of users with Microsoft Azure Active Directory accounts (AAD) and Microsoft Accounts (MSA) using industry standard OAuth2 and OpenID Connect."
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "msal-1.33.0-py3-none-any.whl", hash = "sha256:c0cd41cecf8eaed733ee7e3be9e040291eba53b0f262d3ae9c58f38b04244273"},
     {file = "msal-1.33.0.tar.gz", hash = "sha256:836ad80faa3e25a7d71015c990ce61f704a87328b1e73bcbb0623a18cbf17510"},
@@ -2229,7 +2346,7 @@ PyJWT = {version = ">=1.0.0,<3", extras = ["crypto"]}
 requests = ">=2.0.0,<3"

 [package.extras]
-broker = ["pymsalruntime (>=0.14,<0.19)", "pymsalruntime (>=0.17,<0.19)", "pymsalruntime (>=0.18,<0.19)"]
+broker = ["pymsalruntime (>=0.14,<0.19) ; python_version >= \"3.6\" and platform_system == \"Windows\"", "pymsalruntime (>=0.17,<0.19) ; python_version >= \"3.8\" and platform_system == \"Darwin\"", "pymsalruntime (>=0.18,<0.19) ; python_version >= \"3.8\" and platform_system == \"Linux\""]

 [[package]]
 name = "msal-extensions"
@@ -2237,6 +2354,7 @@ version = "1.3.1"
 description = "Microsoft Authentication Library extensions (MSAL EX) provides a persistence API that can save your data on disk, encrypted on Windows, macOS and Linux. Concurrent data access will be coordinated by a file lock mechanism."
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "msal_extensions-1.3.1-py3-none-any.whl", hash = "sha256:96d3de4d034504e969ac5e85bae8106c8373b5c6568e4c8fa7af2eca9dbe6bca"},
     {file = "msal_extensions-1.3.1.tar.gz", hash = "sha256:c5b0fd10f65ef62b5f1d62f4251d51cbcaf003fcedae8c91b040a488614be1a4"},
@@ -2254,6 +2372,7 @@ version = "0.7.1"
 description = "AutoRest swagger generator Python client runtime."
 optional = false
 python-versions = ">=3.6"
+groups = ["main"]
 files = [
     {file = "msrest-0.7.1-py3-none-any.whl", hash = "sha256:21120a810e1233e5e6cc7fe40b474eeb4ec6f757a15d7cf86702c369f9567c32"},
     {file = "msrest-0.7.1.zip", hash = "sha256:6e7661f46f3afd88b75667b7187a92829924446c7ea1d169be8c4bb7eeb788b9"},
@@ -2267,7 +2386,7 @@ requests = ">=2.16,<3.0"
 requests-oauthlib = ">=0.5.0"

 [package.extras]
-async = ["aiodns", "aiohttp (>=3.0)"]
+async = ["aiodns ; python_version >= \"3.5\"", "aiohttp (>=3.0) ; python_version >= \"3.5\""]

 [[package]]
 name = "multidict"
@@ -2275,6 +2394,7 @@ version = "6.6.3"
 description = "multidict implementation"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "multidict-6.6.3-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:a2be5b7b35271f7fff1397204ba6708365e3d773579fe2a30625e16c4b4ce817"},
     {file = "multidict-6.6.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:12f4581d2930840295c461764b9a65732ec01250b46c6b2c510d7ee68872b140"},
@@ -2397,6 +2517,7 @@ version = "1.17.1"
 description = "Optional static typing for Python"
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "mypy-1.17.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:3fbe6d5555bf608c47203baa3e72dbc6ec9965b3d7c318aa9a4ca76f465bd972"},
     {file = "mypy-1.17.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:80ef5c058b7bce08c83cac668158cb7edea692e458d21098c7d3bce35a5d43e7"},
@@ -2457,6 +2578,7 @@ version = "1.1.0"
 description = "Type system extensions for programs checked with the mypy type checker."
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "mypy_extensions-1.1.0-py3-none-any.whl", hash = "sha256:1be4cccdb0f2482337c4743e60421de3a356cd97508abadd57d47403e94f5505"},
     {file = "mypy_extensions-1.1.0.tar.gz", hash = "sha256:52e68efc3284861e772bbcd66823fde5ae21fd2fdb51c62a211403730b916558"},
@@ -2468,6 +2590,7 @@ version = "1.9.1"
 description = "Node.js virtual environment builder"
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,>=2.7"
+groups = ["dev"]
 files = [
     {file = "nodeenv-1.9.1-py2.py3-none-any.whl", hash = "sha256:ba11c9782d29c27c70ffbdda2d7415098754709be8a7056d79a737cd901155c9"},
     {file = "nodeenv-1.9.1.tar.gz", hash = "sha256:6ec12890a2dab7946721edbfbcd91f3319c6ccc9aec47be7c7e6b7011ee6645f"},
@@ -2479,6 +2602,7 @@ version = "3.3.1"
 description = "A generic, spec-compliant, thorough implementation of the OAuth request-signing logic"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "oauthlib-3.3.1-py3-none-any.whl", hash = "sha256:88119c938d2b8fb88561af5f6ee0eec8cc8d552b7bb1f712743136eb7523b7a1"},
     {file = "oauthlib-3.3.1.tar.gz", hash = "sha256:0f0f8aa759826a193cf66c12ea1af1637f87b9b4622d46e866952bb022e538c9"},
@@ -2495,6 +2619,7 @@ version = "1.98.0"
 description = "The official Python library for the openai API"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "openai-1.98.0-py3-none-any.whl", hash = "sha256:b99b794ef92196829120e2df37647722104772d2a74d08305df9ced5f26eae34"},
     {file = "openai-1.98.0.tar.gz", hash = "sha256:3ee0fcc50ae95267fd22bd1ad095ba5402098f3df2162592e68109999f685427"},
@@ -2522,6 +2647,7 @@ version = "2.8.0"
 description = "Python client for OpenSearch"
 optional = false
 python-versions = "<4,>=3.8"
+groups = ["main"]
 files = [
     {file = "opensearch_py-2.8.0-py3-none-any.whl", hash = "sha256:52c60fdb5d4dcf6cce3ee746c13b194529b0161e0f41268b98ab8f1624abe2fa"},
     {file = "opensearch_py-2.8.0.tar.gz", hash = "sha256:6598df0bc7a003294edd0ba88a331e0793acbb8c910c43edf398791e3b2eccda"},
@@ -2546,6 +2672,7 @@ version = "25.0"
 description = "Core utilities for Python packages"
 optional = false
 python-versions = ">=3.8"
+groups = ["main", "dev"]
 files = [
     {file = "packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484"},
     {file = "packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f"},
@@ -2557,6 +2684,7 @@ version = "0.5.7"
 description = "Divides large result sets into pages for easier browsing"
 optional = false
 python-versions = "*"
+groups = ["dev"]
 files = [
     {file = "paginate-0.5.7-py2.py3-none-any.whl", hash = "sha256:b885e2af73abcf01d9559fd5216b57ef722f8c42affbb63942377668e35c7591"},
     {file = "paginate-0.5.7.tar.gz", hash = "sha256:22bd083ab41e1a8b4f3690544afb2c60c25e5c9a63a30fa2f483f6c60c8e5945"},
@@ -2572,6 +2700,7 @@ version = "0.12.1"
 description = "Utility library for gitignore style pattern matching of file paths."
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "pathspec-0.12.1-py3-none-any.whl", hash = "sha256:a0d503e138a4c123b27490a4f7beda6a01c6f288df0e4a8b79c7eb0dc7b4cc08"},
     {file = "pathspec-0.12.1.tar.gz", hash = "sha256:a482d51503a1ab33b1c67a6c3813a26953dbdc71c31dacaef9a838c4e29f5712"},
@@ -2583,6 +2712,7 @@ version = "4.3.8"
 description = "A small Python package for determining appropriate platform-specific dirs, e.g. a `user data dir`."
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "platformdirs-4.3.8-py3-none-any.whl", hash = "sha256:ff7059bb7eb1179e2685604f4aaf157cfd9535242bd23742eadc3c13542139b4"},
     {file = "platformdirs-4.3.8.tar.gz", hash = "sha256:3d512d96e16bcb959a814c9f348431070822a6496326a4be0911c40b5a74c2bc"},
@@ -2599,6 +2729,7 @@ version = "1.6.0"
 description = "plugin and hook calling mechanisms for python"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746"},
     {file = "pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3"},
@@ -2614,6 +2745,7 @@ version = "0.9.0"
 description = "A fast C-implemented library for Levenshtein distance"
 optional = false
 python-versions = ">=3.8"
+groups = ["dev"]
 files = [
     {file = "polyleven-0.9.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6e00207fbe0fcdde206b9b277cf14bb9db8801f8d303204b1572870797399974"},
     {file = "polyleven-0.9.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d400f255af038f77b37d5010532e0e82d07160457c8282e5b40632987ab815be"},
@@ -2678,6 +2810,7 @@ version = "1.1.1"
 description = "PostgREST client for Python. This library provides an ORM interface to PostgREST."
 optional = false
 python-versions = "<4.0,>=3.9"
+groups = ["main"]
 files = [
     {file = "postgrest-1.1.1-py3-none-any.whl", hash = "sha256:98a6035ee1d14288484bfe36235942c5fb2d26af6d8120dfe3efbe007859251a"},
     {file = "postgrest-1.1.1.tar.gz", hash = "sha256:f3bb3e8c4602775c75c844a31f565f5f3dd584df4d36d683f0b67d01a86be322"},
@@ -2695,6 +2828,7 @@ version = "4.2.0"
 description = "A framework for managing and maintaining multi-language pre-commit hooks."
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "pre_commit-4.2.0-py2.py3-none-any.whl", hash = "sha256:a009ca7205f1eb497d10b845e52c838a98b6cdd2102a6c8e4540e94ee75c58bd"},
     {file = "pre_commit-4.2.0.tar.gz", hash = "sha256:601283b9757afd87d40c4c4a9b2b5de9637a8ea02eaff7adc2d0fb4e04841146"},
@@ -2713,6 +2847,7 @@ version = "3.0.51"
 description = "Library for building powerful interactive command lines in Python"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "prompt_toolkit-3.0.51-py3-none-any.whl", hash = "sha256:52742911fde84e2d423e2f9a4cf1de7d7ac4e51958f648d9540e0fb8db077b07"},
     {file = "prompt_toolkit-3.0.51.tar.gz", hash = "sha256:931a162e3b27fc90c86f1b48bb1fb2c528c2761475e57c9c06de13311c7b54ed"},
@@ -2727,6 +2862,7 @@ version = "0.3.2"
 description = "Accelerated property cache"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "propcache-0.3.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:22d9962a358aedbb7a2e36187ff273adeaab9743373a272976d2e348d08c7770"},
     {file = "propcache-0.3.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:0d0fda578d1dc3f77b6b5a5dce3b9ad69a8250a891760a548df850a5e8da87f3"},
@@ -2834,6 +2970,7 @@ version = "1.26.1"
 description = "Beautiful, Pythonic protocol buffers"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "proto_plus-1.26.1-py3-none-any.whl", hash = "sha256:13285478c2dcf2abb829db158e1047e2f1e8d63a077d94263c2b88b043c75a66"},
     {file = "proto_plus-1.26.1.tar.gz", hash = "sha256:21a515a4c4c0088a773899e23c7bbade3d18f9c66c73edd4c7ee3816bc96a012"},
@@ -2851,6 +2988,7 @@ version = "6.31.1"
 description = ""
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "protobuf-6.31.1-cp310-abi3-win32.whl", hash = "sha256:7fa17d5a29c2e04b7d90e5e32388b8bfd0e7107cd8e616feef7ed3fa6bdab5c9"},
     {file = "protobuf-6.31.1-cp310-abi3-win_amd64.whl", hash = "sha256:426f59d2964864a1a366254fa703b8632dcec0790d8862d30034d8245e1cd447"},
@@ -2869,6 +3007,7 @@ version = "0.6.1"
 description = "Pure-Python implementation of ASN.1 types and DER/BER/CER codecs (X.208)"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "pyasn1-0.6.1-py3-none-any.whl", hash = "sha256:0d632f46f2ba09143da3a8afe9e33fb6f92fa2320ab7e886e2d0f7672af84629"},
     {file = "pyasn1-0.6.1.tar.gz", hash = "sha256:6f580d2bdd84365380830acf45550f2511469f673cb4a5ae3857a3170128b034"},
@@ -2880,6 +3019,7 @@ version = "0.4.2"
 description = "A collection of ASN.1-based protocols modules"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "pyasn1_modules-0.4.2-py3-none-any.whl", hash = "sha256:29253a9207ce32b64c3ac6600edc75368f98473906e8fd1043bd6b5b1de2c14a"},
     {file = "pyasn1_modules-0.4.2.tar.gz", hash = "sha256:677091de870a80aae844b1ca6134f54652fa2c8c5a52aa396440ac3106e941e6"},
@@ -2894,6 +3034,8 @@ version = "2.22"
 description = "C parser in Python"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
+markers = "platform_python_implementation != \"PyPy\""
 files = [
     {file = "pycparser-2.22-py3-none-any.whl", hash = "sha256:c3702b6d3dd8c7abc1afa565d7e63d53a1d0bd86cdc24edd75470f4de499cfcc"},
     {file = "pycparser-2.22.tar.gz", hash = "sha256:491c8be9c040f5390f5bf44a5b07752bd07f56edf992381b05c701439eec10f6"},
@@ -2905,6 +3047,7 @@ version = "2.11.7"
 description = "Data validation using Python type hints"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "pydantic-2.11.7-py3-none-any.whl", hash = "sha256:dde5df002701f6de26248661f6835bbe296a47bf73990135c7d07ce741b9623b"},
     {file = "pydantic-2.11.7.tar.gz", hash = "sha256:d989c3c6cb79469287b1569f7447a17848c998458d49ebe294e975b9baf0f0db"},
@@ -2918,7 +3061,7 @@ typing-inspection = ">=0.4.0"

 [package.extras]
 email = ["email-validator (>=2.0.0)"]
-timezone = ["tzdata"]
+timezone = ["tzdata ; python_version >= \"3.9\" and platform_system == \"Windows\""]

 [[package]]
 name = "pydantic-core"
@@ -2926,6 +3069,7 @@ version = "2.33.2"
 description = "Core functionality for Pydantic validation and serialization"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "pydantic_core-2.33.2-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:2b3d326aaef0c0399d9afffeb6367d5e26ddc24d351dbc9c636840ac355dc5d8"},
     {file = "pydantic_core-2.33.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:0e5b2671f05ba48b94cb90ce55d8bdcaaedb8ba00cc5359f6810fc918713983d"},
@@ -3037,6 +3181,7 @@ version = "2.10.1"
 description = "Settings management using Pydantic"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "pydantic_settings-2.10.1-py3-none-any.whl", hash = "sha256:a60952460b99cf661dc25c29c0ef171721f98bfcb52ef8d9ea4c943d7c8cc796"},
     {file = "pydantic_settings-2.10.1.tar.gz", hash = "sha256:06f0062169818d0f5524420a360d632d5857b83cffd4d42fe29597807a1614ee"},
@@ -3060,6 +3205,7 @@ version = "8.0.5"
 description = "The kitchen sink of Python utility libraries for doing \"stuff\" in a functional way. Based on the Lo-Dash Javascript library."
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "pydash-8.0.5-py3-none-any.whl", hash = "sha256:b2625f8981862e19911daa07f80ed47b315ce20d9b5eb57aaf97aaf570c3892f"},
     {file = "pydash-8.0.5.tar.gz", hash = "sha256:7cc44ebfe5d362f4f5f06c74c8684143c5ac481376b059ff02570705523f9e2e"},
@@ -3077,6 +3223,7 @@ version = "2.19.2"
 description = "Pygments is a syntax highlighting package written in Python."
 optional = false
 python-versions = ">=3.8"
+groups = ["main", "dev"]
 files = [
     {file = "pygments-2.19.2-py3-none-any.whl", hash = "sha256:86540386c03d588bb81d44bc3928634ff26449851e99741617ecb9037ee5ec0b"},
     {file = "pygments-2.19.2.tar.gz", hash = "sha256:636cb2477cec7f8952536970bc533bc43743542f70392ae026374600add5b887"},
@@ -3091,6 +3238,7 @@ version = "2.10.1"
 description = "JSON Web Token implementation in Python"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "PyJWT-2.10.1-py3-none-any.whl", hash = "sha256:dcdd193e30abefd5debf142f9adfcdd2b58004e644f25406ffaebd50bd98dacb"},
     {file = "pyjwt-2.10.1.tar.gz", hash = "sha256:3cc5772eb20009233caf06e9d8a0577824723b44e6648ee0a2aedb6cf9381953"},
@@ -3111,6 +3259,7 @@ version = "10.16.1"
 description = "Extension pack for Python Markdown."
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "pymdown_extensions-10.16.1-py3-none-any.whl", hash = "sha256:d6ba157a6c03146a7fb122b2b9a121300056384eafeec9c9f9e584adfdb2a32d"},
     {file = "pymdown_extensions-10.16.1.tar.gz", hash = "sha256:aace82bcccba3efc03e25d584e6a22d27a8e17caa3f4dd9f207e49b787aa9a91"},
@@ -3129,6 +3278,7 @@ version = "5.2.0"
 description = "DB API module for ODBC"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "pyodbc-5.2.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:eb0850e3e3782f57457feed297e220bb20c3e8fd7550d7a6b6bb96112bd9b6fe"},
     {file = "pyodbc-5.2.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:0dae0fb86078c87acf135dbe5afd3c7d15d52ab0db5965c44159e84058c3e2fb"},
@@ -3175,6 +3325,7 @@ version = "3.2.3"
 description = "pyparsing module - Classes and methods to define and execute parsing grammars"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "pyparsing-3.2.3-py3-none-any.whl", hash = "sha256:a749938e02d6fd0b59b356ca504a24982314bb090c383e3cf201c95ef7e2bfcf"},
     {file = "pyparsing-3.2.3.tar.gz", hash = "sha256:b9c13f1ab8b3b542f72e28f634bad4de758ab3ce4546e4301970ad6fa77c38be"},
@@ -3189,6 +3340,7 @@ version = "8.4.1"
 description = "pytest: simple powerful testing with Python"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "pytest-8.4.1-py3-none-any.whl", hash = "sha256:539c70ba6fcead8e78eebbf1115e8b589e7565830d7d006a8723f19ac8a0afb7"},
     {file = "pytest-8.4.1.tar.gz", hash = "sha256:7c67fd69174877359ed9371ec3af8a3d2b04741818c51e5e99cc1742251fa93c"},
@@ -3212,6 +3364,7 @@ version = "6.2.1"
 description = "Pytest plugin for measuring coverage."
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "pytest_cov-6.2.1-py3-none-any.whl", hash = "sha256:f5bc4c23f42f1cdd23c70b1dab1bbaef4fc505ba950d53e0081d0730dd7e86d5"},
     {file = "pytest_cov-6.2.1.tar.gz", hash = "sha256:25cc6cc0a5358204b8108ecedc51a9b57b34cc6b8c967cc2c01a4e00d8a67da2"},
@@ -3246,6 +3399,7 @@ version = "0.4.0"
 description = "Pytest session-scoped fixture that works with xdist"
 optional = false
 python-versions = ">=3.10"
+groups = ["main"]
 files = [
     {file = "pytest_shared_session_scope-0.4.0-py3-none-any.whl", hash = "sha256:583508332f0ffdf306fb50487893e4bd4f893caf21778b7b0ea9fad6767fecce"},
     {file = "pytest_shared_session_scope-0.4.0.tar.gz", hash = "sha256:30da6ced4c734bb7cdbc10310da754ca9c8ae75c1384ceb3eda82fa76db647d2"},
@@ -3262,6 +3416,7 @@ version = "3.8.0"
 description = "pytest xdist plugin for distributed testing, most importantly across multiple CPUs"
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "pytest_xdist-3.8.0-py3-none-any.whl", hash = "sha256:202ca578cfeb7370784a8c33d6d05bc6e13b4f25b5053c30a152269fd10f0b88"},
     {file = "pytest_xdist-3.8.0.tar.gz", hash = "sha256:7e578125ec9bc6050861aa93f2d59f1d8d085595d6551c2c90b6f4fad8d3a9f1"},
@@ -3282,6 +3437,7 @@ version = "0.33.2"
 description = "python-benedict is a dict subclass with keylist/keypath/keyattr support, normalized I/O operations (base64, csv, ini, json, pickle, plist, query-string, toml, xls, xml, yaml) and many utilities... for humans, obviously."
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "python-benedict-0.33.2.tar.gz", hash = "sha256:662de43bffb4e127da2056447f8ddd7f6f5c89b72dd66d289cf9abd1cc2720c8"},
     {file = "python_benedict-0.33.2-py3-none-any.whl", hash = "sha256:50a69b601b34d4ad7b67fe94e3266ec05046bc547a4132fe43fd8fbd41aeefaa"},
@@ -3309,6 +3465,7 @@ version = "2.9.0.post0"
 description = "Extensions to the standard Python datetime module"
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
+groups = ["main", "dev"]
 files = [
     {file = "python-dateutil-2.9.0.post0.tar.gz", hash = "sha256:37dd54208da7e1cd875388217d5e00ebd4179249f90fb72437e91a35459a0ad3"},
     {file = "python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427"},
@@ -3323,6 +3480,7 @@ version = "1.1.1"
 description = "Read key-value pairs from a .env file and set them as environment variables"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "python_dotenv-1.1.1-py3-none-any.whl", hash = "sha256:31f23644fe2602f88ff55e1f5c79ba497e01224ee7737937930c448e4d0e24dc"},
     {file = "python_dotenv-1.1.1.tar.gz", hash = "sha256:a8a6399716257f45be6a007360200409fce5cda2661e3dec71d23dc15f6189ab"},
@@ -3337,6 +3495,7 @@ version = "0.15.0"
 description = "high-level file-system operations for lazy devs."
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "python_fsutil-0.15.0-py3-none-any.whl", hash = "sha256:8ae31def522916e35caf67723b8526fe6e5fcc1e160ea2dc23c845567708ca6e"},
     {file = "python_fsutil-0.15.0.tar.gz", hash = "sha256:b51d8ab7ee218314480ea251fff7fef513be4fbccfe72a5af4ff2954f8a4a2c4"},
@@ -3348,6 +3507,7 @@ version = "0.0.18"
 description = "A streaming multipart parser for Python"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "python_multipart-0.0.18-py3-none-any.whl", hash = "sha256:efe91480f485f6a361427a541db4796f9e1591afc0fb8e7a4ba06bfbc6708996"},
     {file = "python_multipart-0.0.18.tar.gz", hash = "sha256:7a68db60c8bfb82e460637fa4750727b45af1d5e2ed215593f917f64694d34fe"},
@@ -3359,6 +3519,7 @@ version = "8.0.4"
 description = "A Python slugify application that also handles Unicode"
 optional = false
 python-versions = ">=3.7"
+groups = ["main", "dev"]
 files = [
     {file = "python-slugify-8.0.4.tar.gz", hash = "sha256:59202371d1d05b54a9e7720c5e038f928f45daaffe41dd10822f3907b937c856"},
     {file = "python_slugify-8.0.4-py2.py3-none-any.whl", hash = "sha256:276540b79961052b66b7d116620b36518847f52d5fd9e3a70164fc8c50faa6b8"},
@@ -3376,6 +3537,8 @@ version = "311"
 description = "Python for Window Extensions"
 optional = false
 python-versions = "*"
+groups = ["main"]
+markers = "sys_platform == \"win32\""
 files = [
     {file = "pywin32-311-cp310-cp310-win32.whl", hash = "sha256:d03ff496d2a0cd4a5893504789d4a15399133fe82517455e78bad62efbb7f0a3"},
     {file = "pywin32-311-cp310-cp310-win_amd64.whl", hash = "sha256:797c2772017851984b97180b0bebe4b620bb86328e8a884bb626156295a63b3b"},
@@ -3405,6 +3568,7 @@ version = "6.0.2"
 description = "YAML parser and emitter for Python"
 optional = false
 python-versions = ">=3.8"
+groups = ["main", "dev"]
 files = [
     {file = "PyYAML-6.0.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:0a9a2848a5b7feac301353437eb7d5957887edbf81d56e903999a75a3d743086"},
     {file = "PyYAML-6.0.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:29717114e51c84ddfba879543fb232a6ed60086602313ca38cce623c1d62cfbf"},
@@ -3467,6 +3631,7 @@ version = "1.1"
 description = "A custom YAML tag for referencing environment variables in YAML files."
 optional = false
 python-versions = ">=3.9"
+groups = ["dev"]
 files = [
     {file = "pyyaml_env_tag-1.1-py3-none-any.whl", hash = "sha256:17109e1a528561e32f026364712fee1264bc2ea6715120891174ed1b980d2e04"},
     {file = "pyyaml_env_tag-1.1.tar.gz", hash = "sha256:2eb38b75a2d21ee0475d6d97ec19c63287a7e140231e4214969d0eac923cd7ff"},
@@ -3481,6 +3646,7 @@ version = "2.6.0"
 description = ""
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "realtime-2.6.0-py3-none-any.whl", hash = "sha256:a0512d71044c2621455bc87d1c171739967edc161381994de54e0989ca6c348e"},
     {file = "realtime-2.6.0.tar.gz", hash = "sha256:f68743cff85d3113659fa19835a868674e720465649bf833e1cd47d7da0f7bbd"},
@@ -3497,6 +3663,7 @@ version = "0.36.2"
 description = "JSON Referencing + Python"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "referencing-0.36.2-py3-none-any.whl", hash = "sha256:e8699adbbf8b5c7de96d8ffa0eb5c158b3beafce084968e2ea8bb08c6794dcd0"},
     {file = "referencing-0.36.2.tar.gz", hash = "sha256:df2e89862cd09deabbdba16944cc3f10feb6b3e6f18e902f7cc25609a34775aa"},
@@ -3513,6 +3680,7 @@ version = "2025.7.34"
 description = "Alternative regular expression module, to replace re."
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "regex-2025.7.34-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:d856164d25e2b3b07b779bfed813eb4b6b6ce73c2fd818d46f47c1eb5cd79bd6"},
     {file = "regex-2025.7.34-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2d15a9da5fad793e35fb7be74eec450d968e05d2e294f3e0e77ab03fa7234a83"},
@@ -3609,6 +3777,7 @@ version = "2.32.4"
 description = "Python HTTP for Humans."
 optional = false
 python-versions = ">=3.8"
+groups = ["main", "dev"]
 files = [
     {file = "requests-2.32.4-py3-none-any.whl", hash = "sha256:27babd3cda2a6d50b30443204ee89830707d396671944c998b5975b031ac2b2c"},
     {file = "requests-2.32.4.tar.gz", hash = "sha256:27d0316682c8a29834d3264820024b62a36942083d52caf2f14c0591336d3422"},
@@ -3630,6 +3799,7 @@ version = "2.0.0"
 description = "OAuthlib authentication support for Requests."
 optional = false
 python-versions = ">=3.4"
+groups = ["main"]
 files = [
     {file = "requests-oauthlib-2.0.0.tar.gz", hash = "sha256:b3dffaebd884d8cd778494369603a9e7b58d29111bf6b41bdc2dcd87203af4e9"},
     {file = "requests_oauthlib-2.0.0-py2.py3-none-any.whl", hash = "sha256:7dd8a5c40426b779b0868c404bdef9768deccf22749cde15852df527e6269b36"},
@@ -3648,6 +3818,7 @@ version = "0.23.3"
 description = "A utility library for mocking out the `requests` Python library."
 optional = false
 python-versions = ">=3.7"
+groups = ["dev"]
 files = [
     {file = "responses-0.23.3-py3-none-any.whl", hash = "sha256:e6fbcf5d82172fecc0aa1860fd91e58cbfd96cee5e96da5b63fa6eb3caa10dd3"},
     {file = "responses-0.23.3.tar.gz", hash = "sha256:205029e1cb334c21cb4ec64fc7599be48b859a0fd381a42443cdd600bfe8b16a"},
@@ -3660,7 +3831,7 @@ types-PyYAML = "*"
 urllib3 = ">=1.25.10,<3.0"

 [package.extras]
-tests = ["coverage (>=6.0.0)", "flake8", "mypy", "pytest (>=7.0.0)", "pytest-asyncio", "pytest-cov", "pytest-httpserver", "tomli", "tomli-w", "types-requests"]
+tests = ["coverage (>=6.0.0)", "flake8", "mypy", "pytest (>=7.0.0)", "pytest-asyncio", "pytest-cov", "pytest-httpserver", "tomli ; python_version < \"3.11\"", "tomli-w", "types-requests"]

 [[package]]
 name = "rich"
@@ -3668,6 +3839,7 @@ version = "13.9.4"
 description = "Render rich text, tables, progress bars, syntax highlighting, markdown and more to the terminal"
 optional = false
 python-versions = ">=3.8.0"
+groups = ["main"]
 files = [
     {file = "rich-13.9.4-py3-none-any.whl", hash = "sha256:6049d5e6ec054bf2779ab3358186963bac2ea89175919d699e378b99738c2a90"},
     {file = "rich-13.9.4.tar.gz", hash = "sha256:439594978a49a09530cff7ebc4b5c7103ef57baf48d5ea3184f21d9a2befa098"},
@@ -3687,6 +3859,7 @@ version = "0.26.0"
 description = "Python bindings to Rust's persistent data structures (rpds)"
 optional = false
 python-versions = ">=3.9"
+groups = ["main", "dev"]
 files = [
     {file = "rpds_py-0.26.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:4c70c70f9169692b36307a95f3d8c0a9fcd79f7b4a383aad5eaa0e9718b79b37"},
     {file = "rpds_py-0.26.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:777c62479d12395bfb932944e61e915741e364c843afc3196b694db3d669fcd0"},
@@ -3840,6 +4013,7 @@ version = "4.9.1"
 description = "Pure-Python RSA implementation"
 optional = false
 python-versions = "<4,>=3.6"
+groups = ["main"]
 files = [
     {file = "rsa-4.9.1-py3-none-any.whl", hash = "sha256:68635866661c6836b8d39430f97a996acbd61bfa49406748ea243539fe239762"},
     {file = "rsa-4.9.1.tar.gz", hash = "sha256:e7bdbfdb5497da4c07dfd35530e1a902659db6ff241e39d9953cad06ebd0ae75"},
@@ -3854,6 +4028,7 @@ version = "0.7.4"
 description = "An extremely fast Python linter and code formatter, written in Rust."
 optional = false
 python-versions = ">=3.7"
+groups = ["dev"]
 files = [
     {file = "ruff-0.7.4-py3-none-linux_armv6l.whl", hash = "sha256:a4919925e7684a3f18e18243cd6bea7cfb8e968a6eaa8437971f681b7ec51478"},
     {file = "ruff-0.7.4-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:cfb365c135b830778dda8c04fb7d4280ed0b984e1aec27f574445231e20d6c63"},
@@ -3881,6 +4056,7 @@ version = "0.13.1"
 description = "An Amazon S3 Transfer Manager"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "s3transfer-0.13.1-py3-none-any.whl", hash = "sha256:a981aa7429be23fe6dfc13e80e4020057cbab622b08c0315288758d67cabc724"},
     {file = "s3transfer-0.13.1.tar.gz", hash = "sha256:c3fdba22ba1bd367922f27ec8032d6a1cf5f10c934fb5d68cf60fd5a23d936cf"},
@@ -3898,6 +4074,7 @@ version = "2.34.1"
 description = "Python client for Sentry (https://sentry.io)"
 optional = false
 python-versions = ">=3.6"
+groups = ["main"]
 files = [
     {file = "sentry_sdk-2.34.1-py2.py3-none-any.whl", hash = "sha256:b7a072e1cdc5abc48101d5146e1ae680fa81fe886d8d95aaa25a0b450c818d32"},
     {file = "sentry_sdk-2.34.1.tar.gz", hash = "sha256:69274eb8c5c38562a544c3e9f68b5be0a43be4b697f5fd385bf98e4fbe672687"},
@@ -3955,6 +4132,7 @@ version = "70.0.0"
 description = "Easily download, build, install, upgrade, and uninstall Python packages"
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "setuptools-70.0.0-py3-none-any.whl", hash = "sha256:54faa7f2e8d2d11bcd2c07bed282eef1046b5c080d1c32add737d7b5817b1ad4"},
     {file = "setuptools-70.0.0.tar.gz", hash = "sha256:f211a66637b8fa059bb28183da127d4e86396c991a942b028c6650d4319c3fd0"},
@@ -3962,7 +4140,7 @@ files = [

 [package.extras]
 docs = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "pygments-github-lexers (==0.0.5)", "pyproject-hooks (!=1.1)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-favicon", "sphinx-inline-tabs", "sphinx-lint", "sphinx-notfound-page (>=1,<2)", "sphinx-reredirects", "sphinxcontrib-towncrier"]
-testing = ["build[virtualenv] (>=1.0.3)", "filelock (>=3.4.0)", "importlib-metadata", "ini2toml[lite] (>=0.14)", "jaraco.develop (>=7.21)", "jaraco.envs (>=2.2)", "jaraco.path (>=3.2.0)", "mypy (==1.9)", "packaging (>=23.2)", "pip (>=19.1)", "pyproject-hooks (!=1.1)", "pytest (>=6,!=8.1.1)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=2.2)", "pytest-home (>=0.5)", "pytest-mypy", "pytest-perf", "pytest-ruff (>=0.2.1)", "pytest-subprocess", "pytest-timeout", "pytest-xdist (>=3)", "tomli", "tomli-w (>=1.0.0)", "virtualenv (>=13.0.0)", "wheel"]
+testing = ["build[virtualenv] (>=1.0.3)", "filelock (>=3.4.0)", "importlib-metadata", "ini2toml[lite] (>=0.14)", "jaraco.develop (>=7.21) ; python_version >= \"3.9\" and sys_platform != \"cygwin\"", "jaraco.envs (>=2.2)", "jaraco.path (>=3.2.0)", "mypy (==1.9)", "packaging (>=23.2)", "pip (>=19.1)", "pyproject-hooks (!=1.1)", "pytest (>=6,!=8.1.1)", "pytest-checkdocs (>=2.4)", "pytest-cov ; platform_python_implementation != \"PyPy\"", "pytest-enabler (>=2.2)", "pytest-home (>=0.5)", "pytest-mypy", "pytest-perf ; sys_platform != \"cygwin\"", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\"", "pytest-subprocess", "pytest-timeout", "pytest-xdist (>=3)", "tomli", "tomli-w (>=1.0.0)", "virtualenv (>=13.0.0)", "wheel"]

 [[package]]
 name = "shellingham"
@@ -3970,6 +4148,7 @@ version = "1.5.4"
 description = "Tool to Detect Surrounding Shell"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "shellingham-1.5.4-py2.py3-none-any.whl", hash = "sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686"},
     {file = "shellingham-1.5.4.tar.gz", hash = "sha256:8dbca0739d487e5bd35ab3ca4b36e11c4078f3a234bfce294b0a0291363404de"},
@@ -3981,6 +4160,7 @@ version = "1.17.0"
 description = "Python 2 and 3 compatibility utilities"
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
+groups = ["main", "dev"]
 files = [
     {file = "six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274"},
     {file = "six-1.17.0.tar.gz", hash = "sha256:ff70335d468e7eb6ec65b95b99d3a2836546063f63acc5171de367e834932a81"},
@@ -3992,6 +4172,7 @@ version = "1.23.0"
 description = "The Bolt Framework for Python"
 optional = false
 python-versions = ">=3.6"
+groups = ["main"]
 files = [
     {file = "slack_bolt-1.23.0-py2.py3-none-any.whl", hash = "sha256:6d6ae39d80c964c362505ae4e587eed2b26dbc3a9f0cb76af1150c30fb670488"},
     {file = "slack_bolt-1.23.0.tar.gz", hash = "sha256:3d2c3eb13131407a94f925eb22b180d352c2d97b808303ef92b7a46d6508c843"},
@@ -4006,6 +4187,7 @@ version = "3.36.0"
 description = "The Slack API Platform SDK for Python"
 optional = false
 python-versions = ">=3.6"
+groups = ["main"]
 files = [
     {file = "slack_sdk-3.36.0-py2.py3-none-any.whl", hash = "sha256:6c96887d7175fc1b0b2777b73bb65f39b5b8bee9bd8acfec071d64014f9e2d10"},
     {file = "slack_sdk-3.36.0.tar.gz", hash = "sha256:8586022bdbdf9f8f8d32f394540436c53b1e7c8da9d21e1eab4560ba70cfcffa"},
@@ -4020,6 +4202,7 @@ version = "5.0.2"
 description = "A pure Python implementation of a sliding window memory map manager"
 optional = false
 python-versions = ">=3.7"
+groups = ["dev"]
 files = [
     {file = "smmap-5.0.2-py3-none-any.whl", hash = "sha256:b30115f0def7d7531d22a0fb6502488d879e75b260a9db4d0819cfb25403af5e"},
     {file = "smmap-5.0.2.tar.gz", hash = "sha256:26ea65a03958fa0c8a1c7e8c7a58fdc77221b8910f6be2131affade476898ad5"},
@@ -4031,6 +4214,7 @@ version = "1.3.1"
 description = "Sniff out which async library your code is running under"
 optional = false
 python-versions = ">=3.7"
+groups = ["main"]
 files = [
     {file = "sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2"},
     {file = "sniffio-1.3.1.tar.gz", hash = "sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc"},
@@ -4042,6 +4226,7 @@ version = "2.7"
 description = "A modern CSS selector implementation for Beautiful Soup."
 optional = false
 python-versions = ">=3.8"
+groups = ["main"]
 files = [
     {file = "soupsieve-2.7-py3-none-any.whl", hash = "sha256:6e60cc5c1ffaf1cebcc12e8188320b72071e922c2e897f737cadce79ad5d30c4"},
     {file = "soupsieve-2.7.tar.gz", hash = "sha256:ad282f9b6926286d2ead4750552c8a6142bc4c783fd66b0293547c8fe6ae126a"},
@@ -4053,6 +4238,7 @@ version = "3.0.2"
 description = "SSE plugin for Starlette"
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "sse_starlette-3.0.2-py3-none-any.whl", hash = "sha256:16b7cbfddbcd4eaca11f7b586f3b8a080f1afe952c15813455b162edea619e5a"},
     {file = "sse_starlette-3.0.2.tar.gz", hash = "sha256:ccd60b5765ebb3584d0de2d7a6e4f745672581de4f5005ab31c3a25d10b52b3a"},
@@ -4073,6 +4259,7 @@ version = "1.8.0"
 description = "SSE client for Python"
 optional = false
 python-versions = "*"
+groups = ["dev"]
 files = [
     {file = "sseclient-py-1.8.0.tar.gz", hash = "sha256:c547c5c1a7633230a38dc599a21a2dc638f9b5c297286b48b46b935c71fac3e8"},
     {file = "sseclient_py-1.8.0-py2.py3-none-any.whl", hash = "sha256:4ecca6dc0b9f963f8384e9d7fd529bf93dd7d708144c4fb5da0e0a1a926fee83"},
@@ -4084,6 +4271,7 @@ version = "0.47.2"
 description = "The little ASGI library that shines."
 optional = false
 python-versions = ">=3.9"
+groups = ["main"]
 files = [
     {file = "starlette-0.47.2-py3-none-any.whl", hash = "sha256:c5847e96134e5c5371ee9fac6fdf1a67336d5815e09eb2a01fdb57a351ef915b"},
     {file = "starlette-0.47.2.tar.gz", hash = "sha256:6ae9aa5db235e4846decc1e7b79c4f346adf41e9777aebeb49dfd09bbd7023d8"},
@@ -4102,6 +4290,7 @@ version = "0.12.0"
 description = "Supabase Storage client for Python."
 optional = false
 python-versions = "<4.0,>=3.9"
+groups = ["main"]
 files = [
     {file = "storage3-0.12.0-py3-none-any.whl", hash = "sha256:1c4585693ca42243ded1512b58e54c697111e91a20916cd14783eebc37e7c87d"},
     {file = "storage3-0.12.0.tar.gz", hash = "sha256:94243f20922d57738bf42e96b9f5582b4d166e8bf209eccf20b146909f3f71b0"},
@@ -4118,6 +4307,7 @@ version = "0.4.15"
 description = "An Enum that inherits from str."
 optional = false
 python-versions = "*"
+groups = ["main"]
 files = [
     {file = "StrEnum-0.4.15-py3-none-any.whl", hash = "sha256:a30cda4af7cc6b5bf52c8055bc4bf4b2b6b14a93b574626da33df53cf7740659"},
     {file = "StrEnum-0.4.15.tar.gz", hash = "sha256:878fb5ab705442070e4dd1929bb5e2249511c0bcf2b0eeacf3bcd80875c82eff"},
@@ -4134,6 +4324,7 @@ version = "2.17.0"
 description = "Supabase client for Python."
optional = false python-versions = "<4.0,>=3.9" +groups = ["main"] files = [ {file = "supabase-2.17.0-py3-none-any.whl", hash = "sha256:2dd804fae8850cebccc9ab8711c2ee9e2f009e847f4c95c092a4423778e3c3f6"}, {file = "supabase-2.17.0.tar.gz", hash = "sha256:3207314b540db7e3339fa2500bd977541517afb4d20b7ff93a89b97a05f9df38"}, @@ -4153,6 +4344,7 @@ version = "0.10.1" description = "Library for Supabase Functions" optional = false python-versions = "<4.0,>=3.9" +groups = ["main"] files = [ {file = "supafunc-0.10.1-py3-none-any.whl", hash = "sha256:26df9bd25ff2ef56cb5bfb8962de98f43331f7f8ff69572bac3ed9c3a9672040"}, {file = "supafunc-0.10.1.tar.gz", hash = "sha256:a5b33c8baecb6b5297d25da29a2503e2ec67ee6986f3d44c137e651b8a59a17d"}, @@ -4168,6 +4360,7 @@ version = "9.1.2" description = "Retry code until it succeeds" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "tenacity-9.1.2-py3-none-any.whl", hash = "sha256:f77bf36710d8b73a50b2dd155c97b870017ad21afe6ab300326b0371b3b05138"}, {file = "tenacity-9.1.2.tar.gz", hash = "sha256:1169d376c297e7de388d18b4481760d478b0e99a777cad3a9c86e556f4b697cb"}, @@ -4183,6 +4376,7 @@ version = "1.3" description = "The most basic Text::Unidecode port" optional = false python-versions = "*" +groups = ["main", "dev"] files = [ {file = "text-unidecode-1.3.tar.gz", hash = "sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93"}, {file = "text_unidecode-1.3-py2.py3-none-any.whl", hash = "sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8"}, @@ -4194,6 +4388,7 @@ version = "0.9.0" description = "tiktoken is a fast BPE tokeniser for use with OpenAI's models" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "tiktoken-0.9.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:586c16358138b96ea804c034b8acf3f5d3f0258bd2bc3b0227af4af5d622e382"}, {file = "tiktoken-0.9.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:d9c59ccc528c6c5dd51820b3474402f69d9a9e1d656226848ad68a8d5b2e5108"}, @@ -4241,6 +4436,7 @@ version = "0.21.4" description = "" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "tokenizers-0.21.4-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:2ccc10a7c3bcefe0f242867dc914fc1226ee44321eb618cfe3019b5df3400133"}, {file = "tokenizers-0.21.4-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:5e2f601a8e0cd5be5cc7506b20a79112370b9b3e9cb5f13f68ab11acd6ca7d60"}, @@ -4273,6 +4469,7 @@ version = "2.2.1" description = "A lil' TOML parser" optional = false python-versions = ">=3.8" +groups = ["main", "dev"] files = [ {file = "tomli-2.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:678e4fa69e4575eb77d103de3df8a895e1591b48e740211bd1067378c69e8249"}, {file = "tomli-2.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:023aa114dd824ade0100497eb2318602af309e5a55595f76b626d6d9f3b7b0a6"}, @@ -4307,6 +4504,7 @@ files = [ {file = "tomli-2.2.1-py3-none-any.whl", hash = "sha256:cb55c73c5f4408779d0cf3eef9f762b9c9f147a77de7b258bef0a5628adc85cc"}, {file = "tomli-2.2.1.tar.gz", hash = "sha256:cd45e1dc79c835ce60f7404ec8119f2eb06d38b1deba146f07ced3bbc44505ff"}, ] +markers = {main = "python_version == \"3.10\"", dev = "python_full_version <= \"3.11.0a6\""} [[package]] name = "tqdm" @@ -4314,6 +4512,7 @@ version = "4.67.1" description = "Fast, Extensible Progress Meter" optional = false python-versions = ">=3.7" +groups = ["main", "dev"] files = [ {file = "tqdm-4.67.1-py3-none-any.whl", hash = 
"sha256:26445eca388f82e72884e0d580d5464cd801a3ea01e63e5601bdff9ba6a48de2"}, {file = "tqdm-4.67.1.tar.gz", hash = "sha256:f8aef9c52c08c13a65f30ea34f4e5aac3fd1a34959879d7e59e63027286627f2"}, @@ -4335,6 +4534,7 @@ version = "0.15.4" description = "Typer, build great CLIs. Easy to code. Based on Python type hints." optional = false python-versions = ">=3.7" +groups = ["main"] files = [ {file = "typer-0.15.4-py3-none-any.whl", hash = "sha256:eb0651654dcdea706780c466cf06d8f174405a659ffff8f163cfbfee98c0e173"}, {file = "typer-0.15.4.tar.gz", hash = "sha256:89507b104f9b6a0730354f27c39fae5b63ccd0c95b1ce1f1a6ba0cfd329997c3"}, @@ -4363,6 +4563,7 @@ version = "6.0.12.20250516" description = "Typing stubs for PyYAML" optional = false python-versions = ">=3.9" +groups = ["dev"] files = [ {file = "types_pyyaml-6.0.12.20250516-py3-none-any.whl", hash = "sha256:8478208feaeb53a34cb5d970c56a7cd76b72659442e733e268a94dc72b2d0530"}, {file = "types_pyyaml-6.0.12.20250516.tar.gz", hash = "sha256:9f21a70216fc0fa1b216a8176db5f9e0af6eb35d2f2932acb87689d03a5bf6ba"}, @@ -4374,6 +4575,7 @@ version = "4.14.1" description = "Backported and Experimental Type Hints for Python 3.9+" optional = false python-versions = ">=3.9" +groups = ["main", "dev"] files = [ {file = "typing_extensions-4.14.1-py3-none-any.whl", hash = "sha256:d1e1e3b58374dc93031d6eda2420a48ea44a36c2b4766a4fdeb3710755731d76"}, {file = "typing_extensions-4.14.1.tar.gz", hash = "sha256:38b39f4aeeab64884ce9f74c94263ef78f3c22467c8724005483154c26648d36"}, @@ -4385,6 +4587,7 @@ version = "0.4.1" description = "Runtime typing introspection tools" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "typing_inspection-0.4.1-py3-none-any.whl", hash = "sha256:389055682238f53b04f7badcb49b989835495a96700ced5dab2d8feae4b26f51"}, {file = "typing_inspection-0.4.1.tar.gz", hash = "sha256:6ae134cc0203c33377d43188d4064e9b357dba58cff3185f22924610e70a9d28"}, @@ -4399,6 +4602,7 @@ version = "4.2.0" description = "Implementation of RFC 6570 URI Templates" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "uritemplate-4.2.0-py3-none-any.whl", hash = "sha256:962201ba1c4edcab02e60f9a0d3821e82dfc5d2d6662a21abd533879bdb8a686"}, {file = "uritemplate-4.2.0.tar.gz", hash = "sha256:480c2ed180878955863323eea31b0ede668795de182617fef9c6ca09e6ec9d0e"}, @@ -4410,14 +4614,15 @@ version = "1.26.20" description = "HTTP library with thread-safe connection pooling, file post, and more." 
optional = false python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,>=2.7" +groups = ["main", "dev"] files = [ {file = "urllib3-1.26.20-py2.py3-none-any.whl", hash = "sha256:0ed14ccfbf1c30a9072c7ca157e4319b70d65f623e91e7b32fadb2853431016e"}, {file = "urllib3-1.26.20.tar.gz", hash = "sha256:40c2dc0c681e47eb8f90e7e27bf6ff7df2e677421fd46756da1161c39ca70d32"}, ] [package.extras] -brotli = ["brotli (==1.0.9)", "brotli (>=1.0.9)", "brotlicffi (>=0.8.0)", "brotlipy (>=0.6.0)"] -secure = ["certifi", "cryptography (>=1.3.4)", "idna (>=2.0.0)", "ipaddress", "pyOpenSSL (>=0.14)", "urllib3-secure-extra"] +brotli = ["brotli (==1.0.9) ; os_name != \"nt\" and python_version < \"3\" and platform_python_implementation == \"CPython\"", "brotli (>=1.0.9) ; python_version >= \"3\" and platform_python_implementation == \"CPython\"", "brotlicffi (>=0.8.0) ; (os_name != \"nt\" or python_version >= \"3\") and platform_python_implementation != \"CPython\"", "brotlipy (>=0.6.0) ; os_name == \"nt\" and python_version < \"3\""] +secure = ["certifi", "cryptography (>=1.3.4)", "idna (>=2.0.0)", "ipaddress ; python_version == \"2.7\"", "pyOpenSSL (>=0.14)", "urllib3-secure-extra"] socks = ["PySocks (>=1.5.6,!=1.5.7,<2.0)"] [[package]] @@ -4426,6 +4631,7 @@ version = "0.30.6" description = "The lightning-fast ASGI server." optional = false python-versions = ">=3.8" +groups = ["main"] files = [ {file = "uvicorn-0.30.6-py3-none-any.whl", hash = "sha256:65fd46fe3fda5bdc1b03b94eb634923ff18cd35b2f084813ea79d1f103f711b5"}, {file = "uvicorn-0.30.6.tar.gz", hash = "sha256:4b15decdda1e72be08209e860a1e10e92439ad5b97cf44cc945fcbee66fc5788"}, @@ -4437,7 +4643,7 @@ h11 = ">=0.8" typing-extensions = {version = ">=4.0", markers = "python_version < \"3.11\""} [package.extras] -standard = ["colorama (>=0.4)", "httptools (>=0.5.0)", "python-dotenv (>=0.13)", "pyyaml (>=5.1)", "uvloop (>=0.14.0,!=0.15.0,!=0.15.1)", "watchfiles (>=0.13)", "websockets (>=10.4)"] +standard = ["colorama (>=0.4) ; sys_platform == \"win32\"", "httptools (>=0.5.0)", "python-dotenv (>=0.13)", "pyyaml (>=5.1)", "uvloop (>=0.14.0,!=0.15.0,!=0.15.1) ; sys_platform != \"win32\" and sys_platform != \"cygwin\" and platform_python_implementation != \"PyPy\"", "watchfiles (>=0.13)", "websockets (>=10.4)"] [[package]] name = "virtualenv" @@ -4445,6 +4651,7 @@ version = "20.32.0" description = "Virtual Python Environment builder" optional = false python-versions = ">=3.8" +groups = ["dev"] files = [ {file = "virtualenv-20.32.0-py3-none-any.whl", hash = "sha256:2c310aecb62e5aa1b06103ed7c2977b81e042695de2697d01017ff0f1034af56"}, {file = "virtualenv-20.32.0.tar.gz", hash = "sha256:886bf75cadfdc964674e6e33eb74d787dff31ca314ceace03ca5810620f4ecf0"}, @@ -4457,7 +4664,7 @@ platformdirs = ">=3.9.1,<5" [package.extras] docs = ["furo (>=2023.7.26)", "proselint (>=0.13)", "sphinx (>=7.1.2,!=7.3)", "sphinx-argparse (>=0.4)", "sphinxcontrib-towncrier (>=0.2.1a0)", "towncrier (>=23.6)"] -test = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "coverage-enable-subprocess (>=1)", "flaky (>=3.7)", "packaging (>=23.1)", "pytest (>=7.4)", "pytest-env (>=0.8.2)", "pytest-freezer (>=0.4.8)", "pytest-mock (>=3.11.1)", "pytest-randomly (>=3.12)", "pytest-timeout (>=2.1)", "setuptools (>=68)", "time-machine (>=2.10)"] +test = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "coverage-enable-subprocess (>=1)", "flaky (>=3.7)", "packaging (>=23.1)", "pytest (>=7.4)", "pytest-env (>=0.8.2)", "pytest-freezer (>=0.4.8) ; platform_python_implementation == \"PyPy\" or 
platform_python_implementation == \"GraalVM\" or platform_python_implementation == \"CPython\" and sys_platform == \"win32\" and python_version >= \"3.13\"", "pytest-mock (>=3.11.1)", "pytest-randomly (>=3.12)", "pytest-timeout (>=2.1)", "setuptools (>=68)", "time-machine (>=2.10) ; platform_python_implementation == \"CPython\""] [[package]] name = "watchdog" @@ -4465,6 +4672,7 @@ version = "6.0.0" description = "Filesystem events monitoring" optional = false python-versions = ">=3.9" +groups = ["dev"] files = [ {file = "watchdog-6.0.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:d1cdb490583ebd691c012b3d6dae011000fe42edb7a82ece80965b42abd61f26"}, {file = "watchdog-6.0.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:bc64ab3bdb6a04d69d4023b29422170b74681784ffb9463ed4870cf2f3e66112"}, @@ -4507,6 +4715,7 @@ version = "0.2.13" description = "Measures the displayed width of unicode strings in a terminal" optional = false python-versions = "*" +groups = ["main"] files = [ {file = "wcwidth-0.2.13-py2.py3-none-any.whl", hash = "sha256:3da69048e4540d84af32131829ff948f1e022c1c6bdb8d6102117aac784f6859"}, {file = "wcwidth-0.2.13.tar.gz", hash = "sha256:72ea0c06399eb286d978fdedb6923a9eb47e1c486ce63e9b4e64fc18303972b5"}, @@ -4518,6 +4727,7 @@ version = "1.8.0" description = "WebSocket client for Python with low level API options" optional = false python-versions = ">=3.8" +groups = ["main"] files = [ {file = "websocket_client-1.8.0-py3-none-any.whl", hash = "sha256:17b44cc997f5c498e809b22cdf2d9c7a9e71c02c8cc2b6c56e7c2d1239bfa526"}, {file = "websocket_client-1.8.0.tar.gz", hash = "sha256:3239df9f44da632f96012472805d40a23281a991027ce11d2f45a6f24ac4c3da"}, @@ -4534,6 +4744,7 @@ version = "15.0.1" description = "An implementation of the WebSocket Protocol (RFC 6455 & 7692)" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "websockets-15.0.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:d63efaa0cd96cf0c5fe4d581521d9fa87744540d4bc999ae6e08595a1014b45b"}, {file = "websockets-15.0.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:ac60e3b188ec7574cb761b08d50fcedf9d77f1530352db4eef1707fe9dee7205"}, @@ -4612,6 +4823,7 @@ version = "1.20.1" description = "Yet another URL library" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "yarl-1.20.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:6032e6da6abd41e4acda34d75a816012717000fa6839f37124a47fcefc49bec4"}, {file = "yarl-1.20.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2c7b34d804b8cf9b214f05015c4fee2ebe7ed05cf581e7192c06555c71f4446a"}, @@ -4730,13 +4942,14 @@ version = "3.23.0" description = "Backport of pathlib-compatible object wrapper for zip files" optional = false python-versions = ">=3.9" +groups = ["main"] files = [ {file = "zipp-3.23.0-py3-none-any.whl", hash = "sha256:071652d6115ed432f5ce1d34c336c0adfd6a884660d1e9712a256d3d3bd4b14e"}, {file = "zipp-3.23.0.tar.gz", hash = "sha256:a07157588a12518c9d4034df3fbbee09c814741a33ff63c05fa29d26a2404166"}, ] [package.extras] -check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1)"] +check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\""] cover = ["pytest-cov"] doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"] enabler = ["pytest-enabler (>=2.2)"] @@ -4744,6 +4957,6 @@ test = ["big-O", "jaraco.functools", "jaraco.itertools", "jaraco.test", "more_it type = ["pytest-mypy"] [metadata] -lock-version = 
"2.0" +lock-version = "2.1" python-versions = "^3.10" -content-hash = "78cd14bdfd8ebebe56b8b22a76ea8c22ecc16cc62e8a697e6e760e2ae05f12e3" +content-hash = "341bea8283fb05118e2240a474ced564926e30059dfefd8b9965846941aa9632" diff --git a/pyproject.toml b/pyproject.toml index d425dc428..6d42176bc 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -58,6 +58,7 @@ azure-monitor-query = "^1.2.0" azure-mgmt-monitor = "^7.0.0b1" azure-mgmt-alertsmanagement = "^1.0.0" azure-mgmt-resource = "^23.3.0" +azure-mgmt-resourcegraph = "^8.0.0" protobuf = ">=6.31.1" tenacity = "^9.1.2" diff --git a/tests/plugins/toolsets/azuremonitor_metrics/README.md b/tests/plugins/toolsets/azuremonitor_metrics/README.md new file mode 100644 index 000000000..b8c3f29c2 --- /dev/null +++ b/tests/plugins/toolsets/azuremonitor_metrics/README.md @@ -0,0 +1,109 @@ +# Azure Monitor Metrics Toolset Tests + +This directory contains comprehensive unit tests for the Azure Monitor Metrics toolset individual tools. + +## Test Coverage + +The test suite covers all 7 tools in the Azure Monitor Metrics toolset: + +### 1. CheckAKSClusterContext +- ✅ Successful AKS environment detection (running in AKS) +- ✅ Successful detection of non-AKS environment +- ✅ Error handling during environment detection + +### 2. GetAKSClusterResourceID +- ✅ Successful cluster resource ID retrieval +- ✅ Failure when cluster cannot be found +- ✅ Exception handling during cluster detection + +### 3. CheckAzureMonitorPrometheusEnabled +- ✅ Successful detection of enabled Azure Monitor Prometheus +- ✅ Failure when Prometheus is not enabled +- ✅ Failure when no cluster is specified +- ✅ Exception handling during workspace queries + +### 4. ExecuteAzureMonitorPrometheusQuery +- ✅ Successful PromQL query execution with data +- ✅ Query execution with no data returned +- ✅ HTTP error handling (401, etc.) +- ✅ Prometheus query error handling +- ✅ Workspace not configured scenarios +- ✅ Connection error handling + +### 5. GetActivePrometheusAlerts +- ✅ Successful retrieval of multiple alerts +- ✅ Specific alert retrieval by ID +- ✅ No alerts found scenarios +- ✅ Missing cluster handling +- ✅ Source plugin error handling + +### 6. ExecuteAzureMonitorPrometheusRangeQuery +- ✅ Successful range query execution +- ✅ Step size calculation logic +- ✅ User-provided step size handling +- ✅ Minimum step size enforcement +- ✅ Workspace configuration validation + +### 7. 
ExecuteAlertPromQLQuery
+- ✅ Successful alert query execution
+- ✅ Alert not found scenarios
+- ✅ Query extraction failures
+- ✅ Time range parsing (valid and invalid formats)
+
+## Test Structure
+
+Each tool class has its own test class following this pattern:
+
+```python
+class TestToolName:
+    def setup_method(self):
+        """Set up test fixtures."""
+        ...
+
+    def test_success_scenario(self):
+        """Test successful operations."""
+        ...
+
+    def test_failure_scenario(self):
+        """Test error conditions."""
+        ...
+```
+
+## Mocking Strategy
+
+The tests use comprehensive mocking for:
+
+- **Azure Authentication**: Mock `_get_authenticated_headers()` and token acquisition
+- **HTTP Requests**: Mock `requests.get`/`requests.post` calls to Azure Monitor endpoints
+- **Azure Resource Graph**: Mock ARG queries for workspace discovery
+- **Utility Functions**: Mock cluster detection and resource ID parsing
+- **Source Plugins**: Mock `AzureMonitorAlertsSource` for alert fetching
+- **External Commands**: Mock kubectl and Azure CLI interactions
+
+A self-contained sketch of this mocking pattern is shown at the end of this README.
+
+## Running the Tests
+
+```bash
+# Run all tests
+poetry run pytest tests/plugins/toolsets/azuremonitor_metrics/test_azuremonitor_metrics_tools.py -v
+
+# Run tests for a single tool
+poetry run pytest tests/plugins/toolsets/azuremonitor_metrics/test_azuremonitor_metrics_tools.py::TestCheckAKSClusterContext -v
+
+# Run a single test
+poetry run pytest tests/plugins/toolsets/azuremonitor_metrics/test_azuremonitor_metrics_tools.py::TestCheckAKSClusterContext::test_success_running_in_aks -v
+```
+
+## Test Benefits
+
+1. **Comprehensive Coverage**: Tests all major functionality and edge cases
+2. **Isolated Testing**: Each tool is tested independently with proper mocking
+3. **Real-world Scenarios**: Covers actual usage patterns and failure modes
+4. **Maintainable**: Clear organization and reusable patterns
+5. **Integration Ready**: Follows existing project test patterns and conventions
+
+## Key Features Tested
+
+- **Azure Integration**: Authentication, token handling, workspace discovery
+- **AKS Detection**: Environment detection, cluster resource ID retrieval
+- **PromQL Processing**: Query enhancement, cluster filtering, response parsing
+- **Error Handling**: Network errors, authentication failures, invalid parameters
+- **Alert Management**: Alert fetching, formatting, investigation workflows
+- **Configuration Validation**: Workspace setup, parameter validation, defaults
+
+The test suite ensures the Azure Monitor Metrics toolset is robust, reliable, and handles all expected scenarios gracefully.
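+
+## Example: Mocking Pattern
+
+The sketch below shows the mocking strategy in isolation, using only `unittest.mock` and `requests`. It is illustrative, not code from this repository: `FakeTool` is a hypothetical stand-in for a real toolset tool such as `ExecuteAzureMonitorPrometheusQuery`, and the endpoint URL is a placeholder.
+
+```python
+from unittest.mock import Mock, patch
+
+import requests
+
+
+class FakeTool:
+    """Hypothetical stand-in for a toolset tool that queries Prometheus."""
+
+    def invoke(self, query: str) -> dict:
+        # In the real toolset the token comes from Azure authentication;
+        # here it is hard-coded because the HTTP layer is mocked anyway.
+        headers = {"Authorization": "Bearer test-token"}
+        response = requests.post(
+            "https://example.prometheus.monitor.azure.com/api/v1/query",
+            headers=headers,
+            data={"query": query},
+        )
+        return response.json()
+
+
+def test_fake_tool_returns_prometheus_payload():
+    # Replace requests.post so no network call is made.
+    fake_response = Mock(status_code=200)
+    fake_response.json.return_value = {"status": "success", "data": {"result": []}}
+
+    with patch("requests.post", return_value=fake_response) as mock_post:
+        result = FakeTool().invoke("up")
+
+    # The tool returns the mocked payload, and the HTTP layer was hit exactly once.
+    assert result["status"] == "success"
+    mock_post.assert_called_once()
+```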
diff --git a/tests/plugins/toolsets/azuremonitor_metrics/test_azuremonitor_metrics_tools.py b/tests/plugins/toolsets/azuremonitor_metrics/test_azuremonitor_metrics_tools.py new file mode 100644 index 000000000..8d8872324 --- /dev/null +++ b/tests/plugins/toolsets/azuremonitor_metrics/test_azuremonitor_metrics_tools.py @@ -0,0 +1,886 @@ +"""Unit tests for Azure Monitor Metrics toolset individual tools.""" + +import json +import pytest +from unittest.mock import MagicMock, patch, Mock +from requests import RequestException + +from holmes.core.tools import ToolResultStatus, StructuredToolResult +from holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset import ( + AzureMonitorMetricsToolset, + AzureMonitorMetricsConfig, + CheckAKSClusterContext, + GetAKSClusterResourceID, + CheckAzureMonitorPrometheusEnabled, + GetActivePrometheusAlerts, + ExecuteAzureMonitorPrometheusQuery, + ExecuteAlertPromQLQuery, + ExecuteAzureMonitorPrometheusRangeQuery, +) + + +class TestCheckAKSClusterContext: + """Tests for CheckAKSClusterContext tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig() + self.tool = CheckAKSClusterContext(self.toolset) + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.check_if_running_in_aks') + def test_success_running_in_aks(self, mock_check_aks): + """Test successful detection of AKS environment.""" + # Mock AKS detection to return True + mock_check_aks.return_value = True + + # Execute tool + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + assert isinstance(result.data, dict) + assert result.data["running_in_aks"] is True + assert "Running in AKS cluster" in result.data["message"] + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.check_if_running_in_aks') + def test_success_not_running_in_aks(self, mock_check_aks): + """Test successful detection of non-AKS environment.""" + # Mock AKS detection to return False + mock_check_aks.return_value = False + + # Execute tool + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + assert isinstance(result.data, dict) + assert result.data["running_in_aks"] is False + assert "Not running in AKS cluster" in result.data["message"] + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.check_if_running_in_aks') + def test_failure_detection_error(self, mock_check_aks): + """Test error handling during AKS detection.""" + # Mock AKS detection to raise exception + mock_check_aks.side_effect = Exception("Environment detection failed") + + # Execute tool + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Failed to check AKS cluster context" in result.error + assert "Environment detection failed" in result.error + + +class TestGetAKSClusterResourceID: + """Tests for GetAKSClusterResourceID tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig() + self.tool = GetAKSClusterResourceID(self.toolset) + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_aks_cluster_resource_id') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.extract_cluster_name_from_resource_id') + def 
test_success_cluster_found(self, mock_extract_name, mock_get_resource_id): + """Test successful cluster resource ID retrieval.""" + # Mock cluster resource ID and name extraction + test_resource_id = "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + test_cluster_name = "test-cluster" + + mock_get_resource_id.return_value = test_resource_id + mock_extract_name.return_value = test_cluster_name + + # Execute tool + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + assert isinstance(result.data, dict) + assert result.data["cluster_resource_id"] == test_resource_id + assert result.data["cluster_name"] == test_cluster_name + assert "Found AKS cluster: test-cluster" in result.data["message"] + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_aks_cluster_resource_id') + def test_failure_cluster_not_found(self, mock_get_resource_id): + """Test failure when cluster resource ID cannot be determined.""" + # Mock cluster resource ID to return None + mock_get_resource_id.return_value = None + + # Execute tool + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Could not determine AKS cluster resource ID" in result.error + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_aks_cluster_resource_id') + def test_failure_exception_during_detection(self, mock_get_resource_id): + """Test error handling during cluster detection.""" + # Mock cluster detection to raise exception + mock_get_resource_id.side_effect = Exception("Azure CLI not available") + + # Execute tool + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Failed to get AKS cluster resource ID" in result.error + assert "Azure CLI not available" in result.error + + +class TestCheckAzureMonitorPrometheusEnabled: + """Tests for CheckAzureMonitorPrometheusEnabled tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig() + self.tool = CheckAzureMonitorPrometheusEnabled(self.toolset) + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_azure_monitor_workspace_for_cluster') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.extract_cluster_name_from_resource_id') + def test_success_prometheus_enabled(self, mock_extract_name, mock_get_workspace): + """Test successful detection of enabled Azure Monitor Prometheus.""" + # Mock workspace info and cluster name + test_resource_id = "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + test_cluster_name = "test-cluster" + workspace_info = { + "prometheus_query_endpoint": "https://test-workspace.prometheus.monitor.azure.com", + "azure_monitor_workspace_resource_id": "/subscriptions/test-sub/resourceGroups/test-rg/providers/microsoft.monitor/accounts/test-workspace", + "location": "eastus", + "associated_grafanas": [] + } + + mock_get_workspace.return_value = workspace_info + mock_extract_name.return_value = test_cluster_name + + # Execute tool with cluster resource ID + params = {"cluster_resource_id": test_resource_id} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + assert isinstance(result.data, dict) + 
assert result.data["azure_monitor_prometheus_enabled"] is True + assert result.data["cluster_name"] == test_cluster_name + assert result.data["prometheus_query_endpoint"] == workspace_info["prometheus_query_endpoint"] + assert "Azure Monitor managed Prometheus is enabled" in result.data["message"] + + # Verify toolset config was updated + assert self.toolset.config.azure_monitor_workspace_endpoint == workspace_info["prometheus_query_endpoint"] + assert self.toolset.config.cluster_name == test_cluster_name + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_azure_monitor_workspace_for_cluster') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.extract_cluster_name_from_resource_id') + def test_failure_prometheus_not_enabled(self, mock_extract_name, mock_get_workspace): + """Test failure when Azure Monitor Prometheus is not enabled.""" + # Mock workspace not found + test_resource_id = "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + test_cluster_name = "test-cluster" + + mock_get_workspace.return_value = None + mock_extract_name.return_value = test_cluster_name + + # Execute tool + params = {"cluster_resource_id": test_resource_id} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Azure Monitor managed Prometheus is not enabled" in result.error + assert test_cluster_name in result.error + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_aks_cluster_resource_id') + def test_failure_no_cluster_specified(self, mock_get_resource_id): + """Test failure when no cluster is specified or auto-detected.""" + # Mock auto-detection to return None + mock_get_resource_id.return_value = None + + # Execute tool without cluster resource ID + result = self.tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "No AKS cluster specified" in result.error + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_azure_monitor_workspace_for_cluster') + def test_failure_workspace_query_exception(self, mock_get_workspace): + """Test error handling during workspace query.""" + # Mock workspace query to raise exception + test_resource_id = "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + mock_get_workspace.side_effect = Exception("Azure Resource Graph query failed") + + # Execute tool + params = {"cluster_resource_id": test_resource_id} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Failed to check Azure Monitor Prometheus status" in result.error + assert "Azure Resource Graph query failed" in result.error + + +class TestExecuteAzureMonitorPrometheusQuery: + """Tests for ExecuteAzureMonitorPrometheusQuery tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig( + azure_monitor_workspace_endpoint="https://test-workspace.prometheus.monitor.azure.com/", + cluster_name="test-cluster", + tool_calls_return_data=True + ) + self.tool = ExecuteAzureMonitorPrometheusQuery(self.toolset) + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.enhance_promql_with_cluster_filter') + @patch('requests.post') + def test_success_query_with_data(self, 
mock_post, mock_enhance_query): + """Test successful PromQL query execution with data.""" + # Mock query enhancement and HTTP response + original_query = "up" + enhanced_query = 'up{cluster="test-cluster"}' + mock_enhance_query.return_value = enhanced_query + + mock_response = Mock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "status": "success", + "data": { + "resultType": "vector", + "result": [ + { + "metric": {"__name__": "up", "cluster": "test-cluster"}, + "value": [1234567890, "1"] + } + ] + } + } + mock_post.return_value = mock_response + + # Mock authenticated headers + with patch.object(self.toolset, '_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer test-token"} + + # Execute tool + params = { + "query": original_query, + "description": "Check if services are up", + "auto_cluster_filter": True + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + result_data = json.loads(result.data) + assert result_data["status"] == "success" + assert result_data["query"] == enhanced_query + assert result_data["cluster_name"] == "test-cluster" + assert result_data["auto_cluster_filter_applied"] is True + assert "data" in result_data + + # Verify HTTP call + mock_post.assert_called_once() + call_args = mock_post.call_args + assert "api/v1/query" in call_args[1]["url"] + assert call_args[1]["data"]["query"] == enhanced_query + + @patch('requests.post') + def test_success_query_no_data(self, mock_post): + """Test successful query execution but no data returned.""" + # Mock HTTP response with no data + mock_response = Mock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "status": "success", + "data": { + "resultType": "vector", + "result": [] + } + } + mock_post.return_value = mock_response + + # Mock authenticated headers + with patch.object(self.toolset, '_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer test-token"} + + # Execute tool + params = { + "query": "nonexistent_metric", + "description": "Query for nonexistent metric", + "auto_cluster_filter": False + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.NO_DATA + result_data = json.loads(result.data) + assert result_data["status"] == "no_data" + assert "no results" in result_data["error_message"] + + @patch('requests.post') + def test_failure_http_error(self, mock_post): + """Test failure due to HTTP error.""" + # Mock HTTP error response + mock_response = Mock() + mock_response.status_code = 401 + mock_response.text = "Unauthorized" + mock_post.return_value = mock_response + + # Mock authenticated headers + with patch.object(self.toolset, '_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer invalid-token"} + + # Execute tool + params = { + "query": "up", + "description": "Test query" + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "HTTP 401: Unauthorized" in result.error + + @patch('requests.post') + def test_failure_prometheus_error(self, mock_post): + """Test failure due to Prometheus query error.""" + # Mock Prometheus error response + mock_response = Mock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "status": "error", + "error": "invalid PromQL syntax" + } + mock_post.return_value = mock_response + + # Mock 
authenticated headers + with patch.object(self.toolset, '_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer test-token"} + + # Execute tool + params = { + "query": "invalid{query", + "description": "Invalid PromQL query" + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + result_data = json.loads(result.data) + assert result_data["status"] == "error" + assert "invalid PromQL syntax" in result_data["error_message"] + + def test_failure_workspace_not_configured(self): + """Test failure when Azure Monitor workspace is not configured.""" + # Create toolset without workspace configuration + toolset = AzureMonitorMetricsToolset() + toolset.config = AzureMonitorMetricsConfig() # No workspace endpoint + tool = ExecuteAzureMonitorPrometheusQuery(toolset) + + # Execute tool + params = { + "query": "up", + "description": "Test query" + } + result = tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Azure Monitor workspace is not configured" in result.error + + @patch('requests.post') + def test_failure_connection_error(self, mock_post): + """Test failure due to connection error.""" + # Mock connection error + mock_post.side_effect = RequestException("Connection timeout") + + # Mock authenticated headers + with patch.object(self.toolset, '_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer test-token"} + + # Execute tool + params = { + "query": "up", + "description": "Test query" + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Connection error to Azure Monitor workspace" in result.error + assert "Connection timeout" in result.error + + +class TestGetActivePrometheusAlerts: + """Tests for GetActivePrometheusAlerts tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig( + cluster_resource_id="/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + ) + self.tool = GetActivePrometheusAlerts(self.toolset) + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.extract_cluster_name_from_resource_id') + def test_success_multiple_alerts(self, mock_extract_name, mock_source_class): + """Test successful retrieval of multiple Prometheus alerts.""" + # Mock cluster name extraction + mock_extract_name.return_value = "test-cluster" + + # Mock alert source and issues + mock_source = Mock() + mock_source_class.return_value = mock_source + + # Create mock issues + mock_issue1 = Mock() + mock_issue1.id = "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.AlertsManagement/alerts/alert-1" + mock_issue1.name = "High CPU Usage" + mock_issue1.raw = { + "alert": { + "properties": { + "essentials": { + "alertRule": "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.Insights/prometheusRuleGroups/test-rules/rules/cpu-high", + "description": "CPU usage is above 80%", + "severity": "Sev1", + "alertState": "New", + "monitorCondition": "Fired", + "firedDateTime": "2023-01-01T12:00:00Z", + "targetResource": "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + } + } + }, + "rule_details": 
{}, + "extracted_query": "cpu_usage > 0.8", + "extracted_description": "Alert when CPU usage exceeds 80%" + } + + mock_issue2 = Mock() + mock_issue2.id = "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.AlertsManagement/alerts/alert-2" + mock_issue2.name = "Memory Pressure" + mock_issue2.raw = { + "alert": { + "properties": { + "essentials": { + "alertRule": "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.Insights/prometheusRuleGroups/test-rules/rules/memory-high", + "description": "Memory usage is above 90%", + "severity": "Sev0", + "alertState": "New", + "monitorCondition": "Fired", + "firedDateTime": "2023-01-01T12:05:00Z", + "targetResource": "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + } + } + }, + "rule_details": {}, + "extracted_query": "memory_usage > 0.9", + "extracted_description": "Alert when memory usage exceeds 90%" + } + + mock_source.fetch_issues.return_value = [mock_issue1, mock_issue2] + + # Execute tool + params = {} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + assert "Successfully found 2 active Prometheus alerts" in result.data + assert "IMPORTANT: The complete alert details" in result.data + + # Verify source was called correctly + mock_source_class.assert_called_once_with( + cluster_resource_id="/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + ) + mock_source.fetch_issues.assert_called_once() + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.extract_cluster_name_from_resource_id') + def test_success_specific_alert(self, mock_extract_name, mock_source_class): + """Test successful retrieval of specific alert by ID.""" + # Mock cluster name extraction + mock_extract_name.return_value = "test-cluster" + + # Mock alert source + mock_source = Mock() + mock_source_class.return_value = mock_source + + # Create mock issue + mock_issue = Mock() + mock_issue.id = "alert-1" + mock_issue.name = "High CPU Usage" + mock_issue.raw = { + "alert": { + "properties": { + "essentials": { + "alertRule": "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.Insights/prometheusRuleGroups/test-rules/rules/cpu-high", + "description": "CPU usage is above 80%", + "severity": "Sev1", + "alertState": "New", + "monitorCondition": "Fired", + "firedDateTime": "2023-01-01T12:00:00Z", + "targetResource": "/subscriptions/test-sub/resourceGroups/test-rg/providers/Microsoft.ContainerService/managedClusters/test-cluster" + } + } + }, + "rule_details": {}, + "extracted_query": "cpu_usage > 0.8", + "extracted_description": "Alert when CPU usage exceeds 80%" + } + + mock_source.fetch_issue.return_value = mock_issue + + # Execute tool with specific alert ID + params = {"alert_id": "alert-1"} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + assert "Successfully found 1 active Prometheus alerts" in result.data + + # Verify source was called correctly + mock_source.fetch_issue.assert_called_once_with("alert-1") + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.extract_cluster_name_from_resource_id') + def test_success_no_alerts_found(self, mock_extract_name, 
mock_source_class): + """Test successful execution but no alerts found.""" + # Mock cluster name extraction + mock_extract_name.return_value = "test-cluster" + + # Mock alert source with no alerts + mock_source = Mock() + mock_source_class.return_value = mock_source + mock_source.fetch_issues.return_value = [] + + # Execute tool + params = {} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.NO_DATA + assert "No active Prometheus metric alerts found" in result.data + assert "test-cluster" in result.data + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.get_aks_cluster_resource_id') + def test_failure_no_cluster_specified(self, mock_get_resource_id): + """Test failure when no cluster is specified or auto-detected.""" + # Mock auto-detection to return None + mock_get_resource_id.return_value = None + + # Create toolset without cluster configuration + toolset = AzureMonitorMetricsToolset() + toolset.config = AzureMonitorMetricsConfig() + tool = GetActivePrometheusAlerts(toolset) + + # Execute tool + result = tool._invoke({}) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "No AKS cluster specified" in result.error + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + def test_failure_source_plugin_error(self, mock_source_class): + """Test failure due to source plugin error.""" + # Mock source to raise exception + mock_source_class.side_effect = Exception("Azure authentication failed") + + # Execute tool + params = {} + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Failed to fetch alerts using source plugin" in result.error + assert "Azure authentication failed" in result.error + + +class TestExecuteAzureMonitorPrometheusRangeQuery: + """Tests for ExecuteAzureMonitorPrometheusRangeQuery tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig( + azure_monitor_workspace_endpoint="https://test-workspace.prometheus.monitor.azure.com/", + cluster_name="test-cluster", + tool_calls_return_data=True, + default_step_seconds=3600, + min_step_seconds=60, + max_data_points=1000 + ) + self.tool = ExecuteAzureMonitorPrometheusRangeQuery(self.toolset) + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.enhance_promql_with_cluster_filter') + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.process_timestamps_to_rfc3339') + @patch('requests.post') + def test_success_range_query(self, mock_post, mock_process_timestamps, mock_enhance_query): + """Test successful PromQL range query execution.""" + # Mock query enhancement and timestamp processing + original_query = "rate(cpu_usage[5m])" + enhanced_query = 'rate(cpu_usage{cluster="test-cluster"}[5m])' + mock_enhance_query.return_value = enhanced_query + mock_process_timestamps.return_value = ("2023-01-01T11:00:00Z", "2023-01-01T12:00:00Z") + + # Mock HTTP response + mock_response = Mock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "status": "success", + "data": { + "resultType": "matrix", + "result": [ + { + "metric": {"__name__": "cpu_usage", "cluster": "test-cluster"}, + "values": [ + [1672574400, "0.5"], + [1672578000, "0.6"] + ] + } + ] + } + } + mock_post.return_value = mock_response + + # Mock authenticated headers + with patch.object(self.toolset, 
'_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer test-token"} + + # Execute tool + params = { + "query": original_query, + "description": "CPU usage rate over time", + "start": "2023-01-01T11:00:00Z", + "end": "2023-01-01T12:00:00Z", + "step": 3600, + "output_type": "Percentage", + "auto_cluster_filter": True + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + result_data = json.loads(result.data) + assert result_data["status"] == "success" + assert result_data["query"] == enhanced_query + assert result_data["step"] == 3600 + assert result_data["output_type"] == "Percentage" + assert result_data["auto_cluster_filter_applied"] is True + + # Verify HTTP call + mock_post.assert_called_once() + call_args = mock_post.call_args + assert "api/v1/query_range" in call_args[1]["url"] + assert call_args[1]["data"]["query"] == enhanced_query + assert call_args[1]["data"]["step"] == 3600 + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.process_timestamps_to_rfc3339') + def test_step_size_calculation(self, mock_process_timestamps): + """Test optimal step size calculation.""" + # Mock timestamp processing for 24-hour range + mock_process_timestamps.return_value = ("2023-01-01T00:00:00Z", "2023-01-02T00:00:00Z") + + # Test step size calculation + params = { + "query": "up", + "description": "Test query", + "output_type": "Plain" + } + step = self.tool._calculate_optimal_step_size(params, "2023-01-01T00:00:00Z", "2023-01-02T00:00:00Z") + + # For 24-hour range, should use default step (3600s) + assert step == 3600 + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.process_timestamps_to_rfc3339') + def test_step_size_with_user_override(self, mock_process_timestamps): + """Test step size calculation with user-provided step.""" + # Mock timestamp processing + mock_process_timestamps.return_value = ("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z") + + # Test with user-provided step + params = { + "query": "up", + "description": "Test query", + "step": 300, # 5 minutes + "output_type": "Plain" + } + step = self.tool._calculate_optimal_step_size(params, "2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z") + + # Should use user-provided step + assert step == 300 + + @patch('holmes.plugins.toolsets.azuremonitor_metrics.azuremonitor_metrics_toolset.process_timestamps_to_rfc3339') + def test_step_size_minimum_enforcement(self, mock_process_timestamps): + """Test that minimum step size is enforced.""" + # Mock timestamp processing + mock_process_timestamps.return_value = ("2023-01-01T00:00:00Z", "2023-01-01T00:05:00Z") + + # Test with step below minimum + params = { + "query": "up", + "description": "Test query", + "step": 30, # 30 seconds, below min of 60 + "output_type": "Plain" + } + step = self.tool._calculate_optimal_step_size(params, "2023-01-01T00:00:00Z", "2023-01-01T00:05:00Z") + + # Should enforce minimum step size + assert step == 60 + + def test_failure_workspace_not_configured(self): + """Test failure when Azure Monitor workspace is not configured.""" + # Create toolset without workspace configuration + toolset = AzureMonitorMetricsToolset() + toolset.config = AzureMonitorMetricsConfig() # No workspace endpoint + tool = ExecuteAzureMonitorPrometheusRangeQuery(toolset) + + # Execute tool + params = { + "query": "up", + "description": "Test query", + "output_type": "Plain" + } + result = tool._invoke(params) + + # Verify result 
+ assert result.status == ToolResultStatus.ERROR + assert "Azure Monitor workspace is not configured" in result.error + + +class TestExecuteAlertPromQLQuery: + """Tests for ExecuteAlertPromQLQuery tool.""" + + def setup_method(self): + """Set up test fixtures.""" + self.toolset = AzureMonitorMetricsToolset() + self.toolset.config = AzureMonitorMetricsConfig( + azure_monitor_workspace_endpoint="https://test-workspace.prometheus.monitor.azure.com/", + cluster_name="test-cluster", + tool_calls_return_data=True + ) + self.tool = ExecuteAlertPromQLQuery(self.toolset) + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + @patch('requests.post') + def test_success_execute_alert_query(self, mock_post, mock_source_class): + """Test successful execution of alert's PromQL query.""" + # Mock alert source and issue + mock_source = Mock() + mock_source_class.return_value = mock_source + + mock_issue = Mock() + mock_issue.id = "alert-1" + mock_issue.name = "High CPU Usage" + mock_issue.raw = { + "extracted_query": "cpu_usage > 0.8" + } + mock_source.fetch_issue.return_value = mock_issue + + # Mock HTTP response + mock_response = Mock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "status": "success", + "data": { + "resultType": "matrix", + "result": [ + { + "metric": {"__name__": "cpu_usage"}, + "values": [[1672574400, "0.9"]] + } + ] + } + } + mock_post.return_value = mock_response + + # Mock authenticated headers + with patch.object(self.toolset, '_get_authenticated_headers') as mock_headers: + mock_headers.return_value = {"Authorization": "Bearer test-token"} + + # Execute tool + params = { + "alert_id": "alert-1", + "time_range": "1h" + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.SUCCESS + result_data = json.loads(result.data) + assert result_data["status"] == "success" + assert result_data["alert_id"] == "alert-1" + assert result_data["alert_name"] == "High CPU Usage" + assert result_data["extracted_query"] == "cpu_usage > 0.8" + assert result_data["time_range"] == "1h" + + # Verify HTTP call was made to query_range endpoint + mock_post.assert_called_once() + call_args = mock_post.call_args + assert "api/v1/query_range" in call_args[1]["url"] + assert call_args[1]["data"]["query"] == "cpu_usage > 0.8" + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + def test_failure_alert_not_found(self, mock_source_class): + """Test failure when alert is not found.""" + # Mock alert source to return None + mock_source = Mock() + mock_source_class.return_value = mock_source + mock_source.fetch_issue.return_value = None + + # Execute tool + params = { + "alert_id": "nonexistent-alert", + "time_range": "1h" + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Alert nonexistent-alert not found" in result.error + + @patch('holmes.plugins.sources.azuremonitoralerts.AzureMonitorAlertsSource') + def test_failure_no_query_extracted(self, mock_source_class): + """Test failure when no PromQL query can be extracted from alert.""" + # Mock alert source and issue without query + mock_source = Mock() + mock_source_class.return_value = mock_source + + mock_issue = Mock() + mock_issue.id = "alert-1" + mock_issue.name = "Test Alert" + mock_issue.raw = { + "extracted_query": "Query not available" + } + mock_source.fetch_issue.return_value = mock_issue + + # Execute tool + params = { + "alert_id": "alert-1", + 
"time_range": "1h" + } + result = self.tool._invoke(params) + + # Verify result + assert result.status == ToolResultStatus.ERROR + assert "Could not extract PromQL query from alert" in result.error + + def test_parse_time_range_valid_formats(self): + """Test time range parsing with valid formats.""" + assert self.tool._parse_time_range("1h") == 3600 + assert self.tool._parse_time_range("30m") == 1800 + assert self.tool._parse_time_range("24h") == 86400 + assert self.tool._parse_time_range("1d") == 86400 + assert self.tool._parse_time_range("120s") == 120 + + def test_parse_time_range_invalid_formats(self): + """Test time range parsing with invalid formats.""" + assert self.tool._parse_time_range("invalid") is None + assert self.tool._parse_time_range("1x") is None + assert self.tool._parse_time_range("") is None + assert self.tool._parse_time_range("1.5h") is None