🚨 P0: Metrics Collection Infrastructure Not Operational #7687

@github-actions

Problem

The Metrics Collector workflow infrastructure is not producing expected output, preventing all meta-orchestrators from performing health analysis.

Missing Data

  1. Latest metrics file not found: /tmp/gh-aw/repo-memory-default/memory/meta-orchestrators/metrics/latest.json
  2. Historical metrics unavailable: /tmp/gh-aw/repo-memory-default/memory/meta-orchestrators/metrics/daily/*.json
  3. Repo memory access denied: Permission issues accessing shared memory paths
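The three conditions above can be checked in one pass. The following is a minimal sketch, assuming the directory layout quoted in this issue; the helper name `check_metrics_paths` and the shape of its result are hypothetical, not part of any existing tool.

```python
from pathlib import Path

# Directory quoted from this issue; adjust if the repo-memory mount differs.
METRICS_DIR = Path("/tmp/gh-aw/repo-memory-default/memory/meta-orchestrators/metrics")


def check_metrics_paths(metrics_dir: Path = METRICS_DIR) -> dict:
    """Report which of the 'Missing Data' conditions currently hold (hypothetical helper)."""
    result = {"latest_exists": False, "daily_file_count": 0, "access_denied": False}
    try:
        # Condition 1: is latest.json present?
        result["latest_exists"] = (metrics_dir / "latest.json").is_file()
        # Condition 2: how many daily snapshots are present?
        daily_dir = metrics_dir / "daily"
        if daily_dir.is_dir():
            result["daily_file_count"] = sum(1 for _ in daily_dir.glob("*.json"))
    except PermissionError:
        # Condition 3: the path exists but the current user cannot inspect it.
        result["access_denied"] = True
    return result
```

Calling `check_metrics_paths()` with no arguments inspects the default path; passing another `Path` makes the check usable against a local checkout of the memory branch.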

Impact

All meta-orchestrators affected:

  • Workflow Health Manager - Cannot assess workflow success rates or detect failures
  • Agent Performance Analyzer - Cannot analyze agent quality trends
  • Campaign Manager - Cannot track campaign health metrics
  • Other workflows depending on the shared metrics infrastructure

Without metrics data, we cannot:

  • Detect failing workflows proactively
  • Calculate success rates or mean time between failures (MTBF)
  • Identify error patterns
  • Track performance trends
  • Make data-driven optimization decisions

Root Cause Analysis Needed

Possible Issues

  1. Metrics Collector workflow failing

    • Not running on schedule (daily)
    • Encountering errors during execution
    • Timeout or resource constraints
  2. Repo memory configuration

    • Branch memory/meta-orchestrators not accessible
    • Permission issues on repo-memory tool
    • File path or glob pattern misconfiguration
  3. File system permissions

    • /tmp/gh-aw/repo-memory-default/ permissions incorrect
    • Memory mount not working in workflow environment

Investigation Steps

  1. Check Metrics Collector status

    gh run list --workflow=metrics-collector.md --limit 10
  2. Review recent run logs

    gh run view <run-id> --log
  3. Verify repo-memory branch

    git ls-remote origin memory/meta-orchestrators
  4. Test repo-memory access

    • Run simple workflow that writes to repo-memory
    • Verify files are committed to branch
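Step 1 can be made less manual by asking `gh run list` for JSON and counting conclusions. This is a rough sketch, assuming the `gh` CLI is installed and authenticated; the function names `fetch_runs` and `summarize_runs` are hypothetical, and the workflow name is copied from step 1 as written.

```python
import json
import subprocess
from collections import Counter


def fetch_runs(workflow: str = "metrics-collector.md", limit: int = 10) -> list:
    """Return recent runs as dicts via `gh run list --json` (needs an authenticated gh)."""
    completed = subprocess.run(
        ["gh", "run", "list", "--workflow", workflow, "--limit", str(limit),
         "--json", "databaseId,status,conclusion,createdAt"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def summarize_runs(runs: list) -> dict:
    """Count runs per conclusion; runs without a conclusion are grouped separately."""
    counts = Counter(run.get("conclusion") or "no_conclusion" for run in runs)
    return {"total": len(runs), "by_conclusion": dict(counts)}
```

A result with `"total": 0` would point at the scheduling hypothesis (the workflow is not running at all), while a high `failure` count would point at errors during execution.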

Expected Metrics Format

The Metrics Collector should produce:

latest.json:

{
  "timestamp": "2025-12-26T00:00:00Z",
  "workflows": {
    "workflow-name": {
      "total_runs": 10,
      "successful_runs": 8,
      "failed_runs": 2,
      "success_rate": 0.80,
      "avg_duration_seconds": 120
    }
  }
}

daily/YYYY-MM-DD.json: Same format as latest.json, one file per day, covering the last 30 days
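Consumers of these files could validate them before use. The sketch below checks only the fields shown in the example above; the rule that `success_rate` equals `successful_runs / total_runs` is an assumption inferred from that example (8 / 10 = 0.80), and `validate_metrics` is a hypothetical helper name.

```python
import math

# Per-workflow fields taken from the example in this issue.
REQUIRED_FIELDS = (
    "total_runs",
    "successful_runs",
    "failed_runs",
    "success_rate",
    "avg_duration_seconds",
)


def validate_metrics(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means the document looks valid."""
    problems = []
    if "timestamp" not in doc:
        problems.append("missing top-level 'timestamp'")
    workflows = doc.get("workflows")
    if not isinstance(workflows, dict):
        problems.append("'workflows' must be an object")
        return problems
    for name, stats in workflows.items():
        if not isinstance(stats, dict):
            problems.append(f"{name}: entry must be an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in stats]
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
            continue
        total = stats["total_runs"]
        # Assumed invariant: success_rate is successful_runs / total_runs.
        if total > 0:
            expected = stats["successful_runs"] / total
            if not math.isclose(stats["success_rate"], expected, abs_tol=0.005):
                problems.append(f"{name}: success_rate does not match run counts")
    return problems
```

Returning a list of problems instead of raising on the first one lets a meta-orchestrator report every schema issue in a single run.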

Recommended Fix

  1. Verify Metrics Collector workflow is running successfully
  2. Fix repo-memory permissions if access is blocked
  3. Update metrics collection if format changed
  4. Document metrics schema for consistency across meta-orchestrators

Priority Justification

P0 (Critical) because:

  • Blocks all meta-orchestrator health monitoring
  • Prevents proactive failure detection across 124 workflows
  • No workaround available - metrics are the foundation for health assessment
  • Affects the reliability of the entire agentic workflow ecosystem

Success Criteria

✅ Metrics Collector runs successfully on daily schedule
✅ latest.json appears in expected location
✅ Historical daily metrics available for 30-day analysis
✅ Workflow Health Manager can access and parse metrics
✅ All meta-orchestrators resume normal operation
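The 30-day criterion can be verified by listing which daily files are absent. This is a small sketch, assuming files are named `YYYY-MM-DD.json` as described above and that the window includes the current day; `missing_daily_files` is a hypothetical helper name.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_daily_files(daily_dir: Path, today: date, days: int = 30) -> list:
    """Return ISO dates (newest first) in the window that have no daily/<date>.json file."""
    present = {path.stem for path in daily_dir.glob("*.json")}
    # Assumption: the window covers `today` and the preceding `days - 1` days.
    expected = [(today - timedelta(days=offset)).isoformat() for offset in range(days)]
    return [day for day in expected if day not in present]
```

An empty result means the full window is covered; right after the collector is repaired, the list is expected to shrink by one entry per successful daily run.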


Discovered by: Workflow Health Manager
Run ID: 20514768306
Date: 2025-12-26 02:53 UTC

AI generated by Workflow Health Manager - Meta-Orchestrator
