APM cache key collides across reusable workflows when called from a downstream repo #30365

@theletterf

Problem

When a reusable workflow that imports skills via shared/apm.md is called from a downstream repo (i.e., the lock file lives in org/library, but the workflow runs in org/caller), the APM cache key collapses to a constant value across all APM-importing workflows in the library. As soon as the library has more than one such workflow with different packages:, they overwrite each other's cached bundles, and the consumer that lost the race ends up with Skill not found at runtime.

What the lock file generates

- id: apm_cache
  uses: actions/cache/save@…
  with:
    key: apm-${{ needs.activation.outputs.engine_id }}-${{ hashFiles('.github/workflows/*.lock.yml') }}
    path: /tmp/gh-aw/apm-workspace

hashFiles('.github/workflows/*.lock.yml') is evaluated against the caller's workspace — and most callers don't carry the library's lock files. They simply do:

jobs:
  review:
    uses: org/library/.github/workflows/foo.lock.yml@v1

so hashFiles(...) returns "" and the key resolves to literally apm-<engine>- with an empty trailing segment. Every APM-importing workflow in the library gets that same key.
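
The empty hash can be confirmed from a caller repo with a throwaway debug step. The following is only a sketch: the step name and the LOCK_HASH variable are hypothetical, and it assumes the step runs in a job that has already checked out the caller repository.

```yaml
# Hypothetical debug step, not part of the generated lock file.
- name: Inspect APM cache key inputs
  shell: bash
  env:
    # Same expression the generated key uses, evaluated in the caller's workspace.
    LOCK_HASH: ${{ hashFiles('.github/workflows/*.lock.yml') }}
  run: |
    # hashFiles() returns an empty string when the pattern matches no files.
    if [ -z "$LOCK_HASH" ]; then
      echo "No *.lock.yml files matched in $GITHUB_WORKSPACE; the key suffix is empty"
    else
      echo "Lock file hash: $LOCK_HASH"
    fi
```

In a caller such as the one described here, the first branch is the one we would expect to see.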

Reproduction (real)

In elastic/docs-actions we have several reusable workflows that import skills via APM (docs-review, docs-frontmatter-sweep, docs-applies-to-sweep, docs-openings-sweep, docs-style-sweep). Each has a distinct packages: list. They are all invoked from elastic/docs-content, which has no *.lock.yml files in its own .github/workflows/.

In every run, both the apm job and the agent job log:

key: apm-copilot-
Cache hit for: apm-copilot-
Cache restored from key: apm-copilot-

While only one APM-importing workflow existed, this was benign — same producer, same consumer, same bundle. After we added several more, the cached bundle on apm-copilot- is whichever workflow saved last. When the next workflow's agent job extracts that bundle, the skills it actually needs aren't there:

✗ skill(docs-check-style) Skill not found: docs-check-style
✗ skill(docs-flag-jargon-skill) Skill not found: docs-flag-jargon-skill
✗ skill(docs-frontmatter-audit) Skill not found: docs-frontmatter-audit
✗ skill(docs-content-type-checker) Skill not found: docs-content-type-checker
✗ skill(docs-applies-to-tagging) Skill not found: docs-applies-to-tagging

(Run: https://github.com/elastic/docs-content/actions/runs/25379248158, Docs AI / docs review / agent job — but you can see the same apm-copilot- key in any of our workflow runs.)

A docs-content PR run from late April with only one APM-importing workflow (docs-review, the same workflow currently failing) used the identical apm-copilot- key and worked, because no other workflow was overwriting that cache entry. So the regression is purely additive — adding a second APM-importing reusable workflow with a different package list silently breaks the first.

Suggested fix

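Make the cache key reflect the bundle contents, not the caller's filesystem. Hashing AW_APM_PACKAGES (the inlined package list in the lock file) would be both correct and sufficient. In the line below, hashFiles_or_hash is a placeholder for whatever string hashing the compiler can emit; it is not an existing expression function: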

key: apm-${{ needs.activation.outputs.engine_id }}-${{ hashFiles('.github/workflows/*.lock.yml') || '' }}-${{ hashFiles_or_hash(AW_APM_PACKAGES) }}

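Or simpler: include ${{ github.workflow }} (or a stable workflow_id if available) as a discriminator. One caveat: inside a reusable workflow the github context is that of the caller, so github.workflow only separates cache slots when each library workflow is invoked from a differently named caller workflow. The goal in either case is to prevent two workflows with different package lists from sharing a cache slot.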
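
Because the expression language has no built-in function for hashing a string, the package-list option would need an explicit hashing step. A minimal sketch, assuming AW_APM_PACKAGES is available as an environment variable in the job; the step id apm_packages_hash and the output name hash are hypothetical, and this is not what the compiler generates today:

```yaml
# Hypothetical step: derive a digest from the inlined package list.
- id: apm_packages_hash
  shell: bash
  run: |
    # Assumes AW_APM_PACKAGES is set in the job environment.
    digest="$(printf '%s' "$AW_APM_PACKAGES" | sha256sum | cut -d' ' -f1)"
    echo "hash=$digest" >> "$GITHUB_OUTPUT"

# Existing cache step, with the caller-dependent hashFiles() segment
# replaced by the package-list digest.
- id: apm_cache
  uses: actions/cache/save@…
  with:
    key: apm-${{ needs.activation.outputs.engine_id }}-${{ steps.apm_packages_hash.outputs.hash }}
    path: /tmp/gh-aw/apm-workspace
```

The restore side in the agent job would have to derive the same digest so that both jobs agree on the key. Two workflows with identical package lists would still share a slot, which is harmless because they pack the same bundle.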

Workaround we are applying meanwhile

We are aligning every APM-importing workflow's packages: to the union of all skills, so that all five workflows pack the same bundle and cache collisions become benign. This works but is a maintenance tax — every new skill anywhere has to be added to every workflow.

Versions

  • gh-aw: v0.71.1 (and v0.71.0, v0.71.4 — same key formula, all affected)
  • caller: elastic/docs-content
  • library: elastic/docs-actions
