
Harden autonomous-skill runner for resume, task isolation, and long-running sessions #3

Open
jiangqilong41-ops wants to merge 1 commit into feiskyer:main from jiangqilong41-ops:codex/autonomous-skill-hardening

Conversation

@jiangqilong41-ops

Summary

This PR hardens skills/autonomous-skill for real long-running, resumable workflows.

It fixes several correctness and operability issues in run-session.sh, updates the docs/prompts to match the actual behavior, and adds a regression script covering the most failure-prone paths.

What changed

  • resolve .autonomous/ under the workspace root
  • make task name generation safe for non-ASCII descriptions
  • validate runner arguments early
  • add per-task runner locking
  • add resume fallback when codex exec resume fails
  • stop inlining full task_list.md / progress.md into the CLI prompt
  • verify initializer output before declaring success
  • fix progress counting so only real Task N: items are counted
  • improve recent-task selection to ignore invalid entries
  • update docs and executor prompt to match actual runner behavior
  • add a regression script for the key failure paths
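
The per-task locking item can be pictured with a minimal sketch. This is not the PR's code: the function names `acquire_runner_lock` and `release_runner_lock` are hypothetical, and only the `.runner.lock/pid` layout follows the regression test quoted in the review. It assumes the lock directory never holds anything except the `pid` file.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating a directory-based lock with stale recovery.

# Take the lock for one task directory.
# Returns 0 on success, 1 if a live runner already holds it.
acquire_runner_lock() {
    local lock_dir="$1/.runner.lock"
    local attempt owner_pid
    for attempt in 1 2; do
        # mkdir either creates the directory or fails; two callers cannot both win.
        if mkdir "$lock_dir" 2>/dev/null; then
            printf '%s\n' "$$" > "$lock_dir/pid"
            return 0
        fi
        owner_pid=$(cat "$lock_dir/pid" 2>/dev/null || true)
        # A recorded owner that still answers kill -0 means the lock is held.
        if [[ "$owner_pid" =~ ^[0-9]+$ ]] && kill -0 "$owner_pid" 2>/dev/null; then
            return 1
        fi
        # Stale lock: delete the one known file, then the now-empty directory.
        rm -f "$lock_dir/pid"
        rmdir "$lock_dir" 2>/dev/null || true
    done
    return 1
}

# Drop the lock. rmdir refuses to remove a non-empty directory,
# so unexpected contents are left in place rather than deleted.
release_runner_lock() {
    local lock_dir="$1/.runner.lock"
    rm -f "$lock_dir/pid"
    rmdir "$lock_dir" 2>/dev/null || true
}
```

Removing the known file and then calling `rmdir` keeps the cleanup narrow. One limitation of this sketch is the short window between `mkdir` and writing the `pid` file, during which a second runner could misread the lock as stale.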

Validation

Ran locally:

  • bash -n skills/autonomous-skill/scripts/run-session.sh
  • bash -n skills/autonomous-skill/tests/run-session-regression.sh
  • bash skills/autonomous-skill/tests/run-session-regression.sh

Regression result:

  • Passed: 31
  • Failed: 0

@feiskyer
Owner

Thanks for the enhancements. rm -rf is super dangerous; is it possible to avoid using it?

Copilot AI left a comment

Pull request overview

This PR hardens the skills/autonomous-skill session runner to better support long-running, resumable workflows by improving task isolation, state validation, and resume behavior, plus aligning prompts/docs with the updated runner behavior.

Changes:

  • Hardened run-session.sh with workspace-root .autonomous resolution, safer task-name generation/validation, per-task locking, resume fallback, and stricter task state validation/progress parsing.
  • Updated executor prompt and SKILL documentation to reflect the runner’s expected task formatting and operational guidance.
  • Added a bash regression test script covering key failure-prone paths (resume fallback, invalid state detection, task name edge cases, stale lock recovery).
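
To illustrate the stricter progress parsing, here is a minimal sketch of counting only real `Task N:` items. The checkbox layout (`- [ ] Task N: ...` and `- [x] Task N: ...`) and both function names are assumptions for illustration; the actual format expected by run-session.sh may differ.

```bash
#!/usr/bin/env bash
# Hypothetical counters; the checkbox layout is an assumed task_list.md format.

# Count every top-level checkbox line that is a numbered task.
count_total_tasks() {
    # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps the 0.
    grep -cE '^- \[[ xX]\] Task [0-9]+:' "$1" || true
}

# Count only the numbered tasks whose checkbox is ticked.
count_completed_tasks() {
    grep -cE '^- \[[xX]\] Task [0-9]+:' "$1" || true
}
```

Anchoring the pattern at the start of the line and requiring `Task <number>:` means nested checkboxes, notes, and headings do not count toward progress.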

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| skills/autonomous-skill/scripts/run-session.sh | Core runner hardening: workspace root detection, locking, resume fallback, prompt rendering changes, and task_list/progress validation. |
| skills/autonomous-skill/templates/executor-prompt.md | Aligns task counting and operational guidance with the stricter task parsing/runner behavior. |
| skills/autonomous-skill/SKILL.md | Updates usage and documentation to match workspace-root behavior and safer git hygiene guidance. |
| skills/autonomous-skill/tests/run-session-regression.sh | Adds regression coverage for runner edge cases and common failure modes. |


local case_name="stale-lock-recovery"
make_fake_codex "$TEST_ROOT/$case_name/bin" "success-with-files"
mkdir -p "$TEST_ROOT/$case_name/work/.autonomous/demo/.runner.lock"
printf '999999\n' > "$TEST_ROOT/$case_name/work/.autonomous/demo/.runner.lock/pid"

Copilot AI Mar 22, 2026

The stale-lock test hardcodes PID 999999. On some systems (or long-lived CI hosts) that PID could legitimately exist, causing the runner to treat the lock as active and making the test flaky. Consider choosing a PID guaranteed to be unused (e.g., probe with kill -0 in a loop to find a free PID) before writing the lock pid file.

Suggested change
- printf '999999\n' > "$TEST_ROOT/$case_name/work/.autonomous/demo/.runner.lock/pid"
+ # Choose a PID that is guaranteed to be unused so the lock is truly stale.
+ local fake_pid
+ while :; do
+     # Pick a PID-like number in a typical user-space range.
+     fake_pid=$(( (RANDOM % 64512) + 1000 ))
+     # If no process has this PID, use it.
+     if ! kill -0 "$fake_pid" 2>/dev/null; then
+         break
+     fi
+ done
+ printf '%s\n' "$fake_pid" > "$TEST_ROOT/$case_name/work/.autonomous/demo/.runner.lock/pid"
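
An alternative to probing random numbers is to use the PID of a process that is known to have exited. The sketch below is one hedged option, not part of the PR; `dead_pid` is a hypothetical helper name. It is relevant because `kill -0` also fails for a live process the caller is not permitted to signal, so a failed probe alone does not prove that a PID is free.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the PID of a child that has already been reaped.
dead_pid() {
    true &            # start a trivial background child
    local pid=$!      # remember its PID
    wait "$pid"       # reap it, so the process no longer exists
    printf '%s\n' "$pid"
}
```

In the test this could be used as `fake_pid=$(dead_pid)` before writing the lock's pid file. The kernel may eventually recycle the number, but that is unlikely within the lifetime of a short test.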

echo ""
echo "Options:"
- echo " --task-name <name> Specify task name explicitly"
+ echo " --task-name <name> Specify task name explicitly (must start with A-Z a-z 0-9; safe chars: . _ -)"

Copilot AI Mar 22, 2026

The --task-name help text says the name "must start with A-Z a-z 0-9", but validate_task_name currently accepts a leading underscore (the first character may be any allowed character except '-' or '.'). Either align the help text with the actual validation, or tighten the validation to require an alphanumeric first character, so that users are not confused and task directory names stay consistent.

Suggested change
- echo " --task-name <name> Specify task name explicitly (must start with A-Z a-z 0-9; safe chars: . _ -)"
+ echo " --task-name <name> Specify task name explicitly (must not start with '.' or '-'; allowed chars: A-Z a-z 0-9 _ . -)"
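
If the other option is preferred (tightening validation instead of relaxing the help text), a check along these lines would match the original wording. This is a sketch under that assumption; `is_valid_task_name` is a hypothetical name and not the PR's `validate_task_name`.

```bash
#!/usr/bin/env bash
# Hypothetical stricter check: first character alphanumeric,
# remaining characters limited to A-Z a-z 0-9 . _ -
is_valid_task_name() {
    local name=$1
    # Force byte-wise ranges so [A-Za-z] does not match locale-specific letters.
    local LC_ALL=C
    local re='^[A-Za-z0-9][A-Za-z0-9._-]*$'
    [[ "$name" =~ $re ]]
}
```

With this rule, `demo` and `task-1.a_b` pass, while `_draft`, `.hidden`, `-flag`, and the empty string are rejected.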

Comment on lines 214 to 220
CODEX_ARGS=(
    -c 'approval_policy="never"'
    -c "sandbox_mode=\"$sandbox_mode\""
    --skip-git-repo-check
    --json
    -C "$EXECUTION_CWD"
)

Copilot AI Mar 22, 2026

build_codex_args adds a -C "$EXECUTION_CWD" flag to every codex exec invocation. Since this flag is not used anywhere else in the repo, if a user's Codex CLI doesn't support -C the runner will fail immediately. Consider avoiding reliance on a nonstandard flag (e.g., run codex in a subshell after cd "$EXECUTION_CWD") or add a compatibility check/fallback.
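
The subshell approach mentioned here can be sketched as a small wrapper. `run_in_dir` is a hypothetical helper name, and the commented usage line assumes a `$prompt` variable that is not shown in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command from a given directory without
# changing the caller's working directory.
run_in_dir() {
    local dir=$1
    shift
    # The parentheses start a subshell, so the cd does not leak out.
    # If cd fails, the command is not run and the failure status is returned.
    ( cd "$dir" && "$@" )
}

# Possible usage in the runner (names other than run_in_dir come from the diff above):
#   run_in_dir "$EXECUTION_CWD" codex exec "${CODEX_ARGS[@]}" "$prompt"
```

This keeps the working-directory change independent of any particular CLI flag, at the cost of one extra subshell per invocation.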

Comment on lines +546 to +548
LAST_CODEX_EXIT=${PIPESTATUS[0]}
set -e


Copilot AI Mar 22, 2026

run_codex_capture captures only ${PIPESTATUS[0]} (Codex exit code) and ignores failures from tee/log persistence. If tee partially fails (e.g., disk full) Codex may appear successful while logs/session ID extraction are incomplete, which undermines resumability. Consider checking PIPESTATUS[1] as well (and failing the session) or writing output to the temp log without tee and then streaming it separately.

Suggested change
- LAST_CODEX_EXIT=${PIPESTATUS[0]}
- set -e
+ # Snapshot PIPESTATUS in one statement; any later command overwrites it.
+ local -a pipe_status=("${PIPESTATUS[@]}")
+ local codex_exit=${pipe_status[0]}
+ local tee_exit=${pipe_status[1]}
+ LAST_CODEX_EXIT=$codex_exit
+ set -e
+ if [ "$tee_exit" -ne 0 ]; then
+     print_error "Log persistence failed (tee exit code: $tee_exit) for task directory '$task_dir'"
+     return "$tee_exit"
+ fi
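
One detail to watch when checking both ends of the pipeline: bash resets `PIPESTATUS` after every command, including a plain assignment, so reading `PIPESTATUS[0]` and `PIPESTATUS[1]` in two separate statements loses the second value. The self-contained sketch below copies the array once. `run_and_tee`, `LAST_CMD_EXIT`, and `LAST_TEE_EXIT` are hypothetical names, and the function assumes errexit is off when it is called, as in the runner where `set -e` is restored afterwards.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command, append its output to a log via tee,
# and record the exit status of both the command and tee.
run_and_tee() {
    local log_file=$1
    shift
    "$@" 2>&1 | tee -a "$log_file"
    # Copy the whole array immediately; the very next command resets it.
    local -a statuses=("${PIPESTATUS[@]}")
    LAST_CMD_EXIT=${statuses[0]}
    LAST_TEE_EXIT=${statuses[1]}
}
```

A caller can then treat a non-zero `LAST_TEE_EXIT` as a failed session even when the command itself succeeded.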


```bash
- git add .
+ git add {TASK_DIR}/task_list.md {TASK_DIR}/progress.md [project-files...]

Copilot AI Mar 22, 2026

The git add ... [project-files...] line looks like a literal command inside a bash code block, but [project-files...] is not valid shell syntax and will break if copy/pasted. Consider converting it into a comment placeholder (e.g., # add your project files here) or providing a concrete example list of paths.

Suggested change
- git add {TASK_DIR}/task_list.md {TASK_DIR}/progress.md [project-files...]
+ git add {TASK_DIR}/task_list.md {TASK_DIR}/progress.md src/ tests/


3 participants