Skip to content

feat(sdlc): SDLC state persistence layer (#1704)#1813

Draft
tomgreen981111-cipher wants to merge 1 commit into
Anantys-oss:mainfrom
tomgreen981111-cipher:koan/sdlc-state-persistence-1704
Draft

feat(sdlc): SDLC state persistence layer (#1704)#1813
tomgreen981111-cipher wants to merge 1 commit into
Anantys-oss:mainfrom
tomgreen981111-cipher:koan/sdlc-state-persistence-1704

Conversation

@tomgreen981111-cipher

Copy link
Copy Markdown
Contributor

What

Add sdlc_state.py — the persistence foundation for the /sdlc multi-phase orchestration skill.

Why

Issue #1704: without durable cross-mission state, SDLC phases are isolated runs that share zero structured data. A failed mid-implementation run loses all research and architecture work, the orchestrator can't make data-driven routing decisions, and the approval checkpoint (#1706) has nowhere to persist its approved flag.

This is the prerequisite for #1707 (/sdlc orchestrator skill) and #1706 (human approval checkpoint).

How

  • SdlcPhase str-enum (RESEARCH → PRODUCTION_READY / ABANDONED) with is_terminal property
  • SdlcRiskLevel enum (Low/Medium/High)
  • SdlcState dataclass with to_dict()/from_dict() round-trip for all field types (lists, dicts, enums, bool)
  • Workspace layout: instance/sdlc/{issue_name}/STATE.json + named artifact files (RESEARCH.md, ADR.md, PLAN.md, IMPLEMENTATION.md, SECURITY.md, QA.md, SRE.md, REVIEW.md, DOCS.md)
  • save_sdlc_state uses atomic_write_json (temp + rename + fcntl.flock) — crash mid-write never corrupts existing state
  • archive_sdlc_workspace moves terminal workspaces to sdlc/_archived/ without clobbering prior archives (timestamp suffix)
  • list_sdlc_workspaces returns active (non-archived) workspace names
  • CLAUDE.md updated: instance directory section + key modules section

Testing

55 unit tests covering: phase/risk enum round-trips, from_dict with unknown/missing fields, get_sdlc_workspace idempotency and path safety, load_sdlc_state resilience (missing workspace, corrupt JSON, empty file), save/load cycle with multiple isolated issues, artifact path computation, archive behaviour for terminal vs active phases, and list excluding _archived.

Closes #1704

…tracking (Anantys-oss#1704)

Implements instance/sdlc/{issue_name}/ workspace layout with atomic
STATE.json reads/writes, phase artifact paths, and workspace archival
for terminal phases (PRODUCTION_READY, ABANDONED).

Foundation required by Anantys-oss#1707 (/sdlc orchestrator) and Anantys-oss#1706 (approval
checkpoint) — both need a place to read and persist phase state between
missions.
@BabyKoan

Copy link
Copy Markdown
Collaborator

PR Review — feat(sdlc): SDLC state persistence layer (#1704)

Solid foundation for SDLC state persistence with comprehensive test coverage. Three warnings need attention before merge.

  • Exception handling too broad (bare except:) - makes debugging difficult
  • Input validation gaps in from_dict - non-numeric fix_iteration would crash
  • Path sanitization works but needs explicit test for traversal resistance
  • Docstring claims about atomic locking need verification against utils.py implementation
  • No concurrent access tests despite claiming OS-level writer serialization

🟡 Important

1. Path traversal risk in _sanitise_issue_name (`koan/app/sdlc_state.py`, L201-204)

The sanitization replaces unsafe chars but doesn't prevent all path traversal. After sanitization, .. becomes __ which is safe, but the logic should be explicit about this guarantee.

More concerning: the function strips ._ from ends but a name like ....//....etc becomes __etc - safe but the transformation is non-obvious. Add a test asserting that sanitized names cannot escape the sdlc directory via path traversal.

def _sanitise_issue_name(issue_name: str) -> str:
    """Normalise *issue_name* for use as a directory component.

    Replaces characters unsafe in file paths with underscores and strips
    leading/trailing whitespace, dots, and underscores.
    """
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in issue_name.strip())
    return safe.strip("._") or "unnamed"

🟢 Suggestions

1. Docstring overstates atomic_write_json guarantees (`koan/app/sdlc_state.py`, L170-172)

The docstring claims fcntl.flock serializes writers, but atomic_write_json in utils.py may not implement full file locking across processes. Verify that atomic_write_json actually uses fcntl.flock with exclusive locks before claiming "last writer wins" semantics.

If the locking is per-process only, concurrent SDLC runs could still corrupt state.

Concurrent SDLC runs for the same issue_name will race on STATE.json.
``save_sdlc_state`` uses ``atomic_write_json`` (temp + rename + fcntl.flock)
which serializes writers at the OS level — last writer wins.

Checklist

  • No hardcoded secrets
  • Input validation at boundaries — warning #2
  • Error handling specificity — warning #1
  • Path traversal prevention
  • Test coverage for edge cases — suggestion #6
  • Resource cleanup / leaks
  • Python mutable defaults

To rebase specific severity levels, use: /rebase <url> critical (fixes 🔴 only), /rebase <url> important (fixes 🔴 + 🟡), or just /rebase <url> for all.


Silent Failure Analysis

🟡 **MEDIUM** — silent null return on error (`koan/app/sdlc_state.py:108-118`)

Risk: Corrupt JSON or read errors are silently converted to None, making it impossible for callers to distinguish between 'not started' and 'data corruption'.

try:
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return SdlcState.from_dict(data)
except (json.JSONDecodeError, OSError):
    return None

Fix: Log the exception before returning None, or raise a custom exception for callers that need to handle corruption differently from 'not started'.

🟡 **MEDIUM** — silent fallback on deserialization failure (`koan/app/sdlc_state.py:127-137`)

Risk: Invalid enum values in persisted state are silently replaced with defaults, potentially masking data corruption or version mismatches.

try:
    phase = SdlcPhase(data.get("current_phase", "research"))
except ValueError:
    phase = SdlcPhase.RESEARCH

try:
    risk = SdlcRiskLevel(data.get("risk_level", "Medium"))
except ValueError:
    risk = SdlcRiskLevel.MEDIUM

Fix: Log a warning when falling back to default values, or surface the unknown value so operators can detect state file drift.

🟡 **MEDIUM** — unvalidated path construction (`koan/app/sdlc_state.py:169-174`)

Risk: The artifact_name is not validated against SDLC_ARTIFACTS before use, allowing arbitrary file paths to be constructed (potential path injection if caller is untrusted).

def get_artifact_path(
    instance_dir: str, issue_name: str, artifact_name: str
) -> Path:
    ws = Path(instance_dir) / "sdlc" / _sanitise_issue_name(issue_name)
    return ws / artifact_name

Fix: Validate artifact_name against SDLC_ARTIFACTS or use a whitelist check before constructing the path.

🟡 **MEDIUM** — test asserts on non-deterministic value (`koan/tests/test_sdlc_state.py:124-130`)

Risk: The test asserts that started_at is non-empty but doesn't validate its format, allowing malformed timestamps to pass.

def test_from_dict_missing_optional_fields(self):
    state = SdlcState.from_dict({"issue_name": "y", "description": ""})
    assert state.fix_iteration == 0
    ...
    assert state.started_at  # non-empty default

Fix: Add a regex or datetime parse assertion to validate the timestamp format matches ISO 8601.


Automated review by Kōan (Ollama-launch · model qwen3.5:cloud) HEAD=741c265 1 min

@BabyKoan BabyKoan left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@Koan-Bot

Copy link
Copy Markdown
Collaborator

@BabyKoan:

Agree with all three warnings. Most urgent: from_dict unguarded int() on L141 — ValueError escapes load_sdlc_state's except clause, crashing callers on corrupt state. Same risk on list() L142 and dict() L149. Fix from_dict defensively, then the narrow except in load_sdlc_state stays correct. Path traversal sanitization is actually solid via _sanitise_issue_name (/_, .. → stripped → "unnamed"), but get_artifact_path has no validation on the artifact component — that's the real gap.

@Koan-Bot

Copy link
Copy Markdown
Collaborator

PR Review — feat(sdlc): SDLC state persistence layer (#1704)

Well-structured persistence layer with strong test coverage, but from_dict has an unguarded int() cast that breaks the error contract of load_sdlc_state.

  • Blocking: from_dict raises ValueError/TypeError on corrupt-but-valid-JSON state files, bypassing load_sdlc_state's exception handler — callers get an unhandled crash instead of None
  • get_artifact_path docstring says artifact names must be from SDLC_ARTIFACTS but doesn't enforce it — path injection possible if callers are untrusted
  • Archive timestamp has a one-second collision window; microseconds would close it
  • Test suite needs a case for wrong-type fields to lock in the fix for the critical bug
  • Clean architecture overall: good use of atomic_write_json, proper str enum, sensible workspace layout, thorough tests for the happy path

🔴 Blocking

1. Unguarded int() cast in from_dict breaks load_sdlc_state's error contract (`koan/app/sdlc_state.py`, L141)

int(data.get("fix_iteration", 0)) raises ValueError on non-numeric strings (e.g. "abc"). This propagates uncaught through load_sdlc_state, which only catches json.JSONDecodeError and OSError.

The function is otherwise carefully defensive — enum fields get try/except with fallback. But this single unguarded cast means any STATE.json with "fix_iteration": "corrupted" will crash callers instead of returning None.

  • Wrap in try/except like the enum fields: try: fix_iteration=int(data.get("fix_iteration", 0)) except (ValueError, TypeError): fix_iteration=0
  • Same concern applies to list() on line 142 and dict() on line 149 — list(123) and dict("garbage") both raise TypeError.
fix_iteration=int(data.get("fix_iteration", 0)),
failing_experts=list(data.get("failing_experts", [])),
...
artifact_checksums=dict(data.get("artifact_checksums", {})),

🟡 Important

1. Exception handler too narrow for the call it wraps (`koan/app/sdlc_state.py`, L108-118)

The except clause catches json.JSONDecodeError and OSError, but SdlcState.from_dict() can raise ValueError, TypeError, AttributeError depending on the shape of the deserialized data.

Two options, either is fine:

Option 2 is cleaner — put resilience where the knowledge lives.

try:
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return SdlcState.from_dict(data)
except (json.JSONDecodeError, OSError):
    return None
2. get_artifact_path accepts arbitrary filenames (`koan/app/sdlc_state.py`, L169-180)

artifact_name is not validated against SDLC_ARTIFACTS. Since _sanitise_issue_name prevents traversal in the issue component but nothing constrains the artifact component, a caller passing "../../.env" would get a path outside the workspace.

This is internal code today, but the docstring already mentions the whitelist ("must be one of SDLC_ARTIFACTS") — enforce what the docstring promises:

if artifact_name not in SDLC_ARTIFACTS:
    raise ValueError(f"Unknown artifact: {artifact_name!r}")
def get_artifact_path(
    instance_dir: str, issue_name: str, artifact_name: str
) -> Path:
    ws = Path(instance_dir) / "sdlc" / _sanitise_issue_name(issue_name)
    return ws / artifact_name
3. Missing test for from_dict with corrupt field types

The test suite covers unknown enum values and missing fields but doesn't exercise from_dict with wrong types — e.g. {"fix_iteration": "not_a_number"} or {"failing_experts": 42}. These are exactly the cases that trigger the critical bug above.

Add a test that passes structurally-valid JSON with wrong value types and assert it either returns a sensible default or load_sdlc_state returns None.


Checklist

  • No hardcoded secrets
  • Input validation at boundaries — critical #1, warning #2
  • Error handling specificity — critical #1, warning #1
  • Path traversal prevention
  • Test coverage for edge cases — warning #3
  • Resource cleanup / leaks
  • Python mutable defaults

To rebase specific severity levels, mention me: @Koan-Bot rebase critical (fixes 🔴 only), @Koan-Bot rebase important (fixes 🔴 + 🟡), or just @Koan-Bot rebase for all.


Silent Failure Analysis

🟠 **HIGH** — incomplete exception guard (`koan/app/sdlc_state.py:170-175`)

Risk: The docstring promises None for malformed STATE.json, but from_dict can raise ValueError (from int("abc")) or TypeError (if data isn't a dict) on valid JSON with bad values — those propagate uncaught, crashing the caller instead of returning None.

try:
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return SdlcState.from_dict(data)
except (json.JSONDecodeError, OSError):
    return None

Fix: Widen the except clause to (json.JSONDecodeError, OSError, ValueError, TypeError, KeyError) or catch Exception with a logged warning, so all deserialization failures honour the documented contract.

🟡 **MEDIUM** — silent fallback on corrupt state (`koan/app/sdlc_state.py:128-135`)

Risk: An invalid phase or risk value in STATE.json silently resets to defaults with no logging — a corrupted state file could regress a workflow from IMPLEMENTATION back to RESEARCH without any trace.

try:
    phase = SdlcPhase(data.get("current_phase", "research"))
except ValueError:
    phase = SdlcPhase.RESEARCH

try:
    risk = SdlcRiskLevel(data.get("risk_level", "Medium"))
except ValueError:
    risk = SdlcRiskLevel.MEDIUM

Fix: Log a warning (via run_log.log_safe) when an invalid value is encountered so operators can detect state-file corruption.

🟡 **MEDIUM** — silent null return hides corruption (`koan/app/sdlc_state.py:163-175`)

Risk: Callers cannot distinguish 'never started' from 'STATE.json exists but is corrupt' — both return None, so a corrupted workflow silently appears as unstarted, potentially causing duplicate SDLC runs on the same issue.

if not state_file.exists():
    return None
try:
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return SdlcState.from_dict(data)
except (json.JSONDecodeError, OSError):
    return None

Fix: Log a warning when the file exists but can't be parsed, so the silent-restart scenario is at least observable in logs.

🟡 **MEDIUM** — silent name collision (`koan/app/sdlc_state.py:245-252`)

Risk: Any issue name composed entirely of non-alphanumeric characters (e.g. /// or @@@) silently collapses to "unnamed", causing distinct issues to share the same workspace and corrupt each other's state.

safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in issue_name.strip())
return safe.strip("._") or "unnamed"

Fix: Raise ValueError when the sanitised name reduces to empty instead of falling back to a shared "unnamed" bucket.


Automated review by Kōan (Claude · model claude-opus-4-6) HEAD=741c265 3 min 45s

@Koan-Bot Koan-Bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add SDLC state persistence layer for cross-mission phase tracking

3 participants