feat: Multi-turn agentic architecture #56

GeorgeWingg · 2025-12-19T15:31:58Z

Summary

- Adds multi-turn agentic editing and evaluation backends (Codex CLI, Claude Agent SDK, Shinka Agent)
- Implements multi-file workspace support with embedding corpus for novelty detection
- Adds bandit sampling integration for agentic mode
- Includes new example configs for boids_flocking_agentic and circle_packing_agentic variants

Key Changes

- Agentic backends: shinka/edit/agentic.py, codex_cli.py, shinka_agent.py - pluggable CLI harnesses that own system prompts and stream events
- Agentic evaluator: shinka/eval/agentic.py - runs evaluation in agent sessions with metrics extraction
- Multi-file corpus: shinka/core/embedding_corpus.py - builds embedding text from multiple workspace files
- Runner integration: Full async job pipeline with thread-safe parallelism for agentic mode
- Configs: New evolution/agentic.yaml base config and variant configs

Test plan

- Unit tests for agentic editor and evaluator (tests/test_agentic_*.py)
- Manual test with circle_packing_agentic config
- Manual test with boids_flocking_agentic config

🤖 Generated with Claude Code

This commit adds the foundational agentic multi-turn editing architecture:

**New Components:**
- AgenticConfig and EvaluatorConfig dataclasses for configuration
- _run_agentic_patch() method for multi-turn agent sessions
- Support for ShinkaAgent (native) and Codex CLI backends
- AgenticEditor harness for managing agent sessions
- Session registry for tracking active agent processes
- Embedding corpus builder for multi-file novelty support

**Integration Points:**
- agentic_mode flag in EvolutionConfig (disabled by default)
- Routing in run_patch() to agentic path when enabled
- Multi-file diff generation for visualization

**Preserved:**
- All existing language support (Swift, JSON, etc.)
- Legacy single-file patch workflow unchanged
- No deletions to async_apply.py, pricing.py, or scheduler.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Create boid.py with Vector2D and Boid classes
- Create simulation.py with SimulationEnvironment
- Create render.py with terminal and matplotlib renderers
- Create main.py as the entry point
- Create initial.py as suboptimal starting point (score ~48)
- Add task config: configs/task/boids_flocking.yaml
- Add variant config: configs/variant/boids_flocking.yaml

This example demonstrates multi-file editing with evolution. The initial implementation has deliberately suboptimal weights to allow room for evolutionary improvement.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

TerminalRenderer.render() now accepts (positions, velocities, step) to match MatplotlibRenderer, fixing the fallback when matplotlib is unavailable. Also added close() method for interface consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Prevent Codex CLI option injection via prompts
- Enforce scratch-dir path/size limits and safer permissions
- Escape agentic metadata in UI and hide bulky diff blobs
- Make agentic.yaml use supported backend defaults

- Add bandit model selection before agentic sessions (parity with legacy)
- Track bandit-selected model for proper reward updates
- Fix Codex backend to respect extra_cli_config model override
- Fix apply_full_patch parameter names in agentic path
- Fix boids_flocking variant config (add variant_suffix, remove n_pop)
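The select-then-credit flow described in the commit above can be sketched with a textbook UCB1 bandit. This is a minimal illustration, not Shinka's bandit implementation: the class name `UCB1ModelBandit`, the `run_session` helper, and the `edit_fn` callback are all hypothetical. The point it shows is that the model chosen before the session is the one that must receive the reward afterwards.

```python
import math


class UCB1ModelBandit:
    """Textbook UCB1 over model names (hypothetical, for illustration only)."""

    def __init__(self, models, exploration=1.0):
        self.models = list(models)
        self.exploration = exploration
        self.counts = {m: 0 for m in self.models}
        self.totals = {m: 0.0 for m in self.models}

    def select(self):
        # Play every arm once before applying the UCB formula.
        for model in self.models:
            if self.counts[model] == 0:
                return model
        n = sum(self.counts.values())

        def ucb(model):
            mean = self.totals[model] / self.counts[model]
            bonus = math.sqrt(2 * math.log(n) / self.counts[model])
            return mean + self.exploration * bonus

        return max(self.models, key=ucb)

    def update(self, model, reward):
        self.counts[model] += 1
        self.totals[model] += reward


def run_session(bandit, edit_fn):
    """Select a model first, run the session, then credit that same model."""
    model = bandit.select()
    reward = edit_fn(model)  # edit_fn stands in for an agentic edit + eval
    bandit.update(model, reward)
    return model, reward
```

Returning the selected model alongside the reward keeps the two together, so a caller that runs sessions in parallel cannot accidentally credit a different arm.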

- Add agentic variant config for boids multi-file task
- Fix Hydra config override using @_global_ package syntax
- Fix boids task config to nest evo_config properly for merging
- Change default agentic model from gpt-5.2 to gpt-4.1
- Fix display.py NoneType subscript bug in patch_name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add gpt-5.2 to OPENAI_MODELS pricing and REASONING_OAI_MODELS
- Update agentic.yaml default model to gpt-5.2
- Add EXECPLAN_PR_READY.md for PR validation tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Run quality bar checks (V8) on PR-modified Python files only.

- black with default config
- isort with --profile black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The PromptSampler was sending DIFF-format prompts to agentic sessions, causing agents to output <DIFF> XML instead of using shell commands.

Root cause: PromptSampler had no awareness of agentic_mode.

Fix:
- AGENTIC_SYS_FORMAT is now empty (harness provides its own)
- PromptSampler._sample_agentic() puts task context in user prompt
- runner.py passes agentic_mode to PromptSampler

Also fixed:
- boids_flocking_agentic variant now correctly sets init_program_path
- display.py handles None metadata gracefully

V1.1 E2E test now passes:
- Agent explores workspace with shell commands (ls, sed, etc.)
- Files appear in gen_1/
- patch_type correctly set to "agentic"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The redact_immutable function returned empty string when code had no EVOLVE-BLOCK markers, causing embedding API to fail with 400 error.

Now returns full text for embedding when no markers are present. This affects tasks like boids_flocking that don't use markers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
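The behaviour described above (keep only the evolvable region, but fall back to the full text when there are no markers) can be sketched as follows. This is not the project's `redact_immutable`; the function name `mutable_text_for_embedding` and the exact marker strings are assumptions made for illustration.

```python
# Assumed marker strings; the real project may spell them differently.
START_MARKER = "EVOLVE-BLOCK-START"
END_MARKER = "EVOLVE-BLOCK-END"


def mutable_text_for_embedding(code: str) -> str:
    """Return the text between markers, or the full text if none exist."""
    lines = code.splitlines()
    if not any(START_MARKER in line for line in lines):
        # No markers: embed everything instead of sending an empty string,
        # which an embedding API would typically reject.
        return code

    kept, inside = [], False
    for line in lines:
        if START_MARKER in line:
            inside = True
        elif END_MARKER in line:
            inside = False
        elif inside:
            kept.append(line)
    return "\n".join(kept)
```

For a marker-free file such as a boids module, the function returns its input unchanged, so the embedding request always has non-empty content as long as the file itself is non-empty.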

BREAKING: Removed silent fallback to gpt-4.1-mini in agentic backends.

Before: If no model configured, silently used gpt-4.1-mini (old model)
After: Raises clear error with instructions on how to configure

Changes:
- shinka_agent.py: Raises ShinkaExecutionError if no model
- codex_cli.py: Raises CodexExecutionError if no model
- agentic.yaml: Now explicitly sets model: "gpt-4.1" (required field)

Also fixed: Inconsistent precedence order between backends
Now both use: extra_cli_config["model"] > profile > FAIL

Error message example:
"No model configured for ShinkaAgent. Set evo_config.agentic.extra_cli_config.model..."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
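A minimal sketch of the precedence rule stated above (`extra_cli_config["model"]`, then profile, then fail). The function `resolve_model` and the exception class `ModelNotConfiguredError` are hypothetical stand-ins for the backends' own code, and treating `profile` as a plain optional string is an assumption.

```python
class ModelNotConfiguredError(RuntimeError):
    """Hypothetical stand-in for ShinkaExecutionError / CodexExecutionError."""


def resolve_model(extra_cli_config, profile, backend_name="ShinkaAgent"):
    """Pick a model by precedence; raise instead of falling back silently."""
    model = (extra_cli_config or {}).get("model") or profile
    if not model:
        raise ModelNotConfiguredError(
            f"No model configured for {backend_name}. "
            "Set evo_config.agentic.extra_cli_config.model"
        )
    return model
```

Raising here trades convenience for visibility: a misconfigured run stops immediately rather than spending budget on a model nobody chose.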

Changes:
- cost_utils.py: Log WARNING when model not in pricing table, use higher fallback rate ($10/M tokens) to make unknown models noticeable
- credentials.py: Log DEBUG showing which credential source was used (env var vs credential file vs nested structure)
- embedding.py: Consistent WARNING-level logging for both Gemini and OpenAI embedding failures; warn when model not in pricing table

These changes help users diagnose configuration issues instead of silently using wrong values.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
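The cost fallback in the first bullet above might look roughly like this. It is a sketch under assumptions: the `PRICING` table, the single blended per-token rate, and the name `estimate_cost` are illustrative and not taken from cost_utils.py. Only the $10 per million tokens fallback and the WARNING level come from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Fallback rate from the commit message: $10 per million tokens.
FALLBACK_USD_PER_M_TOKENS = 10.0

# Hypothetical pricing table (USD per million tokens, single blended rate).
PRICING = {"example-model": 2.0}


def estimate_cost(model, total_tokens, pricing=PRICING):
    """Estimate cost in USD, warning loudly when the model is unknown."""
    rate = pricing.get(model)
    if rate is None:
        logger.warning(
            "Model %r not in pricing table; using fallback rate $%.2f/M tokens",
            model,
            FALLBACK_USD_PER_M_TOKENS,
        )
        rate = FALLBACK_USD_PER_M_TOKENS
    return total_tokens / 1_000_000 * rate
```

A deliberately high fallback makes an unpriced model stand out in cost reports instead of appearing cheap.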

The agentic mode was running jobs sequentially because _run_full_agentic_job called self.db.sample() inside worker threads, causing race conditions (SQLite connections are not thread-safe).

Changes:
- Move db.sample() to main thread in _submit_agentic_job_async()
- Pass parent_program, archive_programs, top_k_programs to worker thread
- Worker threads only do edit + eval (no database access)
- Main loop uses while-loop to fill job queue for agentic mode
- Add ThreadPoolExecutor for parallel agentic job execution

Performance improvement:
- Before: ~1 generation per 10 minutes (sequential)
- After: ~3 programs per minute with 4 parallel jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
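The threading pattern described above (sample on the main thread, hand plain data to workers, write results back on the main thread) can be sketched with the standard library. `FakeProgramDB`, `edit_and_eval`, and `run_generation` are hypothetical names; the in-memory list stands in for the SQLite-backed database, and the worker body stands in for an agentic edit plus evaluation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeProgramDB:
    """In-memory stand-in for a database that is only safe on one thread."""

    def __init__(self):
        self.programs = [{"id": 0, "score": 1.0}]

    def sample(self):
        return max(self.programs, key=lambda p: p["score"])

    def add(self, program):
        program["id"] = len(self.programs)
        self.programs.append(program)


def edit_and_eval(parent, job_index):
    """Worker: receives plain data and never touches the database."""
    return {"parent_id": parent["id"], "score": parent["score"] + job_index}


def run_generation(db, num_jobs=4, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for job_index in range(num_jobs):
            parent = dict(db.sample())  # sampled on the main thread, copied
            futures.append(pool.submit(edit_and_eval, parent, job_index))
        for future in as_completed(futures):
            db.add(future.result())  # written back on the main thread
    return db
```

Because every database call happens on the thread that owns the connection, the workers can run fully in parallel without any locking around the database.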

Variant configuration for Circle Packing task with agentic editing:
- Uses gemini-2.5-flash (OpenAI quota issues)
- 4 parallel jobs for full parallelism testing
- UCB bandit model selection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Changes:
- Check agentic_mode (not evaluator_mode) for parallel job submission
- Add _run_legacy_evaluation_sync() for thread-safe legacy eval via subprocess
- _run_full_agentic_job now supports both legacy and agentic evaluation
- Thread pool created when agentic_mode is enabled (regardless of evaluator)

This allows: agentic editing (parallel) + legacy evaluation (deterministic)

Circle packing now runs with parallel editing and real sum-of-radii scoring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Two bugs fixed:
1. metrics_path in agentic evaluator was relative but checked against Python's CWD instead of repo_root - converted to absolute path
2. Exception handler in runner hardcoded correct=False even when metrics.json existed with correct=True - now reads from metrics

Both fixes verified working: boids reached score 80.0 with correct=1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
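Both fixes above can be illustrated in a few lines. This is a sketch, not the evaluator or runner code: `resolve_metrics_path` and `correct_after_failure` are hypothetical helpers, and the only metrics key assumed is `correct`, which the commit message mentions.

```python
import json
from pathlib import Path


def resolve_metrics_path(metrics_path, repo_root):
    """Anchor a relative metrics path at repo_root, not the process CWD."""
    path = Path(metrics_path)
    if path.is_absolute():
        return path
    return (Path(repo_root) / path).resolve()


def correct_after_failure(metrics_file):
    """In an error handler, trust metrics.json if it exists and parses."""
    try:
        data = json.loads(Path(metrics_file).read_text())
    except (OSError, json.JSONDecodeError):
        return False  # no usable metrics: the run really did fail
    return isinstance(data, dict) and bool(data.get("correct", False))
```

The second helper only falls back to `False` when the file is missing or unreadable, so a session that wrote valid metrics and then exited non-zero is not recorded as incorrect.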

- Changed shinka_agent to execute ALL bash blocks in a response, not just the first one (some models like Gemini output multiple)
- Updated system prompt to reflect this change
- Added reasoning_efforts="auto" default to avoid empty responses
- Updated evaluator prompt to be more explicit about output path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
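The first bullet above (run every bash block in a response, not only the first) comes down to using a find-all match instead of a single search. The sketch below assumes the agent emits Markdown-style fenced blocks tagged `bash` or `sh`; the helper names are hypothetical and this is not the shinka_agent.py parser.

```python
import re
import subprocess

FENCE = "`" * 3  # built indirectly so this example can sit inside a fence

# Matches a fenced block tagged bash or sh and captures its body.
BASH_BLOCK = re.compile(FENCE + r"(?:bash|sh)\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_bash_blocks(response):
    """Return every bash block in order, not just the first match."""
    return [body.strip() for body in BASH_BLOCK.findall(response)]


def run_all_blocks(response, cwd=".", timeout=30):
    """Run each block in sequence and collect its combined output."""
    outputs = []
    for script in extract_bash_blocks(response):
        proc = subprocess.run(
            ["bash", "-c", script],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        outputs.append(proc.stdout + proc.stderr)
    return outputs
```

Running the blocks sequentially preserves the order the model intended, which matters when a later block depends on files created by an earlier one.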

- Add max_events attribute to AgenticConfig (was missing, caused AttributeError)
- Fix agentic.py to use max_events instead of max_turns for Codex event limit
- Increase default max_events from 80 to 240 (3x) for longer sessions
- Add _to_primitive() helper to convert OmegaConf DictConfig to JSON-serializable types
- Extract session_id parsing to shared event_utils.py module
- Handle Codex CLI non-zero exit gracefully when events were processed
- Consolidate CodexAuthError into codex_cli.py (was in deleted codex_device_auth.py)

These fixes enable Codex backend to complete full evolution runs without crashes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
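A helper in the spirit of `_to_primitive()` from the list above could be written as a recursive conversion over the `collections.abc` abstract types. This sketch is an assumption about the approach, not the PR's code: it is duck-typed so it runs without omegaconf installed, and it assumes config containers behave as `Mapping` or `Sequence` instances. When omegaconf is available, `OmegaConf.to_container(cfg, resolve=True)` is the library's own conversion.

```python
import json
from collections.abc import Mapping, Sequence


def to_primitive(value):
    """Recursively convert config-like containers to JSON-friendly types."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, Mapping):
        return {str(k): to_primitive(v) for k, v in value.items()}
    if isinstance(value, (Sequence, set, frozenset)) and not isinstance(
        value, (bytes, bytearray)
    ):
        return [to_primitive(v) for v in value]
    # Anything else (paths, enums, bytes, ...) is stored as its string form.
    return str(value)


def dumps_config(config):
    """Serialize a config-like object after converting it to primitives."""
    return json.dumps(to_primitive(config), sort_keys=True)
```

Falling back to `str()` for unknown types keeps serialization from raising mid-run, at the cost of losing the original type on reload.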

Remove unused build_embedding_corpus() function and supporting code:
- EmbeddingCorpus dataclass (unused)
- _is_text_bytes(), _sha256_prefix(), _matches_any() helpers (unused)
- 195 lines of dead code that was never integrated

Only extract_file_content() is actually used in the codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The codex_session_registry.py module was write-only dead code:
- Created JSON files in ~/.codex/shinka_sessions/ tracking active sessions
- But nothing ever read these files back

Delete the module and remove all usages from codex_cli.py and shinka_agent.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

This was internal planning notes, not meant for the final PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

ASCII art rendering adds no value for headless evolution runs. Return None in headless mode instead. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Import from credentials.py instead of duplicating the mapping. Simplifies ensure_shinka_available() from 35 to 17 lines. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add comprehensive test coverage for agentic components:
  - test_agentic_editor.py (28 tests)
  - test_agentic_evaluator.py (13 tests)
  - test_shinka_agent.py (16 tests)
- Update configs for boids/circle_packing tasks and variants
- Update LLM models (gemini, openai, pricing, query)
- Add gitignore for boids runtime artifacts
- Remove deprecated codex_device_auth module
- Remove unused boids initial.py (refactored to modular structure)
- Fix database islands null-check for patch_name
- Update scheduler and viz_tree for robustness

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Move logger initialization after all imports to follow PEP 8 conventions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Replace placeholder model 'gemini-3-flash-preview' with existing 'gemini-2.5-flash' model in boids and circle packing agentic configs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add EmbeddingCorpus dataclass to represent multi-file corpora
- Implement build_embedding_corpus() for deterministic directory scanning
- Add configurable glob patterns, size limits, and binary file handling
- Refactor get_code_embedding() to support corpus mode with changed file prioritization
- Maintain backward compatibility with existing single-file embedding mode
- Add comprehensive logging for debugging corpus building

This enables the novelty detection system to consider changes across multiple related files, improving semantic understanding for the agentic multi-turn editing architecture.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

zia1138 · 2025-12-30T22:58:34Z

@GeorgeWingg
Curious how well your fork worked. What challenges or advantages did you notice when using Codex CLI? Thanks!

GeorgeWingg and others added 30 commits December 14, 2025 12:47

feat: Add multi-file diff viewer and agentic node indicator

bd46743

fix: Remove embedded script tag breaking HTML parser

e7faefe

fix: harden agentic backends and config

ea6e91e


feat: codex headless auth (device + api key)

23915e0

fix: prefer subscription auth for codex

a860e08

fix: correct embedding corpus args for agentic files

ec6307e

feat: propagate multi-file workspace between generations

810e318

fix: hydrate workspace for legacy multi-file patches

1fda8e3

docs: update EXECPLAN with silent fallback fixes

0946ee4

chore: remove PR planning document

d80bff2


chore: remove unused TerminalRenderer from boids example

36c448d


GeorgeWingg and others added 5 commits December 18, 2025 22:17

fix: correct import order in codex_cli.py

92dbada

