
@GeorgeWingg

Summary

  • Adds multi-turn agentic editing and evaluation backends (Codex CLI, Claude Agent SDK, Shinka Agent)
  • Implements multi-file workspace support with embedding corpus for novelty detection
  • Adds bandit sampling integration for agentic mode
  • Includes new example configs for boids_flocking_agentic and circle_packing_agentic variants

Key Changes

  • Agentic backends: shinka/edit/agentic.py, codex_cli.py, shinka_agent.py - pluggable CLI harnesses that own system prompts and stream events
  • Agentic evaluator: shinka/eval/agentic.py - runs evaluation in agent sessions with metrics extraction
  • Multi-file corpus: shinka/core/embedding_corpus.py - builds embedding text from multiple workspace files
  • Runner integration: Full async job pipeline with thread-safe parallelism for agentic mode
  • Configs: New evolution/agentic.yaml base config and variant configs

Test plan

  • Unit tests for agentic editor and evaluator (tests/test_agentic_*.py)
  • Manual test with circle_packing_agentic config
  • Manual test with boids_flocking_agentic config

🤖 Generated with Claude Code

GeorgeWingg and others added 30 commits December 14, 2025 12:47
This commit adds the foundational agentic multi-turn editing architecture:

**New Components:**
- AgenticConfig and EvaluatorConfig dataclasses for configuration
- _run_agentic_patch() method for multi-turn agent sessions
- Support for ShinkaAgent (native) and Codex CLI backends
- AgenticEditor harness for managing agent sessions
- Session registry for tracking active agent processes
- Embedding corpus builder for multi-file novelty support

**Integration Points:**
- agentic_mode flag in EvolutionConfig (disabled by default)
- Routing in run_patch() to agentic path when enabled
- Multi-file diff generation for visualization

**Preserved:**
- All existing language support (Swift, JSON, etc.)
- Legacy single-file patch workflow unchanged
- No deletions to async_apply.py, pricing.py, or scheduler.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Create boid.py with Vector2D and Boid classes
- Create simulation.py with SimulationEnvironment
- Create render.py with terminal and matplotlib renderers
- Create main.py as the entry point
- Create initial.py as suboptimal starting point (score ~48)
- Add task config: configs/task/boids_flocking.yaml
- Add variant config: configs/variant/boids_flocking.yaml

This example demonstrates multi-file editing with evolution.
The initial implementation has deliberately suboptimal weights
to allow room for evolutionary improvement.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
TerminalRenderer.render() now accepts (positions, velocities, step)
to match MatplotlibRenderer, fixing the fallback when matplotlib
is unavailable. Also added close() method for interface consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Prevent Codex CLI option injection via prompts

- Enforce scratch-dir path/size limits and safer permissions

- Escape agentic metadata in UI and hide bulky diff blobs

- Make agentic.yaml use supported backend defaults
- Add bandit model selection before agentic sessions (parity with legacy)

- Track bandit-selected model for proper reward updates

- Fix Codex backend to respect extra_cli_config model override

- Fix apply_full_patch parameter names in agentic path

- Fix boids_flocking variant config (add variant_suffix, remove n_pop)
- Add agentic variant config for boids multi-file task
- Fix Hydra config override using @_global_ package syntax
- Fix boids task config to nest evo_config properly for merging
- Change default agentic model from gpt-5.2 to gpt-4.1
- Fix display.py NoneType subscript bug in patch_name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add gpt-5.2 to OPENAI_MODELS pricing and REASONING_OAI_MODELS
- Update agentic.yaml default model to gpt-5.2
- Add EXECPLAN_PR_READY.md for PR validation tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Run quality bar checks (V8) on PR-modified Python files only.
- black with default config
- isort with --profile black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The PromptSampler was sending DIFF-format prompts to agentic sessions,
causing agents to output <DIFF> XML instead of using shell commands.

Root cause: PromptSampler had no awareness of agentic_mode.

Fix:
- AGENTIC_SYS_FORMAT is now empty (harness provides its own)
- PromptSampler._sample_agentic() puts task context in user prompt
- runner.py passes agentic_mode to PromptSampler

Also fixed:
- boids_flocking_agentic variant now correctly sets init_program_path
- display.py handles None metadata gracefully

V1.1 E2E test now passes:
- Agent explores workspace with shell commands (ls, sed, etc.)
- Files appear in gen_1/
- patch_type correctly set to "agentic"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The redact_immutable function returned an empty string when the code had no
EVOLVE-BLOCK markers, causing the embedding API to fail with a 400 error.

Now returns full text for embedding when no markers are present.
This affects tasks like boids_flocking that don't use markers.
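
The fallback described above can be sketched as follows. This is a minimal illustration, not the project's redact_immutable: the function name `text_for_embedding` and the exact marker spellings `EVOLVE-BLOCK-START` / `EVOLVE-BLOCK-END` are assumptions, since the commit message only mentions "EVOLVE-BLOCK markers".

```python
import re

# Assumed marker spellings; the commit message only says "EVOLVE-BLOCK markers".
START = "EVOLVE-BLOCK-START"
END = "EVOLVE-BLOCK-END"

# Match everything between a line containing START and a line containing END.
_BLOCK = re.compile(
    rf"^[^\n]*{re.escape(START)}[^\n]*\n(.*?)^[^\n]*{re.escape(END)}",
    re.DOTALL | re.MULTILINE,
)


def text_for_embedding(code: str) -> str:
    """Return only the marked (mutable) regions, or the full text if unmarked."""
    blocks = [m.group(1) for m in _BLOCK.finditer(code)]
    # No complete marker pair found: embed the whole file rather than "".
    return "\n".join(blocks) if blocks else code
```

With this shape, an unmarked multi-file task still produces non-empty embedding input, which is the behaviour the fix is after.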

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
BREAKING: Removed silent fallback to gpt-4.1-mini in agentic backends.

Before: If no model was configured, the backends silently used gpt-4.1-mini (an older model)
After: The backends raise a clear error with instructions on how to configure one

Changes:
- shinka_agent.py: Raises ShinkaExecutionError if no model
- codex_cli.py: Raises CodexExecutionError if no model
- agentic.yaml: Now explicitly sets model: "gpt-4.1" (required field)

Also fixed: Inconsistent precedence order between backends
Now both use: extra_cli_config["model"] > profile > FAIL

Error message example:
"No model configured for ShinkaAgent. Set evo_config.agentic.extra_cli_config.model..."
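
A minimal sketch of the precedence rule above, under stated assumptions: `resolve_model` and `ModelNotConfiguredError` are hypothetical names standing in for the backend-specific logic and errors (ShinkaExecutionError, CodexExecutionError), and `profile_model` is assumed to be a model name supplied by the profile, if any.

```python
from typing import Any, Mapping, Optional


class ModelNotConfiguredError(RuntimeError):
    """Hypothetical stand-in for the backend-specific execution errors."""


def resolve_model(
    extra_cli_config: Mapping[str, Any],
    profile_model: Optional[str],
    backend: str,
) -> str:
    """Apply the order: extra_cli_config["model"] > profile > fail."""
    model = extra_cli_config.get("model")
    if model:
        return str(model)
    if profile_model:
        return profile_model
    # No silent default: fail loudly and say where to configure the model.
    raise ModelNotConfiguredError(
        f"No model configured for {backend}. "
        "Set evo_config.agentic.extra_cli_config.model"
    )
```

Keeping the rule in one helper is one way to avoid the precedence drifting apart between backends again.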

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Changes:
- cost_utils.py: Log WARNING when model not in pricing table, use higher
  fallback rate ($10/M tokens) to make unknown models noticeable
- credentials.py: Log DEBUG showing which credential source was used
  (env var vs credential file vs nested structure)
- embedding.py: Consistent WARNING-level logging for both Gemini and
  OpenAI embedding failures; warn when model not in pricing table

These changes help users diagnose configuration issues instead of
silently using wrong values.
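
The pricing fallback might look roughly like the sketch below. The table contents, the logger name, and the function name `estimate_cost` are illustrative assumptions; only the WARNING level and the $10 per million tokens fallback come from the commit message.

```python
import logging

logger = logging.getLogger("cost_utils_sketch")  # hypothetical logger name

# Hypothetical pricing table in USD per million tokens.
PRICES_PER_M = {"example-model": 2.0}
FALLBACK_PER_M = 10.0  # deliberately high so unknown models stand out in cost reports


def estimate_cost(model: str, tokens: int) -> float:
    """Estimate cost in USD, warning when the model has no pricing entry."""
    rate = PRICES_PER_M.get(model)
    if rate is None:
        logger.warning(
            "Model %r not in pricing table; using fallback rate $%.2f/M tokens",
            model,
            FALLBACK_PER_M,
        )
        rate = FALLBACK_PER_M
    return tokens / 1_000_000 * rate
```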

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The agentic mode was running jobs sequentially because _run_full_agentic_job
called self.db.sample() inside worker threads, causing race conditions
(SQLite connections are not thread-safe).

Changes:
- Move db.sample() to main thread in _submit_agentic_job_async()
- Pass parent_program, archive_programs, top_k_programs to worker thread
- Worker threads only do edit + eval (no database access)
- Main loop uses while-loop to fill job queue for agentic mode
- Add ThreadPoolExecutor for parallel agentic job execution

Performance improvement:
- Before: ~1 generation per 10 minutes (sequential)
- After: ~3 programs per minute with 4 parallel jobs
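
The threading split described above can be illustrated with a small self-contained sketch. It is not the runner's code: the table schema, `edit_and_eval`, and `run_generation` are hypothetical, and the worker body is a placeholder for the real edit and evaluation steps. The point it shows is that every SQLite call stays on the main thread while workers receive plain data.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def edit_and_eval(parent_code: str) -> dict:
    """Worker: operates on plain data only and never touches the database."""
    child = parent_code + "  # edited"
    return {"code": child, "score": float(len(child))}


def run_generation(conn: sqlite3.Connection, n_jobs: int, max_workers: int = 4) -> list:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for _ in range(n_jobs):
            # Sampling happens here, on the main thread that owns the connection.
            row = conn.execute(
                "SELECT code FROM programs ORDER BY score DESC LIMIT 1"
            ).fetchone()
            futures.append(pool.submit(edit_and_eval, row[0]))
        results = [f.result() for f in futures]
    # Results are written back on the main thread as well.
    for r in results:
        conn.execute("INSERT INTO programs (code, score) VALUES (?, ?)", (r["code"], r["score"]))
    conn.commit()
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE programs (code TEXT, score REAL)")
conn.execute("INSERT INTO programs VALUES ('x = 1', 0.0)")
results = run_generation(conn, n_jobs=4)
```

Python's sqlite3 module refuses cross-thread use of a connection by default, so passing sampled rows into the pool sidesteps that restriction without sharing the connection.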

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Variant configuration for Circle Packing task with agentic editing:
- Uses gemini-2.5-flash (because of OpenAI quota issues)
- 4 parallel jobs for full parallelism testing
- UCB bandit model selection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Changes:
- Check agentic_mode (not evaluator_mode) for parallel job submission
- Add _run_legacy_evaluation_sync() for thread-safe legacy eval via subprocess
- _run_full_agentic_job now supports both legacy and agentic evaluation
- Thread pool created when agentic_mode is enabled (regardless of evaluator)

This allows: agentic editing (parallel) + legacy evaluation (deterministic)
Circle packing now runs with parallel editing and real sum-of-radii scoring.
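
One way to picture thread-safe legacy evaluation via subprocess is the sketch below. It is an assumption-laden illustration rather than _run_legacy_evaluation_sync itself: the function name `run_eval_subprocess`, the inline script, and the convention of printing metrics as JSON on stdout are all hypothetical.

```python
import json
import subprocess
import sys


def run_eval_subprocess(script: str, timeout: float = 60.0) -> dict:
    """Run an evaluation script in a child interpreter and parse JSON metrics.

    Each call gets its own process, so worker threads share no interpreter
    state with the evaluation code.
    """
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        return {"correct": False, "error": proc.stderr.strip()}
    return json.loads(proc.stdout)


# Hypothetical evaluation script that prints its metrics as JSON.
demo_script = 'import json; print(json.dumps({"correct": True, "score": 1.5}))'
metrics = run_eval_subprocess(demo_script)
```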

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Two bugs fixed:
1. metrics_path in the agentic evaluator was relative but was resolved against
   Python's CWD instead of repo_root; it is now converted to an absolute path
2. The exception handler in the runner hardcoded correct=False even when
   metrics.json existed with correct=True; it now reads the value from the metrics

Both fixes verified working: boids reached score 80.0 with correct=1
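
Both fixes can be sketched together. The helper names `resolve_metrics_path` and `correct_flag_after_error` are hypothetical, and the file layout in the demo is invented for illustration; the sketch only shows anchoring a relative path at repo_root and reading `correct` from the metrics file instead of assuming False.

```python
import json
import tempfile
from pathlib import Path


def resolve_metrics_path(repo_root: Path, metrics_path: str) -> Path:
    """Anchor a relative metrics path at repo_root rather than the process CWD."""
    path = Path(metrics_path)
    return path if path.is_absolute() else (repo_root / path).resolve()


def correct_flag_after_error(repo_root: Path, metrics_path: str) -> bool:
    """Prefer the value recorded in metrics.json; fall back to False if unreadable."""
    path = resolve_metrics_path(repo_root, metrics_path)
    try:
        metrics = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return bool(metrics.get("correct", False))


# Demo with a throwaway directory standing in for repo_root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "results").mkdir()
    (root / "results" / "metrics.json").write_text(json.dumps({"correct": 1}))
    recovered = correct_flag_after_error(root, "results/metrics.json")
    missing = correct_flag_after_error(root, "results/absent.json")
```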

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Changed shinka_agent to execute ALL bash blocks in a response,
  not just the first one (some models like Gemini output multiple)
- Updated system prompt to reflect this change
- Added reasoning_efforts="auto" default to avoid empty responses
- Updated evaluator prompt to be more explicit about output path
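
Executing every bash block rather than only the first could be sketched like this. The regular expression, the helper names, and the choice of `bash -c` are assumptions for illustration; the actual shinka_agent parsing may differ.

```python
import re
import subprocess

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example
BASH_BLOCK = re.compile(FENCE + r"(?:bash|sh)\n(.*?)" + FENCE, re.DOTALL)


def extract_bash_blocks(response: str) -> list:
    """Return the body of every fenced bash/sh block, in order of appearance."""
    return [m.group(1).strip() for m in BASH_BLOCK.finditer(response)]


def run_all_blocks(response: str, cwd: str = ".") -> list:
    """Run each extracted block in turn and collect its combined output."""
    outputs = []
    for script in extract_bash_blocks(response):
        proc = subprocess.run(
            ["bash", "-c", script],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=60,
        )
        outputs.append(proc.stdout + proc.stderr)
    return outputs
```

Using `finditer` instead of `search` is the whole difference between "first block only" and "all blocks" in this sketch.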

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add max_events attribute to AgenticConfig (was missing, caused AttributeError)
- Fix agentic.py to use max_events instead of max_turns for Codex event limit
- Increase default max_events from 80 to 240 (3x) for longer sessions
- Add _to_primitive() helper to convert OmegaConf DictConfig to JSON-serializable types
- Extract session_id parsing to shared event_utils.py module
- Handle Codex CLI non-zero exit gracefully when events were processed
- Consolidate CodexAuthError into codex_cli.py (was in deleted codex_device_auth.py)

These fixes enable Codex backend to complete full evolution runs without crashes.
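
A conversion helper in the spirit of _to_primitive() might look like the sketch below. It is a generic illustration that assumes config objects behave like standard mappings and sequences; it does not import OmegaConf, and the name `to_primitive` and its fallback to `str()` are assumptions.

```python
import json
from collections.abc import Mapping, Sequence


def to_primitive(value):
    """Recursively convert mapping/sequence containers to plain dicts and lists."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, Mapping):
        return {str(k): to_primitive(v) for k, v in value.items()}
    if isinstance(value, Sequence) and not isinstance(value, (bytes, bytearray)):
        return [to_primitive(v) for v in value]
    # Anything else (paths, enums, custom objects) is stringified for JSON.
    return str(value)


payload = json.dumps(to_primitive({"a": (1, 2), "b": {"c": None}}))
```

When OmegaConf is available, `OmegaConf.to_container(cfg, resolve=True)` is the library's own route to plain containers and may be preferable to a hand-written walk.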

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove unused build_embedding_corpus() function and supporting code:
- EmbeddingCorpus dataclass (unused)
- _is_text_bytes(), _sha256_prefix(), _matches_any() helpers (unused)
- 195 lines of dead code that was never integrated

Only extract_file_content() is actually used in the codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The codex_session_registry.py module was write-only dead code:
- Created JSON files in ~/.codex/shinka_sessions/ tracking active sessions
- But nothing ever read these files back

Delete the module and remove all usages from codex_cli.py and shinka_agent.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
These were internal planning notes, not meant for the final PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
ASCII art rendering adds no value for headless evolution runs.
Return None in headless mode instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
GeorgeWingg and others added 5 commits December 18, 2025 22:17
Import from credentials.py instead of duplicating the mapping.
Simplifies ensure_shinka_available() from 35 to 17 lines.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add comprehensive test coverage for agentic components:
  - test_agentic_editor.py (28 tests)
  - test_agentic_evaluator.py (13 tests)
  - test_shinka_agent.py (16 tests)
- Update configs for boids/circle_packing tasks and variants
- Update LLM models (gemini, openai, pricing, query)
- Add gitignore for boids runtime artifacts
- Remove deprecated codex_device_auth module
- Remove unused boids initial.py (refactored to modular structure)
- Fix database islands null-check for patch_name
- Update scheduler and viz_tree for robustness

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Move logger initialization after all imports to follow PEP 8 conventions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace placeholder model 'gemini-3-flash-preview' with existing
'gemini-2.5-flash' model in boids and circle packing agentic configs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add EmbeddingCorpus dataclass to represent multi-file corpora
- Implement build_embedding_corpus() for deterministic directory scanning
- Add configurable glob patterns, size limits, and binary file handling
- Refactor get_code_embedding() to support corpus mode with changed file prioritization
- Maintain backward compatibility with existing single-file embedding mode
- Add comprehensive logging for debugging corpus building

This enables the novelty detection system to consider changes across
multiple related files, improving semantic understanding for the agentic
multi-turn editing architecture.
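
A deterministic multi-file corpus builder along these lines can be sketched as follows. This is not the PR's build_embedding_corpus(): the names `CorpusSketch` and `build_corpus`, the `# FILE:` separator, the default limits, and the null-byte check for binary files are all illustrative assumptions.

```python
import fnmatch
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CorpusSketch:
    text: str
    included: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def build_corpus(root, include=("*.py",), max_file_bytes=50_000, changed_first=()):
    """Concatenate matching text files in a stable order, changed files first."""
    root = Path(root)
    # Sorting relative POSIX paths makes the scan order deterministic.
    files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    matched = [f for f in files if any(fnmatch.fnmatch(f, pat) for pat in include)]
    changed = set(changed_first)
    ordered = [f for f in matched if f in changed] + [f for f in matched if f not in changed]
    parts, included, skipped = [], [], []
    for rel in ordered:
        data = (root / rel).read_bytes()
        # Skip oversized files and files that look binary (contain a null byte).
        if len(data) > max_file_bytes or b"\x00" in data:
            skipped.append(rel)
            continue
        parts.append(f"# FILE: {rel}\n{data.decode('utf-8', errors='replace')}")
        included.append(rel)
    return CorpusSketch("\n\n".join(parts), included, skipped)


# Demo on a throwaway workspace.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "main.py").write_text("print('main')\n")
    (ws / "boid.py").write_text("class Boid: ...\n")
    (ws / "blob.py").write_bytes(b"\x00\x01")
    (ws / "notes.txt").write_text("ignored\n")
    corpus = build_corpus(ws, changed_first=["main.py"])
```

Putting changed files first means they survive if the embedding input later has to be truncated to a length limit.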

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@zia1138

zia1138 commented Dec 30, 2025

@GeorgeWingg
Curious how well your fork worked. What challenges or advantages did you notice from using Codex CLI? Thanks!
