feat: Multi-turn agentic architecture #56
Open
GeorgeWingg wants to merge 35 commits into SakanaAI:main from GeorgeWingg:feat/multi-turn-architecture-clean
This commit adds the foundational agentic multi-turn editing architecture:

**New Components:**
- AgenticConfig and EvaluatorConfig dataclasses for configuration
- _run_agentic_patch() method for multi-turn agent sessions
- Support for ShinkaAgent (native) and Codex CLI backends
- AgenticEditor harness for managing agent sessions
- Session registry for tracking active agent processes
- Embedding corpus builder for multi-file novelty support

**Integration Points:**
- agentic_mode flag in EvolutionConfig (disabled by default)
- Routing in run_patch() to agentic path when enabled
- Multi-file diff generation for visualization

**Preserved:**
- All existing language support (Swift, JSON, etc.)
- Legacy single-file patch workflow unchanged
- No deletions to async_apply.py, pricing.py, or scheduler.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Create boid.py with Vector2D and Boid classes
- Create simulation.py with SimulationEnvironment
- Create render.py with terminal and matplotlib renderers
- Create main.py as the entry point
- Create initial.py as suboptimal starting point (score ~48)
- Add task config: configs/task/boids_flocking.yaml
- Add variant config: configs/variant/boids_flocking.yaml

This example demonstrates multi-file editing with evolution. The initial implementation has deliberately suboptimal weights to allow room for evolutionary improvement.

TerminalRenderer.render() now accepts (positions, velocities, step) to match MatplotlibRenderer, fixing the fallback when matplotlib is unavailable. Also added close() method for interface consistency.

- Prevent Codex CLI option injection via prompts
- Enforce scratch-dir path/size limits and safer permissions
- Escape agentic metadata in UI and hide bulky diff blobs
- Make agentic.yaml use supported backend defaults

- Add bandit model selection before agentic sessions (parity with legacy)
- Track bandit-selected model for proper reward updates
- Fix Codex backend to respect extra_cli_config model override
- Fix apply_full_patch parameter names in agentic path
- Fix boids_flocking variant config (add variant_suffix, remove n_pop)

- Add agentic variant config for boids multi-file task
- Fix Hydra config override using @_global_ package syntax
- Fix boids task config to nest evo_config properly for merging
- Change default agentic model from gpt-5.2 to gpt-4.1
- Fix display.py NoneType subscript bug in patch_name

- Add gpt-5.2 to OPENAI_MODELS pricing and REASONING_OAI_MODELS
- Update agentic.yaml default model to gpt-5.2
- Add EXECPLAN_PR_READY.md for PR validation tracking

Run quality bar checks (V8) on PR-modified Python files only.
- black with default config
- isort with --profile black

The PromptSampler was sending DIFF-format prompts to agentic sessions, causing agents to output <DIFF> XML instead of using shell commands.

Root cause: PromptSampler had no awareness of agentic_mode.

Fix:
- AGENTIC_SYS_FORMAT is now empty (harness provides its own)
- PromptSampler._sample_agentic() puts task context in user prompt
- runner.py passes agentic_mode to PromptSampler

Also fixed:
- boids_flocking_agentic variant now correctly sets init_program_path
- display.py handles None metadata gracefully

V1.1 E2E test now passes:
- Agent explores workspace with shell commands (ls, sed, etc.)
- Files appear in gen_1/
- patch_type correctly set to "agentic"

The redact_immutable function returned empty string when code had no EVOLVE-BLOCK markers, causing embedding API to fail with 400 error.

Now returns full text for embedding when no markers are present. This affects tasks like boids_flocking that don't use markers.

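The fallback described in this commit can be illustrated with a minimal sketch. It is not the project's actual redact_immutable: the function name, the exact marker spellings (`EVOLVE-BLOCK-START` / `EVOLVE-BLOCK-END`), and the line-based parsing are all assumptions made for illustration. The only behavior taken from the commit message is that text without markers is returned in full rather than as an empty string.

```python
# Assumed marker spellings; the commit message only says "EVOLVE-BLOCK markers".
START_MARKER = "EVOLVE-BLOCK-START"
END_MARKER = "EVOLVE-BLOCK-END"


def redact_immutable_sketch(text: str) -> str:
    """Keep only lines inside marked blocks; return the full text if no markers exist."""
    kept = []
    inside = False
    saw_marker = False
    for line in text.splitlines():
        if START_MARKER in line:
            inside = True
            saw_marker = True
        elif END_MARKER in line:
            inside = False
        elif inside:
            kept.append(line)
    if not saw_marker:
        # No markers at all: fall back to the whole text so the
        # embedding request never receives an empty string.
        return text
    return "\n".join(kept)
```

With this shape, a marker-free task file is embedded as a whole, while a marked file contributes only its mutable regions.
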
BREAKING: Removed silent fallback to gpt-4.1-mini in agentic backends.

Before: If no model configured, silently used gpt-4.1-mini (old model)
After: Raises clear error with instructions on how to configure

Changes:
- shinka_agent.py: Raises ShinkaExecutionError if no model
- codex_cli.py: Raises CodexExecutionError if no model
- agentic.yaml: Now explicitly sets model: "gpt-4.1" (required field)

Also fixed: Inconsistent precedence order between backends
Now both use: extra_cli_config["model"] > profile > FAIL

Error message example:
"No model configured for ShinkaAgent. Set evo_config.agentic.extra_cli_config.model..."

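The precedence order above (extra_cli_config["model"], then profile, then fail) could look roughly like the following sketch. The helper name `resolve_model`, the exception class, and the treatment of `profile` as a plain model-name string are assumptions; the real backends raise ShinkaExecutionError and CodexExecutionError, and the full error text is truncated in the commit message, so the message here stops where the source does.

```python
from typing import Any, Mapping, Optional


class ModelNotConfiguredError(RuntimeError):
    """Hypothetical stand-in for ShinkaExecutionError / CodexExecutionError."""


def resolve_model(
    extra_cli_config: Mapping[str, Any],
    profile: Optional[str],
    backend_name: str,
) -> str:
    """Resolve the model as: extra_cli_config["model"] > profile > fail loudly."""
    model = extra_cli_config.get("model")
    if model:
        return str(model)
    if profile:
        # Assumption: the profile directly names a model.
        return profile
    # No silent default: surface the misconfiguration to the user.
    raise ModelNotConfiguredError(
        f"No model configured for {backend_name}. "
        "Set evo_config.agentic.extra_cli_config.model"
    )
```

Raising instead of defaulting means a missing model fails at session start rather than quietly running on an unintended model.
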
Changes:
- cost_utils.py: Log WARNING when model not in pricing table, use higher fallback rate ($10/M tokens) to make unknown models noticeable
- credentials.py: Log DEBUG showing which credential source was used (env var vs credential file vs nested structure)
- embedding.py: Consistent WARNING-level logging for both Gemini and OpenAI embedding failures; warn when model not in pricing table

These changes help users diagnose configuration issues instead of silently using wrong values.

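As a rough illustration of the cost_utils.py change, the sketch below warns and applies a deliberately high fallback rate when a model is missing from the pricing table. The table contents, the function name `estimate_cost`, and the use of one fallback rate for both input and output tokens are assumptions; only the $10 per million tokens figure and the WARNING level come from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical pricing table in USD per million tokens; values are illustrative only.
PRICING_PER_MILLION = {
    "example-model": {"input": 1.0, "output": 2.0},
}

# Fallback rate from the commit message: $10 per million tokens.
FALLBACK_PER_MILLION = 10.0


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD, warning loudly when the model has no pricing entry."""
    rates = PRICING_PER_MILLION.get(model)
    if rates is None:
        logger.warning(
            "Model %r not in pricing table; using fallback rate of $%.2f/M tokens",
            model,
            FALLBACK_PER_MILLION,
        )
        rates = {"input": FALLBACK_PER_MILLION, "output": FALLBACK_PER_MILLION}
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000
```

A high fallback makes an unpriced model show up as an obvious outlier in cost reports instead of looking free.
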
The agentic mode was running jobs sequentially because _run_full_agentic_job called self.db.sample() inside worker threads, causing race conditions (SQLite connections are not thread-safe).

Changes:
- Move db.sample() to main thread in _submit_agentic_job_async()
- Pass parent_program, archive_programs, top_k_programs to worker thread
- Worker threads only do edit + eval (no database access)
- Main loop uses while-loop to fill job queue for agentic mode
- Add ThreadPoolExecutor for parallel agentic job execution

Performance improvement:
- Before: ~1 generation per 10 minutes (sequential)
- After: ~3 programs per minute with 4 parallel jobs

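The threading pattern in this commit (sample on the main thread, hand plain data to workers) can be sketched as follows. `FakeDB`, `edit_and_eval`, and `run_generation` are hypothetical stand-ins, not the runner's real classes or methods; the sketch only shows that database access stays on the submitting thread while workers receive already-sampled inputs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDB:
    """Hypothetical stand-in for the program database; records who calls sample()."""

    def __init__(self):
        self.sampling_threads = set()
        self._counter = 0

    def sample(self):
        self.sampling_threads.add(threading.get_ident())
        self._counter += 1
        return {"parent_id": self._counter}


def edit_and_eval(parent):
    """Worker body: uses only the data it was handed, never the database."""
    return {"parent_id": parent["parent_id"], "score": float(parent["parent_id"])}


def run_generation(db, n_jobs=4):
    """Sample parents on the calling thread, then run edit + eval in parallel."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # db.sample() is evaluated here, on the submitting thread,
        # before the job is handed to a worker.
        futures = [pool.submit(edit_and_eval, db.sample()) for _ in range(n_jobs)]
        return [future.result() for future in futures]
```

Because the argument to `pool.submit` is evaluated before submission, every `sample()` call happens on the thread that owns the database connection.
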
Variant configuration for Circle Packing task with agentic editing:
- Uses gemini-2.5-flash (OpenAI quota issues)
- 4 parallel jobs for full parallelism testing
- UCB bandit model selection

Changes:
- Check agentic_mode (not evaluator_mode) for parallel job submission
- Add _run_legacy_evaluation_sync() for thread-safe legacy eval via subprocess
- _run_full_agentic_job now supports both legacy and agentic evaluation
- Thread pool created when agentic_mode is enabled (regardless of evaluator)

This allows: agentic editing (parallel) + legacy evaluation (deterministic)

Circle packing now runs with parallel editing and real sum-of-radii scoring.

Two bugs fixed:
1. metrics_path in agentic evaluator was relative but checked against Python's CWD instead of repo_root - converted to absolute path
2. Exception handler in runner hardcoded correct=False even when metrics.json existed with correct=True - now reads from metrics

Both fixes verified working: boids reached score 80.0 with correct=1

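Both fixes can be sketched in a few lines, under stated assumptions: the helper names `resolve_metrics_path` and `read_correct_flag` are hypothetical, and the only detail taken from the commit is that metrics.json carries a `correct` field and that the path should be anchored at repo_root rather than the process working directory.

```python
import json
from pathlib import Path


def resolve_metrics_path(repo_root, metrics_path):
    """Anchor a relative metrics path at repo_root instead of the process CWD."""
    path = Path(metrics_path)
    return path if path.is_absolute() else Path(repo_root) / path


def read_correct_flag(metrics_file, default=False):
    """Read 'correct' from metrics.json; use the default only if it cannot be read."""
    try:
        data = json.loads(Path(metrics_file).read_text())
    except (OSError, json.JSONDecodeError):
        return default
    if not isinstance(data, dict):
        return default
    return bool(data.get("correct", default))
```

An exception handler that calls something like `read_correct_flag` before falling back avoids discarding a result the evaluator had already written.
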
- Changed shinka_agent to execute ALL bash blocks in a response, not just the first one (some models like Gemini output multiple)
- Updated system prompt to reflect this change
- Added reasoning_efforts="auto" default to avoid empty responses
- Updated evaluator prompt to be more explicit about output path

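Collecting every bash block from a response, rather than only the first, might look like the sketch below. The function name, the accepted fence labels (`bash` and `sh`), and the regex are assumptions, not shinka_agent's actual parser, and the sketch only extracts the commands without running them. The fence string is built at runtime so the example can itself sit inside a fenced block.

```python
import re
from typing import List

FENCE = "`" * 3  # three backticks, built at runtime

# Assumption: commands arrive in fenced blocks labelled "bash" or "sh".
_BASH_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"(?:bash|sh)[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_bash_blocks(response: str) -> List[str]:
    """Return the body of every bash block in the response, in order of appearance."""
    return [body.strip() for body in _BASH_BLOCK_RE.findall(response)]
```

Using `findall` instead of a single `search` is what picks up the second and later blocks that some models emit in one turn.
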
- Add max_events attribute to AgenticConfig (was missing, caused AttributeError)
- Fix agentic.py to use max_events instead of max_turns for Codex event limit
- Increase default max_events from 80 to 240 (3x) for longer sessions
- Add _to_primitive() helper to convert OmegaConf DictConfig to JSON-serializable types
- Extract session_id parsing to shared event_utils.py module
- Handle Codex CLI non-zero exit gracefully when events were processed
- Consolidate CodexAuthError into codex_cli.py (was in deleted codex_device_auth.py)

These fixes enable Codex backend to complete full evolution runs without crashes.

Remove unused build_embedding_corpus() function and supporting code:
- EmbeddingCorpus dataclass (unused)
- _is_text_bytes(), _sha256_prefix(), _matches_any() helpers (unused)
- 195 lines of dead code that was never integrated

Only extract_file_content() is actually used in the codebase.

The codex_session_registry.py module was write-only dead code:
- Created JSON files in ~/.codex/shinka_sessions/ tracking active sessions
- But nothing ever read these files back

Delete the module and remove all usages from codex_cli.py and shinka_agent.py.

These were internal planning notes, not meant for the final PR.

ASCII art rendering adds no value for headless evolution runs. Return None in headless mode instead.

Import from credentials.py instead of duplicating the mapping. Simplifies ensure_shinka_available() from 35 to 17 lines.

- Add comprehensive test coverage for agentic components:
  - test_agentic_editor.py (28 tests)
  - test_agentic_evaluator.py (13 tests)
  - test_shinka_agent.py (16 tests)
- Update configs for boids/circle_packing tasks and variants
- Update LLM models (gemini, openai, pricing, query)
- Add gitignore for boids runtime artifacts
- Remove deprecated codex_device_auth module
- Remove unused boids initial.py (refactored to modular structure)
- Fix database islands null-check for patch_name
- Update scheduler and viz_tree for robustness

Move logger initialization after all imports to follow PEP 8 conventions.

Replace placeholder model 'gemini-3-flash-preview' with existing 'gemini-2.5-flash' model in boids and circle packing agentic configs.

- Add EmbeddingCorpus dataclass to represent multi-file corpora
- Implement build_embedding_corpus() for deterministic directory scanning
- Add configurable glob patterns, size limits, and binary file handling
- Refactor get_code_embedding() to support corpus mode with changed file prioritization
- Maintain backward compatibility with existing single-file embedding mode
- Add comprehensive logging for debugging corpus building

This enables the novelty detection system to consider changes across multiple related files, improving semantic understanding for the agentic multi-turn editing architecture.

Summary

Key Changes
- shinka/edit/agentic.py, codex_cli.py, shinka_agent.py - pluggable CLI harnesses that own system prompts and stream events
- shinka/eval/agentic.py - runs evaluation in agent sessions with metrics extraction
- shinka/core/embedding_corpus.py - builds embedding text from multiple workspace files
- evolution/agentic.yaml base config and variant configs

Test plan
tests/test_agentic_*.py)

🤖 Generated with Claude Code