feat: add service:core error pattern fixes #125

Open
icehippo wants to merge 2 commits into HKUDS:main from icehippo:clawteam/e2e-test/project-feature-datadog-service-core

icehippo commented Apr 6, 2026

Summary

  • P0 (63%, ~293/day): FallbackChain with exponential backoff retry for llm.sub_agent.fallback.failed. Configurable retryable_exceptions prevents retrying non-recoverable errors (auth, config).
  • P1 (18%, ~84/day): SafeContextBuilder with required field validation and graceful degradation (build_partial()) for memory_context_preparation_failed. Duplicate field detection prevents silent data loss.
  • P2 (6%, ~28/day): MemoryGuard with actual RSS monitoring via /proc/self/status (Linux) and chunked batch processing with GC hints for Worker SIGKILL OOM.
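
For reviewers who want a concrete picture of the P0 item, here is a minimal sketch of exponential-backoff retry feeding a fallback chain. It is not the code in `clawteam/fixes/fallback_retry.py`: the field names on `RetryConfig`, the `run()` method, the injectable `sleep` parameter, and the default retryable exception types are all assumptions made for illustration, and the exception class is a local stand-in for the one in `exceptions.py`.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


class SubAgentFallbackExhaustedError(Exception):
    """Local stand-in: raised when every provider in the chain has failed."""


@dataclass(frozen=True)
class RetryConfig:
    # All field names and defaults here are hypothetical.
    max_attempts: int = 3
    base_delay: float = 0.5
    multiplier: float = 2.0
    max_delay: float = 8.0
    # Only these exception types are retried; anything else propagates at once.
    retryable_exceptions: Tuple[Type[BaseException], ...] = (
        TimeoutError,
        ConnectionError,
    )


def retry_with_backoff(
    fn: Callable[[], T],
    config: RetryConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable errors with exponentially growing delays."""
    if config.max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, config.max_attempts + 1):
        try:
            return fn()
        except config.retryable_exceptions:
            if attempt == config.max_attempts:
                raise  # retries exhausted for this callable
            delay = config.base_delay * config.multiplier ** (attempt - 1)
            sleep(min(delay, config.max_delay))
    raise AssertionError("unreachable")


class FallbackChain:
    """Try each provider in order, retrying each one before moving on."""

    def __init__(
        self,
        providers: Sequence[Callable[[], T]],
        config: RetryConfig,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._providers = list(providers)
        self._config = config
        self._sleep = sleep

    def run(self) -> T:
        errors: List[BaseException] = []
        for provider in self._providers:
            try:
                return retry_with_backoff(provider, self._config, self._sleep)
            except self._config.retryable_exceptions as exc:
                errors.append(exc)  # this provider is exhausted, try the next
        raise SubAgentFallbackExhaustedError(
            f"all {len(self._providers)} provider(s) failed"
        ) from (errors[-1] if errors else None)
```

In this sketch a non-retryable error such as an auth failure is never caught, so it surfaces on the first attempt instead of being retried or masked by a fallback.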

P0 and P1 together address 81% of daily production errors (466 errors/24h, service:core env:prd); including P2 brings coverage to about 87% (~405/day).
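
The P1 behavior described in the summary (required-field validation, duplicate detection, and a `build_partial()` fallback) could look roughly like the sketch below. This is an illustration under assumptions, not the contents of `clawteam/fixes/memory_context.py`: the `add()` and `build()` methods, the `ContextField` attributes, the return shape of `build_partial()`, and the choice to treat `None` as "missing" are all hypothetical, and the exception class is a local stand-in.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


class MemoryContextPreparationError(Exception):
    """Local stand-in for the exception defined in exceptions.py."""


@dataclass(frozen=True)
class ContextField:
    # Attribute names are hypothetical.
    name: str
    value: Any = None
    required: bool = False


class SafeContextBuilder:
    def __init__(self) -> None:
        self._fields: Dict[str, ContextField] = {}

    def add(self, field: ContextField) -> "SafeContextBuilder":
        # Reject duplicates instead of silently overwriting the earlier value.
        if field.name in self._fields:
            raise MemoryContextPreparationError(f"duplicate field: {field.name}")
        self._fields[field.name] = field
        return self

    def _missing_required(self) -> List[str]:
        return [
            f.name
            for f in self._fields.values()
            if f.required and f.value is None
        ]

    def _present_values(self) -> Dict[str, Any]:
        return {
            f.name: f.value
            for f in self._fields.values()
            if f.value is not None
        }

    def build(self) -> Dict[str, Any]:
        """Strict build: fail if any required field has no value."""
        missing = self._missing_required()
        if missing:
            raise MemoryContextPreparationError(
                f"missing required fields: {missing}"
            )
        return self._present_values()

    def build_partial(self) -> Tuple[Dict[str, Any], List[str]]:
        """Graceful degradation: return what is available plus what is missing."""
        return self._present_values(), self._missing_required()
```

Returning the list of missing names alongside the partial context lets the caller log the degradation rather than lose it, which is one way to read "graceful degradation" here.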

Changes

| File | Purpose |
| --- | --- |
| clawteam/fixes/exceptions.py | Custom exceptions: CoreServiceError, SubAgentFallbackExhaustedError, MemoryContextPreparationError, WorkerMemoryLimitExceededError |
| clawteam/fixes/fallback_retry.py | RetryConfig value object + retry_with_backoff + FallbackChain |
| clawteam/fixes/memory_context.py | ContextField value object + SafeContextBuilder factory |
| clawteam/fixes/worker_memory.py | MemoryGuard with RSS check + chunked_processor |
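
As a rough illustration of the `worker_memory.py` row, the sketch below reads `VmRSS` from `/proc/self/status` and processes a batch in chunks with a `gc.collect()` hint between chunks. It is not the PR's implementation: the `read_rss_bytes` helper, the `check()` method, the injectable `rss_reader`, and the `chunked_processor` signature are assumptions, and the exception class is a local stand-in. On platforms without `/proc` the reader returns `None` and the guard does nothing.

```python
import gc
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class WorkerMemoryLimitExceededError(Exception):
    """Local stand-in for the exception defined in exceptions.py."""


def read_rss_bytes(status_path: str = "/proc/self/status") -> Optional[int]:
    """Return resident set size in bytes, or None if it cannot be read."""
    try:
        with open(status_path, encoding="ascii", errors="replace") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    # Line format: "VmRSS:     12345 kB"
                    return int(line.split()[1]) * 1024
    except OSError:
        return None  # e.g. no /proc on macOS or Windows
    return None


class MemoryGuard:
    def __init__(
        self,
        limit_bytes: int,
        rss_reader: Callable[[], Optional[int]] = read_rss_bytes,
    ) -> None:
        self.limit_bytes = limit_bytes
        self._rss_reader = rss_reader

    def check(self) -> None:
        """Raise before the kernel OOM killer would send SIGKILL."""
        rss = self._rss_reader()
        if rss is not None and rss > self.limit_bytes:
            raise WorkerMemoryLimitExceededError(
                f"RSS {rss} bytes exceeds limit {self.limit_bytes} bytes"
            )


def _run_chunk(
    chunk: List[T],
    handler: Callable[[List[T]], R],
    guard: Optional[MemoryGuard],
) -> R:
    if guard is not None:
        guard.check()
    result = handler(chunk)
    gc.collect()  # hint only: frees unreachable cycles left by the chunk
    return result


def chunked_processor(
    items: Iterable[T],
    handler: Callable[[List[T]], R],
    chunk_size: int,
    guard: Optional[MemoryGuard] = None,
) -> Iterator[R]:
    """Yield handler(chunk) for consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield _run_chunk(chunk, handler, guard)
            chunk = []
    if chunk:
        yield _run_chunk(chunk, handler, guard)
```

Checking RSS before each chunk turns a hard SIGKILL into a catchable exception, at the cost of one small file read per chunk.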

Test plan

  • 52 new tests passing (pytest)
  • Failure case ratio 62%+ across all test files
  • ruff lint: 0 errors
  • Existing tests unaffected (1 pre-existing failure in test_spawn_backends.py, macOS symlink issue)
  • Dual-pass code review (Phase 1: checklist/structure/test + Phase 2: deep architecture review)

icehippo added 2 commits April 5, 2026 16:34
Phase 0-5: AI dev team OS with per-project-type
pipelines, dashboard UI, Jira/Datadog integration,
E2E test infra, Second Brain knowledge store.

Bug fixes from dual review pipeline:
- Fix dual workflow identity in ProjectManager
- Fix mailbox cross-consume (peek first)
- Fix stale context cache on stage transition
- Fix zombie process on health check failure
- Unify stage prompts, add body size limit
- Optimize N+1 query, add threading.Lock

Address 81% of daily production errors (466/24h):

- P0 (63%): FallbackChain with exponential backoff
  retry for llm.sub_agent.fallback.failed
- P1 (18%): SafeContextBuilder with required field
  validation for memory_context_preparation_failed
- P2 (6%): MemoryGuard with RSS monitoring and
  chunked processing for Worker SIGKILL OOM

Includes 52 tests (failure ratio 62%+).