feat: add service:core error pattern fixes by icehippo · Pull Request #125 · HKUDS/ClawTeam

icehippo · 2026-04-06T05:23:17Z

Summary

- P0 (63%, ~293/day): `FallbackChain` with exponential backoff retry for `llm.sub_agent.fallback.failed`. A configurable `retryable_exceptions` setting prevents retrying non-recoverable errors (auth, config).
- P1 (18%, ~84/day): `SafeContextBuilder` with required-field validation and graceful degradation (`build_partial()`) for `memory_context_preparation_failed`. Duplicate field detection prevents silent data loss.
- P2 (6%, ~28/day): `MemoryGuard` with actual RSS monitoring via `/proc/self/status` (Linux) and chunked batch processing with GC hints for Worker SIGKILL OOM.

Addresses 81% of daily production errors (466 errors/24h, service:core env:prd).
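
To make the P0 behavior concrete, here is a minimal sketch of retry with exponential backoff feeding a fallback chain. It is not the code in `clawteam/fixes/fallback_retry.py`: the `RetryConfig` field names other than `retryable_exceptions`, the `run()` method, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


class SubAgentFallbackExhaustedError(Exception):
    """Stand-in for the exception of the same name listed in this PR."""


@dataclass(frozen=True)
class RetryConfig:
    # Field names below are hypothetical, except retryable_exceptions.
    max_attempts: int = 3
    base_delay: float = 0.5      # seconds slept before the first retry
    backoff_factor: float = 2.0  # delay multiplier applied per attempt
    max_delay: float = 8.0       # upper bound on any single sleep
    retryable_exceptions: Tuple[Type[BaseException], ...] = (
        TimeoutError,
        ConnectionError,
    )


def retry_with_backoff(
    fn: Callable[[], T],
    config: RetryConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only the exceptions listed in the config."""
    for attempt in range(1, config.max_attempts + 1):
        try:
            return fn()
        except config.retryable_exceptions:
            if attempt == config.max_attempts:
                raise  # attempts exhausted: surface the last error
            delay = config.base_delay * config.backoff_factor ** (attempt - 1)
            sleep(min(delay, config.max_delay))
    raise ValueError("max_attempts must be at least 1")


class FallbackChain:
    """Try each provider in order, retrying each one before moving on."""

    def __init__(self, providers: Sequence[Callable[[], T]], config: RetryConfig):
        self._providers = list(providers)
        self._config = config

    def run(self, sleep: Callable[[float], None] = time.sleep):
        errors = []
        for provider in self._providers:
            try:
                return retry_with_backoff(provider, self._config, sleep)
            except self._config.retryable_exceptions as exc:
                errors.append(exc)  # recoverable: fall through to the next one
        raise SubAgentFallbackExhaustedError(
            f"all {len(self._providers)} providers failed: {errors!r}"
        )
```

Exceptions outside `retryable_exceptions` (for example an auth failure) are not caught in this sketch, so they propagate on the first attempt instead of consuming retries or triggering a fallback.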

Changes

| File | Purpose |
| --- | --- |
| `clawteam/fixes/exceptions.py` | Custom exceptions: `CoreServiceError`, `SubAgentFallbackExhaustedError`, `MemoryContextPreparationError`, `WorkerMemoryLimitExceededError` |
| `clawteam/fixes/fallback_retry.py` | `RetryConfig` value object + `retry_with_backoff` + `FallbackChain` |
| `clawteam/fixes/memory_context.py` | `ContextField` value object + `SafeContextBuilder` factory |
| `clawteam/fixes/worker_memory.py` | `MemoryGuard` with RSS check + `chunked_processor` |
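
As an illustration of how the `ContextField` value object and `SafeContextBuilder` could fit together, the sketch below shows required-field validation, duplicate detection, and a degraded `build_partial()`. Only the class names and `build_partial()` come from this PR; the `add()` and `build()` methods, the field attributes, and the `_missing` key are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


class MemoryContextPreparationError(Exception):
    """Stand-in for the exception of the same name listed in this PR."""


@dataclass(frozen=True)
class ContextField:
    # Attribute names are hypothetical.
    name: str
    value: Any
    required: bool = False


class SafeContextBuilder:
    def __init__(self) -> None:
        self._fields: Dict[str, ContextField] = {}

    def add(self, field: ContextField) -> "SafeContextBuilder":
        # Reject duplicates instead of silently overwriting the earlier value.
        if field.name in self._fields:
            raise MemoryContextPreparationError(f"duplicate field: {field.name}")
        self._fields[field.name] = field
        return self

    def _present(self) -> Dict[str, Any]:
        return {f.name: f.value for f in self._fields.values() if f.value is not None}

    def _missing_required(self) -> List[str]:
        return [
            f.name
            for f in self._fields.values()
            if f.required and f.value is None
        ]

    def build(self) -> Dict[str, Any]:
        """Strict build: fail if any required field has no value."""
        missing = self._missing_required()
        if missing:
            raise MemoryContextPreparationError(
                f"missing required fields: {missing}"
            )
        return self._present()

    def build_partial(self) -> Dict[str, Any]:
        """Degraded build: keep what is present and report what is missing."""
        context = self._present()
        context["_missing"] = self._missing_required()
        return context
```

A caller might attempt `build()` first and fall back to `build_partial()` on `MemoryContextPreparationError`, so that a missing field degrades the context rather than failing the whole request.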

Test plan

- 52 new tests passing (pytest)
- Failure case ratio 62%+ across all test files
- ruff lint: 0 errors
- Existing tests unaffected (1 pre-existing failure in `test_spawn_backends.py`, a macOS symlink issue)
- Dual-pass code review (Phase 1: checklist/structure/test; Phase 2: deep architecture review)

Phase 0-5: AI dev team OS with per-project-type pipelines, dashboard UI, Jira/Datadog integration, E2E test infra, Second Brain knowledge store.

Bug fixes from dual review pipeline:
- Fix dual workflow identity in ProjectManager
- Fix mailbox cross-consume (peek first)
- Fix stale context cache on stage transition
- Fix zombie process on health check failure
- Unify stage prompts, add body size limit
- Optimize N+1 query, add threading.Lock

Address 81% of daily production errors (466/24h):
- P0 (63%): FallbackChain with exponential backoff retry for llm.sub_agent.fallback.failed
- P1 (18%): SafeContextBuilder with required field validation for memory_context_preparation_failed
- P2 (6%): MemoryGuard with RSS monitoring and chunked processing for Worker SIGKILL OOM

Includes 52 tests (failure ratio 62%+).
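
The RSS monitoring and chunked processing described for P2 can be sketched roughly as follows. This is an assumption-laden illustration, not the contents of `clawteam/fixes/worker_memory.py`: the helper names `parse_vmrss_mb` and `read_rss_mb`, the `check()` method, and all parameter names are hypothetical. It relies on the `VmRSS` line that Linux exposes in `/proc/self/status`.

```python
import gc
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class WorkerMemoryLimitExceededError(Exception):
    """Stand-in for the exception of the same name listed in this PR."""


def parse_vmrss_mb(status_text: str) -> Optional[float]:
    """Extract VmRSS (reported in kB) from procfs status text, in MiB."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) / 1024
    return None


def read_rss_mb(path: str = "/proc/self/status") -> Optional[float]:
    """Return current RSS in MiB, or None where procfs is unavailable."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            return parse_vmrss_mb(fh.read())
    except OSError:
        return None  # e.g. macOS, which has no /proc


class MemoryGuard:
    def __init__(
        self,
        limit_mb: float,
        reader: Callable[[], Optional[float]] = read_rss_mb,
    ) -> None:
        self.limit_mb = limit_mb
        self._reader = reader

    def check(self) -> None:
        """Raise if the measured RSS is above the configured limit."""
        rss = self._reader()
        if rss is not None and rss > self.limit_mb:
            raise WorkerMemoryLimitExceededError(
                f"RSS {rss:.0f} MiB exceeds limit {self.limit_mb:.0f} MiB"
            )


def chunked_processor(
    items: Iterable[T],
    handler: Callable[[List[T]], R],
    guard: MemoryGuard,
    chunk_size: int = 100,
) -> Iterator[R]:
    """Process items in fixed-size chunks, checking memory before each one."""
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            guard.check()
            yield handler(chunk)
            chunk = []    # drop the reference so the chunk can be freed
            gc.collect()  # GC hint between chunks
    if chunk:
        guard.check()
        yield handler(chunk)
```

Raising a catchable error before the limit is reached gives the worker a chance to stop cleanly, which a SIGKILL from the OOM killer does not. When RSS cannot be read, this sketch skips the check rather than failing.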

icehippo added 2 commits April 5, 2026 16:34

icehippo wants to merge 2 commits into HKUDS:main from icehippo:clawteam/e2e-test/project-feature-datadog-service-core