
Conversation

@jatorre
Contributor

@jatorre jatorre commented Oct 23, 2025

Problem

A critical memory leak in APScheduler was causing 35GB+ of memory allocations during proxy startup and operation. The leak was identified through Memray analysis, which showed massive allocations in APScheduler's normalize() and _apply_jitter() functions.

Root Cause

The jitter parameter on scheduled jobs triggered expensive calculations in APScheduler's internal functions; combined with very frequent job intervals (10s), this caused memory to accumulate rapidly.
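
For illustration, the leaking pattern looked roughly like the sketch below; the job callable and the exact values are assumptions for the sketch, not the actual litellm code:

```python
# Illustrative only: a frequent interval combined with jitter is the pattern that leaked.
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def batch_write_to_db():
    """Hypothetical placeholder for the proxy's periodic DB-write job."""


scheduler = AsyncIOScheduler()
scheduler.add_job(
    batch_write_to_db,
    trigger="interval",
    seconds=10,  # very frequent interval
    jitter=5,    # randomizes every run; removed by this PR
    id="batch_write_job",
)
```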

Solution

Key Changes

  1. Removed jitter parameters from all scheduled jobs - jitter was the primary memory leak source
  2. Configured AsyncIOScheduler with memory-optimized job_defaults (see the sketch after this list):
    • misfire_grace_time: 3600s (increased from 120s) to prevent backlog calculations
    • coalesce: true to collapse missed runs
    • max_instances: 1 to prevent concurrent job execution
    • replace_existing: true to avoid duplicate jobs on restart
  3. Increased minimum job intervals:
    • PROXY_BATCH_WRITE_AT: 30s (was 10s)
    • add_deployment/get_credentials jobs: 30s (was 10s)
  4. Use fixed intervals with small random offsets instead of jitter for job distribution across workers
  5. Explicitly configured jobstores and executors to minimize overhead
  6. Disabled timezone awareness to reduce computation
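
A minimal sketch of the setup described above, assuming APScheduler 3.x. The job callable, interval, and offset range are illustrative rather than the exact litellm code; note that replace_existing is a per-job argument rather than a job_defaults key:

```python
# Illustrative sketch of the memory-safe scheduler setup (APScheduler 3.x).
import asyncio
import random
from datetime import datetime, timedelta, timezone

from apscheduler.executors.asyncio import AsyncIOExecutor
from apscheduler.jobstores.memory import MemoryJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def batch_write_to_db():
    """Hypothetical placeholder for the proxy's periodic DB-write job."""


async def main():
    scheduler = AsyncIOScheduler(
        jobstores={"default": MemoryJobStore()},   # explicit jobstore, minimal overhead
        executors={"default": AsyncIOExecutor()},  # explicit executor
        job_defaults={
            "coalesce": True,            # collapse many missed runs into one
            "misfire_grace_time": 3600,  # ignore runs older than 1 hour (was 120)
            "max_instances": 1,          # never run the same job concurrently
        },
        timezone="UTC",  # pin to UTC to keep timezone handling cheap
    )

    # No jitter: stagger workers with a small random start offset instead.
    start_offset = random.uniform(0, 5)
    scheduler.add_job(
        batch_write_to_db,
        trigger="interval",
        seconds=30,             # PROXY_BATCH_WRITE_AT default (was 10)
        id="batch_write_job",
        replace_existing=True,  # avoid duplicate jobs on restart
        next_run_time=datetime.now(timezone.utc) + timedelta(seconds=start_offset),
    )
    scheduler.start()
    await asyncio.sleep(90)  # keep the loop alive long enough for a few runs


if __name__ == "__main__":
    asyncio.run(main())
```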

Files Modified

  • litellm/proxy/proxy_server.py - Main scheduler configuration
  • litellm/constants.py - Updated PROXY_BATCH_WRITE_AT default (10s → 30s)
  • enterprise/litellm_enterprise/integrations/prometheus.py - Prometheus job config
  • tests/basic_proxy_startup_tests/test_apscheduler_memory_fix.py - Test suite (NEW)

Impact

Memory

  • Before: 35GB with 483M allocations during startup
  • After: <1GB with normal allocation patterns

Performance

  • Minimum job intervals increased from 10s → 30s (configurable via PROXY_BATCH_WRITE_AT env var)
  • Jobs still distributed across workers using random start offsets
  • No functional changes to job behavior, only timing and memory optimization

Testing

  • Added a comprehensive test suite validating the scheduler configuration (a sketch of the kind of checks follows this list)
  • Verified no job execution backlog on startup
  • Tested duplicate job prevention with replace_existing
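
For illustration, checks of this kind might look as follows (assuming APScheduler 3.x); the test names and bodies are sketches, not the actual contents of test_apscheduler_memory_fix.py:

```python
# Illustrative checks only; the real suite lives in
# tests/basic_proxy_startup_tests/test_apscheduler_memory_fix.py.
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler


def test_job_defaults_are_memory_safe():
    scheduler = AsyncIOScheduler(
        job_defaults={"coalesce": True, "misfire_grace_time": 3600, "max_instances": 1}
    )
    # _job_defaults is an internal attribute, but stable across APScheduler 3.x.
    assert scheduler._job_defaults["coalesce"] is True
    assert scheduler._job_defaults["misfire_grace_time"] == 3600
    assert scheduler._job_defaults["max_instances"] == 1


def test_replace_existing_prevents_duplicate_jobs():
    async def run():
        scheduler = AsyncIOScheduler()
        scheduler.start(paused=True)  # jobs register but never execute
        for _ in range(2):
            scheduler.add_job(
                lambda: None,
                trigger="interval",
                seconds=30,
                id="batch_write_job",
                replace_existing=True,
            )
        assert len(scheduler.get_jobs()) == 1
        scheduler.shutdown(wait=False)

    asyncio.run(run())
```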

Breaking Changes

⚠️ Default PROXY_BATCH_WRITE_AT increased from 10s → 30s. Users can override via environment variable if they need more frequent updates (though this may reintroduce memory issues with very low values).
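
For example, a deployment that wants the old cadence can set the variable back to 10. The snippet below is a hypothetical sketch of how such an override is read, not litellm's actual parsing code:

```python
import os

# Hypothetical sketch: read the override, falling back to the new 30s default.
proxy_batch_write_at = int(os.getenv("PROXY_BATCH_WRITE_AT", "30"))
```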

Context

This fix originated from production experience at CartoDB where memory leaks were causing proxy crashes. The fix has been battle-tested in production environments.

Note: Please assign @mdiloreto as reviewer/collaborator on this PR.


🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

…ntervals

Fixes critical memory leak in APScheduler that causes 35GB+ memory allocations
during proxy startup and operation. The leak was identified through Memray
analysis showing massive allocations in normalize() and _apply_jitter()
functions.

Key changes:
1. Remove jitter parameters from all scheduled jobs - jitter was causing
   expensive normalize() calculations leading to memory explosion
2. Configure AsyncIOScheduler with optimized job_defaults:
   - misfire_grace_time: 3600s (increased from 120s) to prevent backlog
     calculations that trigger memory leaks
   - coalesce: true to collapse missed runs
   - max_instances: 1 to prevent concurrent job execution
   - replace_existing: true to avoid duplicate jobs on restart
3. Increase minimum job intervals:
   - PROXY_BATCH_WRITE_AT: 30s (was 10s)
   - add_deployment/get_credentials jobs: 30s (was 10s)
4. Use fixed intervals with small random offsets instead of jitter for
   job distribution across workers
5. Explicitly configure jobstores and executors to minimize overhead
6. Disable timezone awareness to reduce computation

Memory impact:
- Before: 35GB with 483M allocations during startup
- After: <1GB with normal allocation patterns

Performance notes:
- Minimum job intervals increased from 10s to 30s (configurable via env vars)
- Jobs can still be distributed across workers using random start offsets
- No functional changes to job behavior, only timing and memory optimization

Testing:
- Added comprehensive test suite for scheduler configuration
- Verified no job execution backlog on startup
- Tested duplicate job prevention with replace_existing

Related issue: Memory leak in production proxy servers with APScheduler

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@vercel

vercel bot commented Oct 23, 2025

@jatorre is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

@jatorre
Contributor Author

jatorre commented Oct 23, 2025

@mateo-di here is the PR

Collaborator

@AlexsanderHamir AlexsanderHamir left a comment

Great changes! Just missing a small doc update in docs/my-website/docs/proxy/config_settings.md for the proxy_batch_write_at default value change.

Update documentation to reflect the new default value for PROXY_BATCH_WRITE_AT
changed in PR BerriAI#15846. The default was increased from 10 seconds to 30 seconds
to prevent memory leaks in APScheduler.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@jatorre jatorre marked this pull request as ready for review October 23, 2025 16:53
Contributor

@ishaan-jaff ishaan-jaff left a comment

small changes requested @jatorre

scheduler = AsyncIOScheduler(
    job_defaults={
        "coalesce": True,  # collapse many missed runs into one
        "misfire_grace_time": 3600,  # ignore runs older than 1 hour (was 120)

can these variables be placed in constants.py

    # REMOVED jitter parameter - major cause of memory leak
    id="reset_budget_job",
    replace_existing=True,
    misfire_grace_time=3600,  # job-specific grace time

constants.py

Address code review feedback from ishaan-jaff:
- Move scheduler configuration variables (coalesce, misfire_grace_time,
  max_instances, replace_existing) to litellm/constants.py
- Update all references in proxy_server.py to use the constants
- Improves maintainability and makes configuration values centralized

Requested-by: @ishaan-jaff
Related: BerriAI#15846

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@jatorre
Contributor Author

jatorre commented Oct 24, 2025

@ishaan-jaff I've addressed your feedback by moving the APScheduler configuration variables to constants.py:

  • Added APSCHEDULER_COALESCE, APSCHEDULER_MISFIRE_GRACE_TIME, APSCHEDULER_MAX_INSTANCES, APSCHEDULER_REPLACE_EXISTING to litellm/constants.py (sketched below)
  • Updated litellm/proxy/proxy_server.py to import and use these constants throughout
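
A sketch of how those constants and their use might look; the values mirror the PR description, but the exact layout in litellm/constants.py and proxy_server.py may differ:

```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

# Constants as described above (in litellm they live in litellm/constants.py).
APSCHEDULER_COALESCE = True            # collapse many missed runs into one
APSCHEDULER_MISFIRE_GRACE_TIME = 3600  # seconds; ignore runs older than 1 hour
APSCHEDULER_MAX_INSTANCES = 1          # never run the same job concurrently
APSCHEDULER_REPLACE_EXISTING = True    # passed per add_job(..., replace_existing=...) call

scheduler = AsyncIOScheduler(
    job_defaults={
        "coalesce": APSCHEDULER_COALESCE,
        "misfire_grace_time": APSCHEDULER_MISFIRE_GRACE_TIME,
        "max_instances": APSCHEDULER_MAX_INSTANCES,
    }
)
```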

The doc update for proxy_batch_write_at that @AlexsanderHamir requested was already included in the PR.

Latest commit: 1eabfd4

@krrishdholakia
Contributor

@ishaan-jaff let me know if this looks good to merge

@jatorre
Contributor Author

jatorre commented Oct 28, 2025 via email

Contributor

@ishaan-jaff ishaan-jaff left a comment

lgtm

@ishaan-jaff ishaan-jaff merged commit e6a7cae into BerriAI:main Oct 29, 2025
3 of 6 checks passed