fix(apscheduler): prevent memory leaks from jitter and frequent job intervals #15846
Conversation
fix(apscheduler): prevent memory leaks from jitter and frequent job intervals
Fixes critical memory leak in APScheduler that causes 35GB+ memory allocations
during proxy startup and operation. The leak was identified through Memray
analysis showing massive allocations in normalize() and _apply_jitter()
functions.
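For reference, a leak of this kind can be captured with Memray's tracking API. The sketch below is illustrative only: the trace path and the startup_work() stand-in are assumptions, not the setup behind the numbers quoted in this PR.

```python
# Illustrative Memray capture; the trace path and startup_work() stand-in are
# placeholders, not the setup used for the figures quoted in this PR.
from memray import Tracker


def startup_work() -> None:
    """Stand-in for the proxy startup path being profiled."""


if __name__ == "__main__":
    with Tracker("proxy_startup.bin"):  # writes an allocation trace to this file
        startup_work()

# Inspect the trace afterwards, e.g.:
#   memray flamegraph proxy_startup.bin
#   memray stats proxy_startup.bin
```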
Key changes:
1. Remove jitter parameters from all scheduled jobs - jitter was causing
expensive normalize() calculations leading to memory explosion
2. Configure AsyncIOScheduler with optimized job_defaults (sketched just after this list):
- misfire_grace_time: 3600s (increased from 120s) to prevent backlog
calculations that trigger memory leaks
- coalesce: true to collapse missed runs
- max_instances: 1 to prevent concurrent job execution
- replace_existing: true to avoid duplicate jobs on restart
3. Increase minimum job intervals:
- PROXY_BATCH_WRITE_AT: 30s (was 10s)
- add_deployment/get_credentials jobs: 30s (was 10s)
4. Use fixed intervals with small random offsets instead of jitter for
job distribution across workers
5. Explicitly configure jobstores and executors to minimize overhead
6. Disable timezone awareness to reduce computation
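The sketch below shows roughly what a scheduler configured this way looks like in APScheduler 3.x. It is a minimal illustration of items 2, 5, and 6 above, not the exact litellm code, and pinning the timezone to UTC stands in for the "disable timezone awareness" item.

```python
# Minimal sketch of the scheduler construction described above (APScheduler 3.x);
# not the exact litellm code.
from apscheduler.executors.asyncio import AsyncIOExecutor
from apscheduler.jobstores.memory import MemoryJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler(
    jobstores={"default": MemoryJobStore()},   # explicit in-memory jobstore
    executors={"default": AsyncIOExecutor()},  # explicit asyncio executor
    job_defaults={
        "coalesce": True,            # collapse many missed runs into one
        "misfire_grace_time": 3600,  # tolerate runs up to 1 hour late (was 120s)
        "max_instances": 1,          # never run the same job concurrently
    },
    timezone="UTC",  # fixed timezone instead of repeated local-tz lookups
)
# Note: replace_existing is an add_job() argument rather than a job_defaults key,
# so it is passed per job when each job is registered.
```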
Memory impact:
- Before: 35GB with 483M allocations during startup
- After: <1GB with normal allocation patterns
Performance notes:
- Minimum job intervals increased from 10s to 30s (configurable via env vars)
- Jobs can still be distributed across workers using random start offsets
- No functional changes to job behavior, only timing and memory optimization
Testing:
- Added comprehensive test suite for scheduler configuration
- Verified no job execution backlog on startup
- Tested duplicate job prevention with replace_existing (see the sketch below)
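As a sketch of the duplicate-prevention check, the snippet below registers the same job id twice on a paused scheduler and asserts that only one job remains. The noop body and the paused-AsyncIOScheduler setup are assumptions, not the contents of the actual test file.

```python
# Sketch of a replace_existing duplicate-prevention check; the noop job body
# and paused-scheduler setup are assumptions, not the actual test code.
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def noop() -> None:
    """Stand-in job body; never actually executed here."""


async def _count_jobs_after_double_add() -> int:
    scheduler = AsyncIOScheduler()
    scheduler.start(paused=True)  # jobs reach the jobstore but nothing runs
    for _ in range(2):
        scheduler.add_job(
            noop,
            "interval",
            seconds=30,
            id="reset_budget_job",
            replace_existing=True,  # second add replaces instead of duplicating
        )
    count = len(scheduler.get_jobs())
    scheduler.shutdown(wait=False)
    return count


def test_replace_existing_prevents_duplicates() -> None:
    assert asyncio.run(_count_jobs_after_double_add()) == 1
```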
Related issue: Memory leak in production proxy servers with APScheduler
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
@jatorre is attempting to deploy a commit to the CLERKIEAI Team on Vercel. A member of the Team first needs to authorize it.

@mateo-di here is the PR
AlexsanderHamir
left a comment
Great changes! Just missing a small doc update in docs/my-website/docs/proxy/config_settings.md for the proxy_batch_write_at default value change.
Update documentation to reflect the new default value for PROXY_BATCH_WRITE_AT changed in PR BerriAI#15846. The default was increased from 10 seconds to 30 seconds to prevent memory leaks in APScheduler.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
ishaan-jaff
left a comment
small changes requested @jatorre
litellm/proxy/proxy_server.py
Outdated
scheduler = AsyncIOScheduler(
    job_defaults={
        "coalesce": True,  # collapse many missed runs into one
        "misfire_grace_time": 3600,  # ignore runs older than 1 hour (was 120)
can these variables be placed in constants.py
litellm/proxy/proxy_server.py
Outdated
    # REMOVED jitter parameter - major cause of memory leak
    id="reset_budget_job",
    replace_existing=True,
    misfire_grace_time=3600,  # job-specific grace time
constants.py
Address code review feedback from ishaan-jaff:
- Move scheduler configuration variables (coalesce, misfire_grace_time, max_instances, replace_existing) to litellm/constants.py
- Update all references in proxy_server.py to use the constants
- Improves maintainability and makes configuration values centralized

Requested-by: @ishaan-jaff
Related: BerriAI#15846

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
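As a rough sketch of what that centralization can look like (the constant names below are illustrative guesses, not necessarily the identifiers used in litellm/constants.py):

```python
# Illustrative only: in the PR, values like these live in litellm/constants.py
# and are imported by proxy_server.py; the constant names here are guesses.
from apscheduler.schedulers.asyncio import AsyncIOScheduler

# --- litellm/constants.py (sketch) ---
SCHEDULER_COALESCE = True            # collapse many missed runs into one
SCHEDULER_MISFIRE_GRACE_TIME = 3600  # seconds; tolerate runs up to 1 hour late
SCHEDULER_MAX_INSTANCES = 1          # no concurrent runs of the same job
SCHEDULER_REPLACE_EXISTING = True    # passed per add_job() call

# --- litellm/proxy/proxy_server.py (usage sketch) ---
scheduler = AsyncIOScheduler(
    job_defaults={
        "coalesce": SCHEDULER_COALESCE,
        "misfire_grace_time": SCHEDULER_MISFIRE_GRACE_TIME,
        "max_instances": SCHEDULER_MAX_INSTANCES,
    },
)
```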
@ishaan-jaff I've addressed your feedback by moving the APScheduler configuration variables to litellm/constants.py.
The doc update for docs/my-website/docs/proxy/config_settings.md is included as well. Latest commit: 1eabfd4
krrishdholakia
left a comment
@ishaan-jaff let me know if this looks good to merge
Yes it does
ishaan-jaff
left a comment
lgtm
Problem

Critical memory leak in APScheduler causing 35GB+ memory allocations during proxy startup and operation. The leak was identified through Memray analysis showing massive allocations in APScheduler's normalize() and _apply_jitter() functions.

Root Cause
The jitter parameter in scheduled jobs was triggering expensive calculations in APScheduler's internal functions; combined with very frequent job intervals (10s), this caused memory to accumulate rapidly.

Solution
Key Changes

- Removed the jitter parameter from all scheduled jobs (illustrated in the sketch below)
- misfire_grace_time: 3600s (increased from 120s) to prevent backlog calculations
- coalesce: true to collapse missed runs
- max_instances: 1 to prevent concurrent job execution
- replace_existing: true to avoid duplicate jobs on restart
- PROXY_BATCH_WRITE_AT: 30s (was 10s)
- add_deployment/get_credentials jobs: 30s (was 10s)
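A before/after sketch of the scheduling call follows; the update_spend callable, the job id, and the 0-5s offset range are placeholders used to illustrate the change, not litellm's exact values.

```python
# Before/after illustration; update_spend, the job id, and the 0-5s offset
# range are placeholders rather than litellm's exact values.
import random
from datetime import datetime, timedelta

from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def update_spend() -> None:
    """Stand-in for a periodic proxy job."""


scheduler = AsyncIOScheduler()

# Before: a 10s interval with jitter, which exercised APScheduler's
# _apply_jitter()/normalize() paths on every run:
#   scheduler.add_job(update_spend, "interval", seconds=10, jitter=5,
#                     id="update_spend_job")

# After: a 30s fixed interval with no jitter; a small random start offset
# still spreads workers out, and replace_existing avoids duplicates on restart.
scheduler.add_job(
    update_spend,
    "interval",
    seconds=30,
    id="update_spend_job",
    replace_existing=True,
    misfire_grace_time=3600,
    next_run_time=datetime.now() + timedelta(seconds=random.uniform(0, 5)),
)
```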
Files Modified

- litellm/proxy/proxy_server.py - Main scheduler configuration
- litellm/constants.py - Updated PROXY_BATCH_WRITE_AT default (10s → 30s)
- enterprise/litellm_enterprise/integrations/prometheus.py - Prometheus job config
- tests/basic_proxy_startup_tests/test_apscheduler_memory_fix.py - Test suite (NEW)

Impact
Memory

- Before: 35GB with 483M allocations during startup
- After: <1GB with normal allocation patterns

Performance

- Minimum job intervals increased from 10s to 30s (configurable via the PROXY_BATCH_WRITE_AT env var)
- No functional changes to job behavior, only timing and memory optimization

Testing

- Added a test suite for the scheduler configuration
- Verified no job execution backlog on startup
- Tested duplicate job prevention with replace_existing

Breaking Changes

- Default PROXY_BATCH_WRITE_AT increased from 10s → 30s. Users can override via environment variable if they need more frequent updates (though this may reintroduce memory issues with very low values).
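For operators who do want the old cadence, the override is just an environment variable. The snippet below assumes the constant falls back to an env lookup roughly like this; the actual wiring in litellm/constants.py may differ.

```python
# Assumed shape of the fallback; litellm's actual constants.py wiring may differ.
# Setting PROXY_BATCH_WRITE_AT=10 in the environment before startup restores the
# old 10s cadence, at the risk of reintroducing the memory pressure noted above.
import os

PROXY_BATCH_WRITE_AT = int(os.getenv("PROXY_BATCH_WRITE_AT", "30"))  # seconds
```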
Context

This fix originated from production experience at CartoDB where memory leaks were causing proxy crashes. The fix has been battle-tested in production environments.
Note: Please assign @mdiloreto as reviewer/collaborator on this PR.
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]