
Conversation

@jatorre
Contributor

@jatorre jatorre commented Oct 23, 2025

Problem

A critical memory leak in APScheduler was causing 35GB+ of memory allocations during proxy startup and operation. The leak was identified through Memray analysis, which showed massive allocations in APScheduler's normalize() and _apply_jitter() functions.

Root Cause

The jitter parameter on scheduled jobs triggered expensive calculations in APScheduler's internal functions; combined with very frequent job intervals (10s), this caused memory to accumulate rapidly.
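
For illustration, the leaking pattern looked roughly like the sketch below; the job callable and the exact values are assumptions for the sketch, not the actual litellm code:

```python
# Illustrative only: a frequent interval combined with jitter is the pattern that leaked.
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def batch_write_to_db():
    """Hypothetical placeholder for the proxy's periodic DB-write job."""


scheduler = AsyncIOScheduler()
scheduler.add_job(
    batch_write_to_db,
    trigger="interval",
    seconds=10,  # very frequent interval
    jitter=5,    # randomizes every run; removed by this PR
    id="batch_write_job",
)
```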

Solution

Key Changes

  1. Removed jitter parameters from all scheduled jobs - jitter was the primary memory leak source
  2. Configured AsyncIOScheduler with memory-optimized job_defaults (see the sketch after this list):
    • misfire_grace_time: 3600s (increased from 120s) to prevent backlog calculations
    • coalesce: true to collapse missed runs
    • max_instances: 1 to prevent concurrent job execution
    • replace_existing: true to avoid duplicate jobs on restart
  3. Increased minimum job intervals:
    • PROXY_BATCH_WRITE_AT: 30s (was 10s)
    • add_deployment/get_credentials jobs: 30s (was 10s)
  4. Use fixed intervals with small random offsets instead of jitter for job distribution across workers
  5. Explicitly configured jobstores and executors to minimize overhead
  6. Disabled timezone awareness to reduce computation
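
A minimal sketch of the setup described above, assuming APScheduler 3.x. The job callable, interval, and offset range are illustrative rather than the exact litellm code; note that replace_existing is a per-job argument rather than a job_defaults key:

```python
# Illustrative sketch of the memory-safe scheduler setup (APScheduler 3.x).
import asyncio
import random
from datetime import datetime, timedelta, timezone

from apscheduler.executors.asyncio import AsyncIOExecutor
from apscheduler.jobstores.memory import MemoryJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def batch_write_to_db():
    """Hypothetical placeholder for the proxy's periodic DB-write job."""


async def main():
    scheduler = AsyncIOScheduler(
        jobstores={"default": MemoryJobStore()},   # explicit jobstore, minimal overhead
        executors={"default": AsyncIOExecutor()},  # explicit executor
        job_defaults={
            "coalesce": True,            # collapse many missed runs into one
            "misfire_grace_time": 3600,  # ignore runs older than 1 hour (was 120)
            "max_instances": 1,          # never run the same job concurrently
        },
        timezone="UTC",  # pin to UTC to keep timezone handling cheap
    )

    # No jitter: stagger workers with a small random start offset instead.
    start_offset = random.uniform(0, 5)
    scheduler.add_job(
        batch_write_to_db,
        trigger="interval",
        seconds=30,             # PROXY_BATCH_WRITE_AT default (was 10)
        id="batch_write_job",
        replace_existing=True,  # avoid duplicate jobs on restart
        next_run_time=datetime.now(timezone.utc) + timedelta(seconds=start_offset),
    )
    scheduler.start()
    await asyncio.sleep(90)  # keep the loop alive long enough for a few runs


if __name__ == "__main__":
    asyncio.run(main())
```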

Files Modified

  • litellm/proxy/proxy_server.py - Main scheduler configuration
  • litellm/constants.py - Updated PROXY_BATCH_WRITE_AT default (10s → 30s)
  • enterprise/litellm_enterprise/integrations/prometheus.py - Prometheus job config
  • tests/basic_proxy_startup_tests/test_apscheduler_memory_fix.py - Test suite (NEW)

Impact

Memory

  • Before: 35GB with 483M allocations during startup
  • After: <1GB with normal allocation patterns

Performance

  • Minimum job intervals increased from 10s → 30s (configurable via PROXY_BATCH_WRITE_AT env var)
  • Jobs still distributed across workers using random start offsets
  • No functional changes to job behavior, only timing and memory optimization

Testing

  • Added a comprehensive test suite validating the scheduler configuration (a sketch of the kind of checks follows this list)
  • Verified no job execution backlog on startup
  • Tested duplicate job prevention with replace_existing
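
For illustration, checks of this kind might look as follows (assuming APScheduler 3.x); the test names and bodies are sketches, not the actual contents of test_apscheduler_memory_fix.py:

```python
# Illustrative checks only; the real suite lives in
# tests/basic_proxy_startup_tests/test_apscheduler_memory_fix.py.
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler


def test_job_defaults_are_memory_safe():
    scheduler = AsyncIOScheduler(
        job_defaults={"coalesce": True, "misfire_grace_time": 3600, "max_instances": 1}
    )
    # _job_defaults is an internal attribute, but stable across APScheduler 3.x.
    assert scheduler._job_defaults["coalesce"] is True
    assert scheduler._job_defaults["misfire_grace_time"] == 3600
    assert scheduler._job_defaults["max_instances"] == 1


def test_replace_existing_prevents_duplicate_jobs():
    async def run():
        scheduler = AsyncIOScheduler()
        scheduler.start(paused=True)  # jobs register but never execute
        for _ in range(2):
            scheduler.add_job(
                lambda: None,
                trigger="interval",
                seconds=30,
                id="batch_write_job",
                replace_existing=True,
            )
        assert len(scheduler.get_jobs()) == 1
        scheduler.shutdown(wait=False)

    asyncio.run(run())
```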

Breaking Changes

⚠️ Default PROXY_BATCH_WRITE_AT increased from 10s → 30s. Users can override via environment variable if they need more frequent updates (though this may reintroduce memory issues with very low values).
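
For example, a deployment that wants the old cadence can set the variable back to 10. The snippet below is a hypothetical sketch of how such an override is read, not litellm's actual parsing code:

```python
import os

# Hypothetical sketch: read the override, falling back to the new 30s default.
proxy_batch_write_at = int(os.getenv("PROXY_BATCH_WRITE_AT", "30"))
```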

Context

This fix originated from production experience at CartoDB where memory leaks were causing proxy crashes. The fix has been battle-tested in production environments.

Note: Please assign @mdiloreto as reviewer/collaborator on this PR.


🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

…ntervals

Fixes critical memory leak in APScheduler that causes 35GB+ memory allocations
during proxy startup and operation. The leak was identified through Memray
analysis showing massive allocations in normalize() and _apply_jitter()
functions.

Key changes:
1. Remove jitter parameters from all scheduled jobs - jitter was causing
   expensive normalize() calculations leading to memory explosion
2. Configure AsyncIOScheduler with optimized job_defaults:
   - misfire_grace_time: 3600s (increased from 120s) to prevent backlog
     calculations that trigger memory leaks
   - coalesce: true to collapse missed runs
   - max_instances: 1 to prevent concurrent job execution
   - replace_existing: true to avoid duplicate jobs on restart
3. Increase minimum job intervals:
   - PROXY_BATCH_WRITE_AT: 30s (was 10s)
   - add_deployment/get_credentials jobs: 30s (was 10s)
4. Use fixed intervals with small random offsets instead of jitter for
   job distribution across workers
5. Explicitly configure jobstores and executors to minimize overhead
6. Disable timezone awareness to reduce computation

Memory impact:
- Before: 35GB with 483M allocations during startup
- After: <1GB with normal allocation patterns

Performance notes:
- Minimum job intervals increased from 10s to 30s (configurable via env vars)
- Jobs can still be distributed across workers using random start offsets
- No functional changes to job behavior, only timing and memory optimization

Testing:
- Added comprehensive test suite for scheduler configuration
- Verified no job execution backlog on startup
- Tested duplicate job prevention with replace_existing

Related issue: Memory leak in production proxy servers with APScheduler

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@vercel

vercel bot commented Oct 23, 2025

@jatorre is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

@jatorre
Contributor Author

jatorre commented Oct 23, 2025

@mateo-di here is the PR

Collaborator

@AlexsanderHamir AlexsanderHamir left a comment

Great changes! Just missing a small doc update in docs/my-website/docs/proxy/config_settings.md for the proxy_batch_write_at default value change.

Update documentation to reflect the new default value for PROXY_BATCH_WRITE_AT
changed in PR BerriAI#15846. The default was increased from 10 seconds to 30 seconds
to prevent memory leaks in APScheduler.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@jatorre jatorre marked this pull request as ready for review October 23, 2025 16:53
Contributor

@ishaan-jaff ishaan-jaff left a comment

small changes requested @jatorre

scheduler = AsyncIOScheduler(
    job_defaults={
        "coalesce": True,  # collapse many missed runs into one
        "misfire_grace_time": 3600,  # ignore runs older than 1 hour (was 120)

can these variables be placed in constants.py

    # REMOVED jitter parameter - major cause of memory leak
    id="reset_budget_job",
    replace_existing=True,
    misfire_grace_time=3600,  # job-specific grace time

constants.py

Address code review feedback from ishaan-jaff:
- Move scheduler configuration variables (coalesce, misfire_grace_time,
  max_instances, replace_existing) to litellm/constants.py
- Update all references in proxy_server.py to use the constants
- Improves maintainability and makes configuration values centralized

Requested-by: @ishaan-jaff
Related: BerriAI#15846

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@jatorre
Contributor Author

jatorre commented Oct 24, 2025

@ishaan-jaff I've addressed your feedback by moving the APScheduler configuration variables to constants.py:

  • Added APSCHEDULER_COALESCE, APSCHEDULER_MISFIRE_GRACE_TIME, APSCHEDULER_MAX_INSTANCES, APSCHEDULER_REPLACE_EXISTING to litellm/constants.py (sketched below)
  • Updated litellm/proxy/proxy_server.py to import and use these constants throughout
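
A sketch of how those constants and their use might look; the values mirror the PR description, but the exact layout in litellm/constants.py and proxy_server.py may differ:

```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

# Constants as described above (in litellm they live in litellm/constants.py).
APSCHEDULER_COALESCE = True            # collapse many missed runs into one
APSCHEDULER_MISFIRE_GRACE_TIME = 3600  # seconds; ignore runs older than 1 hour
APSCHEDULER_MAX_INSTANCES = 1          # never run the same job concurrently
APSCHEDULER_REPLACE_EXISTING = True    # passed per add_job(..., replace_existing=...) call

scheduler = AsyncIOScheduler(
    job_defaults={
        "coalesce": APSCHEDULER_COALESCE,
        "misfire_grace_time": APSCHEDULER_MISFIRE_GRACE_TIME,
        "max_instances": APSCHEDULER_MAX_INSTANCES,
    }
)
```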

The doc update for proxy_batch_write_at that @AlexsanderHamir requested was already included in the PR.

Latest commit: 1eabfd4

@krrishdholakia
Contributor

@ishaan-jaff let me know if this looks good to merge

@jatorre
Contributor Author

jatorre commented Oct 28, 2025 via email

Contributor

@ishaan-jaff ishaan-jaff left a comment

lgtm

@ishaan-jaff ishaan-jaff merged commit e6a7cae into BerriAI:main Oct 29, 2025
3 of 6 checks passed