This service includes a comprehensive Out of Memory (OOM) protection system that monitors memory usage and automatically triggers a graceful restart when memory thresholds are exceeded, preventing system crashes.
- Background thread monitors system memory every 30 seconds
- Tracks both system RAM and GPU memory usage
- Detects when memory usage exceeds the configured threshold
- When OOM is detected, the service initiates a graceful shutdown:
  - Stops accepting new requests
  - Waits for current processing to complete (60s timeout)
  - Clears GPU memory
  - Forces garbage collection
  - Restarts the service automatically
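The shutdown sequence above can be sketched as follows. This is a minimal illustration, not the service's actual code; the `server` object and its methods (`stop_accepting`, `active_requests`, `restart`) are hypothetical stand-ins:

```python
import gc
import time

def graceful_restart(server, timeout_s=60, poll_s=1.0):
    """Sketch of the graceful shutdown sequence (server API is assumed)."""
    server.stop_accepting()                 # 1. stop accepting new requests
    deadline = time.time() + timeout_s
    while server.active_requests() and time.time() < deadline:
        time.sleep(poll_s)                  # 2. wait for in-flight work (60s cap)
    # 3. clear GPU memory here (torch.cuda.empty_cache() on CUDA systems)
    gc.collect()                            # 4. force garbage collection
    server.restart()                        # 5. hand off to a fresh process
```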
The /health endpoint now includes detailed memory information:
```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "memory": {
    "system_memory_percent": 65.2,
    "system_memory_available_gb": 14.5,
    "system_memory_total_gb": 32.0,
    "process_memory_gb": 3.2,
    "process_memory_percent": 10.1,
    "gpu_memory_allocated_gb": 2.1,
    "gpu_memory_reserved_gb": 4.0
  },
  "oom_protection_enabled": true,
  "memory_threshold_percent": 90.0
}
```

Use the provided script to monitor memory usage in real-time:
```bash
./monitor_memory.sh
```

Output:

```text
==========================================
OCR Service Memory Monitor
==========================================
📡 Service PID: 12345
[2026-03-19 10:30:00] Process: 3.20GB | System: 65.2% | GPU: 2.1 GB / 24.0 GB
[2026-03-19 10:30:05] Process: 3.25GB | System: 66.1% | GPU: 2.1 GB / 24.0 GB
```
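The health payload can also be consumed programmatically. A minimal stdlib sketch, assuming the endpoint and field names shown in the response above; the helper names and the 80% warning level are illustrative, not part of the service:

```python
import json
import urllib.request

def get_health(base_url="http://localhost:8000"):
    """Fetch and decode the /health payload (endpoint shown above)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)

def memory_pressure(health, warn_at=80.0):
    """Return (is_high, percent) from a decoded health payload."""
    pct = health["memory"]["system_memory_percent"]
    return pct >= warn_at, pct

# Example usage:
# high, pct = memory_pressure(get_health())
# if high:
#     print(f"memory pressure: {pct}%")
```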
| Variable | Default | Description |
|---|---|---|
| `OOM_RESTART_ENABLED` | `true` | Enable/disable OOM protection |
| `OOM_MEMORY_THRESHOLD` | `90` | Memory threshold percentage (0-100) |
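A sketch of how these variables might be parsed at startup, using the documented defaults (`true` and `90`); the function name and validation are illustrative, not the service's actual code:

```python
import os

def load_oom_config(env=os.environ):
    """Parse OOM settings with the documented defaults (true / 90)."""
    enabled = env.get("OOM_RESTART_ENABLED", "true").strip().lower() in ("1", "true", "yes")
    threshold = float(env.get("OOM_MEMORY_THRESHOLD", "90"))
    if not 0 <= threshold <= 100:
        raise ValueError("OOM_MEMORY_THRESHOLD must be in 0-100")
    return enabled, threshold
```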
```bash
# Enable OOM protection (default: true)
OOM_RESTART_ENABLED=true

# Set memory threshold to 85% for more aggressive protection
OOM_MEMORY_THRESHOLD=85
```

A dedicated background thread checks memory every 30 seconds:
```python
# In serve_pdf.py
def monitor_memory_loop():
    # Checks:
    # - System memory usage
    # - Process memory usage
    # - GPU memory usage
    # - Triggers restart if threshold exceeded
    ...
```

Before processing each PDF:
- Checks if system is already in OOM condition
- Rejects processing if memory is critically low
- Triggers restart if needed
After processing each PDF:
- Verifies memory hasn't exceeded threshold
- Triggers restart if memory is high
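Both the pre- and post-processing gates reduce to the same threshold comparison. A pure-logic sketch, with field names mirroring the `/health` response; the helper names are hypothetical:

```python
def oom_condition(system_memory_percent, threshold=90.0):
    """True when memory usage is at or above the restart threshold."""
    return system_memory_percent >= threshold

def guard_request(system_memory_percent, threshold=90.0):
    """Pre-processing gate: reject new work instead of risking a hard OOM."""
    if oom_condition(system_memory_percent, threshold):
        raise MemoryError(
            f"System OOM condition detected before processing: "
            f"{system_memory_percent:.1f}% memory usage"
        )
```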
Catches specific OOM exceptions:
- `torch.cuda.OutOfMemoryError`: GPU OOM
- `MemoryError`: system RAM OOM
- Automatically triggers graceful restart
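The exception handling above might look like the following sketch. The `on_oom` callback stands in for the restart trigger; the real service also catches `torch.cuda.OutOfMemoryError`, which is omitted here so the sketch has no GPU dependency:

```python
import gc

def run_with_oom_guard(process_fn, *args, on_oom=lambda: None):
    """Wrap a processing call; on system OOM, clean up and signal a restart."""
    try:
        return process_fn(*args)
    except MemoryError:
        gc.collect()   # force garbage collection before handing off
        on_oom()       # e.g. initiate the graceful restart described above
        raise
```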
```bash
# Watch memory usage
./monitor_memory.sh

# In another terminal, process a large PDF
curl -X POST http://localhost:8000/process_pdf \
  -H "Authorization: Bearer your-token" \
  -F "file=@large.pdf"
```

You can temporarily lower the threshold to test the restart mechanism:
```bash
# In .env
OOM_MEMORY_THRESHOLD=30  # Will trigger restart at 30%

# Restart service
docker-compose restart
```

Watch the health endpoint while testing:

```bash
watch -n 5 'curl -s http://localhost:8000/health | jq'
```

Problem: Service enters restart loop
Solutions:
- Increase `OOM_MEMORY_THRESHOLD`
- Reduce `BATCH_SIZE` in serve_pdf.py
- Reduce `MAX_CONCURRENCY` in config.py
- Process smaller PDFs
Problem: Even with protections, memory usage is too high
Solutions:
- Reduce the DPI in `pdf_to_images_high_quality()` (default: 144)
- Reduce `BATCH_SIZE` in `process_pdf_internal()` (default: 10)
- Reduce `NUM_WORKERS` in config.py
- Limit concurrent requests
Problem: Memory monitor not starting
Check:
```bash
# Check logs
docker logs <container> | grep -i "memory monitor"

# Verify psutil is installed
python -c "import psutil; print(psutil.__version__)"
```

Expected log messages:

```text
✅ Memory monitor started (threshold: 90%, interval: 30s)
⚠️ OOM CONDITION DETECTED:
   System Memory: 91.2%
   Process Memory: 18.50 GB
   Available Memory: 2.80 GB
🔄 INITIATING GRACEFUL RESTART DUE TO OOM CONDITION
❌ System OOM condition detected before processing: 92.5% memory usage
❌ GPU OOM: CUDA out of memory
```
- Monitor Regularly: Use `monitor_memory.sh` during operation
- Set Appropriate Threshold: 90% is the default; adjust based on your system
- Process Smaller Batches: If you have memory issues, reduce `BATCH_SIZE`
- Check Health Endpoint: Use `/health` to monitor memory trends
- Review Logs: Check for OOM warnings to identify problematic files
- Memory overhead: ~5-10 MB for monitoring thread
- CPU overhead: Negligible (<0.1% CPU)
- Restart time: ~5-10 seconds for graceful shutdown
For issues or questions about OOM protection:
- Check logs for OOM warnings
- Use the `/health` endpoint to monitor memory
- Review the configuration in the `.env` file
- Adjust the threshold based on your system capacity