# OOM Protection and Auto-Restart Feature

## Overview

This service includes an Out of Memory (OOM) protection system that monitors memory usage and automatically triggers a graceful restart when memory thresholds are exceeded, preventing hard crashes.

## Features

### 1. **Automatic Memory Monitoring**
- A background thread monitors system memory every 30 seconds
- Tracks both system RAM and GPU memory usage
- Detects when memory usage exceeds the configured threshold

### 2. **Graceful Restart**
- When OOM is detected, the service initiates a graceful shutdown:
  1. Stops accepting new requests
  2. Waits for current processing to complete (60s timeout)
  3. Clears GPU memory
  4. Forces garbage collection
  5. Restarts the service automatically

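The shutdown sequence above can be sketched as a small helper. This is a minimal illustration, not the service's actual code: `is_idle`, `clear_gpu`, and `restart` are hypothetical callbacks (in practice `clear_gpu` might call `torch.cuda.empty_cache()` and `restart` might `os.execv()` the current interpreter).

```python
import gc
import time

def graceful_restart(is_idle, clear_gpu, restart, timeout=60.0):
    """Drain in-flight work, free memory, then restart the process."""
    deadline = time.monotonic() + timeout
    while not is_idle() and time.monotonic() < deadline:
        time.sleep(0.1)   # step 2: wait for current processing (60s timeout)
    clear_gpu()           # step 3: clear GPU memory
    gc.collect()          # step 4: force garbage collection
    restart()             # step 5: restart the service
```

Injecting the three callbacks keeps the drain/cleanup ordering testable without touching a real GPU or process.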
### 3. **Enhanced Health Check**
The `/health` endpoint now includes detailed memory information:

```bash
curl http://localhost:8000/health
```

Response:
```json
{
  "status": "healthy",
  "model_loaded": true,
  "memory": {
    "system_memory_percent": 65.2,
    "system_memory_available_gb": 14.5,
    "system_memory_total_gb": 32.0,
    "process_memory_gb": 3.2,
    "process_memory_percent": 10.1,
    "gpu_memory_allocated_gb": 2.1,
    "gpu_memory_reserved_gb": 4.0
  },
  "oom_protection_enabled": true,
  "memory_threshold_percent": 90.0
}
```

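A client can use the `memory` block in this response to track how much headroom remains before the threshold trips. A minimal sketch, assuming only the JSON shape shown above:

```python
import json

def memory_headroom(health: dict) -> float:
    """Percentage points of system memory left before the OOM threshold."""
    used = health["memory"]["system_memory_percent"]
    return health["memory_threshold_percent"] - used

# Example using the fields from the sample response above
sample = json.loads(
    '{"memory": {"system_memory_percent": 65.2}, "memory_threshold_percent": 90.0}'
)
```

A monitoring dashboard could alert when the headroom drops below a few percentage points, before the restart actually fires.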
### 4. **Memory Monitoring Script**

Use the provided script to monitor memory usage in real time:

```bash
./monitor_memory.sh
```

Output:
```
==========================================
  OCR Service Memory Monitor
==========================================

📡 Service PID: 12345

[2026-03-19 10:30:00] Process: 3.20GB | System: 65.2% | GPU: 2.1 GB / 24.0 GB
[2026-03-19 10:30:05] Process: 3.25GB | System: 66.1% | GPU: 2.1 GB / 24.0 GB
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OOM_RESTART_ENABLED` | `true` | Enable/disable OOM protection |
| `OOM_MEMORY_THRESHOLD` | `90` | Memory threshold percentage (0-100) |

### Configuration Example (.env)

```bash
# Enable OOM protection (default: true)
OOM_RESTART_ENABLED=true

# Set the memory threshold to 85% for more aggressive protection
OOM_MEMORY_THRESHOLD=85
```

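Parsing these two variables is straightforward. A sketch of how the service might read them, with the documented defaults (the exact parsing in serve_pdf.py may differ):

```python
import os

def load_oom_config(env=None):
    """Read OOM settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    enabled = env.get("OOM_RESTART_ENABLED", "true").strip().lower() in ("1", "true", "yes")
    threshold = float(env.get("OOM_MEMORY_THRESHOLD", "90"))
    if not 0 <= threshold <= 100:
        raise ValueError("OOM_MEMORY_THRESHOLD must be between 0 and 100")
    return enabled, threshold
```

Validating the threshold at startup surfaces a typo in `.env` immediately, rather than as a service that never (or always) restarts.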
## How It Works

### 1. Background Monitoring
A dedicated background thread checks memory every 30 seconds:

```python
# Simplified sketch of the loop in serve_pdf.py (helper name illustrative)
def monitor_memory_loop():
    while True:
        # Check system, process, and GPU memory usage;
        # trigger a graceful restart if the threshold is exceeded
        check_memory_and_maybe_restart()
        time.sleep(30)
```

### 2. Pre-Processing Check
Before processing each PDF, the service:
- Checks if the system is already in an OOM condition
- Rejects the request if memory is critically low
- Triggers a restart if needed

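The pre-processing check amounts to a gate in front of the handler. A sketch, with the memory reader injected so it can be exercised without psutil (in the real service, `get_system_percent` would be something like `psutil.virtual_memory().percent`):

```python
def check_memory_before_processing(get_system_percent, threshold=90.0):
    """Raise MemoryError if system memory is already above the threshold."""
    pct = get_system_percent()  # e.g. psutil.virtual_memory().percent
    if pct >= threshold:
        raise MemoryError(
            f"System OOM condition detected before processing: {pct}% memory usage"
        )
    return pct
```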
### 3. Post-Processing Check
After processing each PDF, the service:
- Verifies memory hasn't exceeded the threshold
- Triggers a restart if memory is high

### 4. Exception Handling
The service catches specific OOM exceptions and automatically triggers a graceful restart:
- `torch.cuda.OutOfMemoryError` - GPU OOM
- `MemoryError` - system RAM OOM

| 120 | + |
| 121 | +### Test 1: Monitor Memory |
| 122 | +```bash |
| 123 | +# Watch memory usage |
| 124 | +./monitor_memory.sh |
| 125 | + |
| 126 | +# In another terminal, process a large PDF |
| 127 | +curl -X POST http://localhost:8000/process_pdf \ |
| 128 | + -H "Authorization: Bearer your-token" \ |
| 129 | + -F "file=@large.pdf" |
| 130 | +``` |
| 131 | + |
| 132 | +### Test 2: Simulate High Memory |
| 133 | +You can temporarily lower the threshold to test the restart mechanism: |
| 134 | + |
| 135 | +```bash |
| 136 | +# In .env |
| 137 | +OOM_MEMORY_THRESHOLD=30 # Will trigger restart at 30% |
| 138 | + |
| 139 | +# Restart service |
| 140 | +docker-compose restart |
| 141 | +``` |
| 142 | + |
| 143 | +### Test 3: Check Health Endpoint |
| 144 | +```bash |
| 145 | +watch -n 5 'curl -s http://localhost:8000/health | jq' |
| 146 | +``` |
| 147 | + |
## Troubleshooting

### Service Keeps Restarting
**Problem:** The service enters a restart loop.

**Solutions:**
1. Increase `OOM_MEMORY_THRESHOLD`
2. Reduce `BATCH_SIZE` in serve_pdf.py
3. Reduce `MAX_CONCURRENCY` in config.py
4. Process smaller PDFs

### Memory Still Too High
**Problem:** Even with protections enabled, memory usage stays too high.

**Solutions:**
1. Reduce the DPI in `pdf_to_images_high_quality()` (default: 144)
2. Reduce `BATCH_SIZE` in `process_pdf_internal()` (default: 10)
3. Reduce `NUM_WORKERS` in config.py
4. Limit concurrent requests

### Monitoring Not Working
**Problem:** The memory monitor does not start.

**Check:**
```bash
# Check the logs
docker logs <container> | grep -i "memory monitor"

# Verify psutil is installed
python -c "import psutil; print(psutil.__version__)"
```

## Log Messages

### Normal Operation
```
✅ Memory monitor started (threshold: 90%, interval: 30s)
```

### OOM Detected
```
⚠️ OOM CONDITION DETECTED:
   System Memory: 91.2%
   Process Memory: 18.50 GB
   Available Memory: 2.80 GB
🔄 INITIATING GRACEFUL RESTART DUE TO OOM CONDITION
```

### OOM During Processing
```
❌ System OOM condition detected before processing: 92.5% memory usage
❌ GPU OOM: CUDA out of memory
```

## Best Practices

1. **Monitor Regularly**: Use `monitor_memory.sh` during operation
2. **Set an Appropriate Threshold**: The default is 90%; adjust it to your system
3. **Process Smaller Batches**: If you hit memory issues, reduce `BATCH_SIZE`
4. **Check the Health Endpoint**: Use `/health` to monitor memory trends
5. **Review Logs**: Check for OOM warnings to identify problematic files

## Performance Impact

- **Memory overhead**: ~5-10 MB for the monitoring thread
- **CPU overhead**: negligible (<0.1% CPU)
- **Restart time**: ~5-10 seconds for a graceful shutdown

## Support

For issues or questions about OOM protection:
1. Check the logs for OOM warnings
2. Use the `/health` endpoint to monitor memory
3. Review the configuration in the `.env` file
4. Adjust the threshold to your system's capacity