During long runs, Strix sometimes fails to connect to the LLM backend (e.g. Ollama hosted on another server). When this happens, the whole run can stall until someone intervenes manually.
We already have retry logic in llm.py (exponential backoff), but it needs to be more robust for real-world flaky connections: a longer overall retry window, and smarter retry conditions that treat transient failures (network timeouts, connection refused) as retryable rather than fatal.
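As a rough sketch of the direction, one option is a capped exponential backoff with jitter that only retries on transient connection errors. The names below (`retry_on_connection_errors`, the parameter defaults, the injectable `sleep`) are illustrative, not the actual API in llm.py:

```python
import random
import time
from functools import wraps


def retry_on_connection_errors(
    max_attempts=8,            # wider retry window than a typical default of 3
    base_delay=0.5,            # seconds; first backoff step
    max_delay=60.0,            # cap so the backoff doesn't grow unbounded
    retryable=(ConnectionError, TimeoutError),  # transient failures only
    sleep=time.sleep,          # injectable so tests don't actually wait
):
    """Retry a callable with capped exponential backoff plus full jitter."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retryable:
                    if attempt == max_attempts:
                        raise  # retry window exhausted; surface the error
                    # full jitter: sleep a random amount up to the capped backoff
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    sleep(random.uniform(0, delay))
        return wrapper
    return decorator


# Hypothetical flaky backend call: fails twice with "connection refused",
# then succeeds. ConnectionRefusedError is a subclass of ConnectionError.
calls = {"n": 0}

@retry_on_connection_errors(max_attempts=5, base_delay=0.01, sleep=lambda s: None)
def query_backend():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError("backend unreachable")
    return "ok"

result = query_backend()  # succeeds on the third attempt
```

Keeping the retryable exception tuple explicit matters: errors like authentication failures or malformed requests should fail fast instead of burning the whole retry window.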