Redundancy and Load balancing

There should be redundant RL inference servers/nodes/models.

The API and supervisors should also be stateless besides the metrics collection server.
- automated health checking
- RAFT