There should be redundant RL inference servers/nodes/models. The API and supervisors should also be stateless besides the metrics collection server. - automated health checking - RAFT