How can we fully characterize the leader's strategy through queries?
How can the effect of the leader's policy on the follower be fully understood with a minimal number of queries?
Below are the listed papers
-
distribution of tasks p(T)
- different tax brackets as a distribution over multiple planner policies
-
the meta-training process learns a policy that adapts to the task at hand by conditioning on the history of past transitions, which we refer to as context c
- we can query the planner's actions given observations of the environment
-
Condition the policy as π_θ(a|s, z) in order to adapt its behavior to the task
- create embeddings of the planner networks to train a latent context representing all possible policies
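A minimal sketch of what conditioning on z could look like: a linear-softmax policy whose input is the state concatenated with the latent. The name `policy_probs`, the `weights` layout, and the linear form are illustrative assumptions, not the architecture from the papers.

```python
import math

def policy_probs(state, z, weights):
    """Return action probabilities pi(a | s, z) for a linear-softmax policy.

    `weights` holds one row per action; each row has len(state) + len(z)
    entries. The linear form is an assumption made for illustration only.
    """
    x = list(state) + list(z)  # condition on z by concatenating it with the state
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: 2-dim state, 1-dim latent, 2 actions.
weights = [[0.5, -0.2, 1.0],
           [0.1, 0.3, -1.0]]
state = [1.0, 2.0]
probs_a = policy_probs(state, [0.0], weights)  # latent for one task
probs_b = policy_probs(state, [1.0], weights)  # latent for another task
```

The same state yields different action distributions under different latents, which is the sense in which z selects a behavior.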
-
We train an inference network qφ(z|c), parameterized by φ, that estimates the posterior p(z|c)
-
optimizing qφ(z|c) to reconstruct the MDP by learning predictive models of reward and dynamics
-
qφ(z|c) can be optimized in a model-free manner to model the state-action value functions
-
modeling qφ(z|c) as a product of independent factors
-
use Gaussian factors Ψ_φ(z|c_n) = N(f^μ_φ(c_n), f^σ_φ(c_n)), which result in a Gaussian posterior.
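The product of independent Gaussian factors can be sketched as follows. This is a minimal one-dimensional example, assuming the second output of each factor is a variance and that latent dimensions are treated independently; `gaussian_product` and the example values are hypothetical, not taken from the papers.

```python
def gaussian_product(means, variances):
    """Combine 1-D Gaussian factors N(mean_n, var_n) into one Gaussian.

    The product of Gaussian densities is, up to normalization, a Gaussian
    whose precision (1 / variance) is the sum of the factor precisions and
    whose mean is the precision-weighted average of the factor means.
    """
    precisions = [1.0 / v for v in variances]
    post_var = 1.0 / sum(precisions)
    post_mean = post_var * sum(m * p for m, p in zip(means, precisions))
    return post_mean, post_var

# Hypothetical per-transition outputs of the inference network.
factor_means = [0.0, 2.0]
factor_vars = [1.0, 1.0]
post_mean, post_var = gaussian_product(factor_means, factor_vars)
```

Because precisions add, each extra context transition can only shrink the posterior variance, and the result does not depend on the order of the transitions.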