# L2 RL Basics

These are my personal lecture notes for Georgia Tech's Reinforcement Learning course (CS 7642, Spring 2024) by Charles Isbell and Michael Littman. All images are taken from the course's lectures unless stated otherwise.

## References and further readings

- Littman, M. L. (1996). *Algorithms for sequential decision-making* (Chapter 2). Brown University.
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine Learning*, 3, 9-44.
- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction*. MIT Press. Chapters 4.1-4.8, 5.1-5.4, 5.10, 6.1-6.7, 6.9, 8.1-8.5, 8.12-8.13.

## Introduction

- RL is about the interaction between an agent and an environment
- The agent takes actions, and the environment responds to those actions with rewards and new states
- RL is about learning a behavior that interacts with the environment:
  - Behavior structures:
    - Plan: a fixed sequence of actions
    - Conditional plan: a plan that includes "if" statements
    - Stationary policy (aka universal plan):
      - a mapping from states to actions (like a conditional plan, but with an "if" at every state, so it reduces to a mapping)
      - very large (it specifies what to do in every state)
      - there always exists an optimal stationary policy (a sketch of a stationary policy follows this list)
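
As a minimal illustration (not from the lecture), a stationary policy over a small, discrete state space is just a mapping from states to actions; the state and action names below are hypothetical:

```python
# A minimal sketch of a stationary policy for a toy environment.
# The states and actions below are made up, purely for illustration.

policy = {
    "start":    "right",
    "hallway":  "right",
    "junction": "up",
    "goal":     "stay",
}

def act(state: str) -> str:
    """A stationary policy: the action depends only on the current state."""
    return policy[state]

print(act("junction"))  # -> "up"
```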

## Evaluating a policy

Calculate the values of state-action-reward sequences generated by a policy (a code sketch of these steps follows the list):

1. Map state transitions to immediate rewards (e.g., using the reward function $R(s, a)$)
2. Truncate sequences according to the horizon (e.g., $T = 10$ steps)
3. Summarize each sequence (i.e., compute its return, e.g., the discounted sum of rewards $\sum_{t=0}^T \gamma^t r_t$)
4. Summarize over sequences (average: the expected return)
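
A minimal sketch of these four steps, assuming step 1 has already produced sampled reward sequences (the sequences, horizon, and discount factor below are made-up values):

```python
# Monte Carlo estimate of a policy's value: truncate each sampled reward
# sequence at the horizon, compute its discounted return, then average.
# The reward sequences, horizon T, and discount gamma are hypothetical.

def discounted_return(rewards, gamma, T):
    """Steps 2-3: truncate at horizon T, then compute sum of gamma^t * r_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards[: T + 1]))

# Step 1 is assumed done: each inner list holds the immediate rewards r_t
# produced by R(s, a) along one sampled trajectory.
sequences = [
    [0, 0, 1, 0, 5],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 2, 1],
]

gamma, T = 0.9, 10

# Step 4: average the per-sequence returns to estimate the expected return.
returns = [discounted_return(seq, gamma, T) for seq in sequences]
expected_return = sum(returns) / len(returns)
print(f"estimated value: {expected_return:.3f}")
```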

## Evaluating a learner

- value of the returned policy
- time to learn:
  - computational complexity
  - sample complexity:
    - How much data does it need?
    - How much time does it take to interact with the environment and gather that data?