
Deep Reinforcement Learning Implementations

This repository contains implementations of various deep reinforcement learning algorithms with experiments on different environments.

Project Overview

Implementations of reinforcement learning algorithms trained on multiple environments:

  • A3C (Asynchronous Advantage Actor-Critic) — Kuka Pick & Place manipulation task
  • A2C & REINFORCE — LunarLander-v2 continuous control task

Repository Structure

MLR/RL/
├── code/
│   ├── A2C/                    # Actor-Critic implementations for LunarLander
│   │   ├── actor.py           # Actor network
│   │   ├── critic.py          # Critic network
│   │   ├── train.py           # Training script
│   │   ├── eval.py            # Evaluation script
│   │   ├── compute_objectives.py  # Loss computation
│   │   ├── utils.py           # Utility functions
│   │   ├── config.json        # Configuration
│   │   ├── checkpoints/       # Trained models
│   │   ├── plots/             # Training curves
│   │   └── videos/            # Evaluation videos
│   └── A3C/                    # A3C implementation for Kuka
│       ├── main.py            # Entry point
│       ├── eval.py            # Evaluation script
│       ├── plot_training.py   # Visualization
│       ├── config/            # Environment and model configs
│       ├── lib/               # A3C algorithm implementation
│       ├── helpers/           # Helper utilities
│       ├── models/            # Trained checkpoints
│       ├── logs/              # Training logs
│       ├── plots/             # Training curves
│       └── requirements.txt   # Dependencies
└── Project3.pdf               # Assignment specification

Implementations

1. A3C on Kuka Pick & Place

Asynchronous Advantage Actor-Critic with 4 parallel workers training on robotic manipulation.

Environment: KukaDiverseObjectEnv (PyBullet)

  • Observation: RGB image (40×40, downsampled from 128×128)
  • Action space: 3D continuous (end-effector control)
  • Task: Pick and place diverse objects
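As a rough illustration of the observation pipeline above (the nearest-neighbor method is an assumption for illustration, not taken from the repository's code), downsampling a 128×128 RGB frame to 40×40 could look like:

```python
import numpy as np

def downsample_observation(frame: np.ndarray, size: int = 40) -> np.ndarray:
    """Nearest-neighbor downsample of an HxWx3 RGB frame to size x size."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size   # source row index for each output row
    cols = np.arange(size) * w // size   # source column index for each output column
    return frame[rows][:, cols]

# Example: a synthetic 128x128 RGB frame
frame = np.zeros((128, 128, 3), dtype=np.uint8)
small = downsample_observation(frame)
print(small.shape)  # (40, 40, 3)
```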

Training Parameters:

  • Episodes: 10,000
  • Workers: 4 (asynchronous)
  • Training time: ~4.5 hours (CPU)
  • Device: CPU (CUDA unavailable)
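The asynchronous part of A3C means each worker pushes gradient updates into shared parameters without waiting for the others (Hogwild-style). A minimal sketch of that pattern with 4 threads and a toy objective (minimizing a sum of squares; the names and objective are illustrative, not the repository's code):

```python
import threading

# Shared parameters that all workers read from and write to.
shared_params = [1.0, -2.0, 0.5]
LR = 0.01

def worker(steps: int) -> None:
    for _ in range(steps):
        # Each worker reads the current shared params, computes a local
        # gradient of f(p) = sum(p_i^2), and applies it without locking.
        grads = [2.0 * p for p in shared_params]
        for i, g in enumerate(grads):
            shared_params[i] -= LR * g

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_params)  # every entry driven toward 0
```

In the real algorithm each worker runs its own environment instance and the shared state is the network's weights, but the lock-free update structure is the same.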

Results:

  • Final average reward: ~0.35–0.38
  • Evaluation (100 episodes): 31% success rate
  • Evaluation average reward: 0.310

A3C Training Curve

2. REINFORCE on LunarLander-v2

Policy gradient method applied to continuous control.

Environment: LunarLander-v2

  • Observation: 8D state vector (position, velocity, angles, contact)
  • Action space: 2D continuous (thrust, rotation)
  • Task: Land the lunar module safely

Training metrics:

  • Episodes trained: 5,000
  • Learning rate: Adaptive scheduling
  • Convergence: ~500–1,000 episodes
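The core REINFORCE update works from full-episode returns: compute the discounted return G_t for every timestep, then minimize −Σ_t log π(a_t|s_t) · G_t. A sketch with placeholder numbers (the log-probabilities here are made up, not outputs of the repository's actor network):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for each step of an episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# REINFORCE loss for one episode: -sum_t log pi(a_t|s_t) * G_t
rewards = [0.0, 0.0, 1.0]          # reward only at the final step
log_probs = [-0.5, -0.7, -0.2]     # placeholder log-probabilities
returns = discounted_returns(rewards)
loss = -sum(lp * g for lp, g in zip(log_probs, returns))
print(returns)  # [0.9801, 0.99, 1.0]
```

In practice the returns are usually normalized across the episode to reduce gradient variance, which is a common reason this method still takes hundreds of episodes to converge.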

REINFORCE Learning Curve

3. A2C on LunarLander-v2

Actor-Critic method combining policy and value function learning.

Environment: LunarLander-v2 (same as REINFORCE)

Architecture:

  • Shared hidden layers: [128, 64]
  • Actor head: outputs action mean and log_std
  • Critic head: outputs state value estimate
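The architecture above can be sketched as a shared two-layer trunk feeding three linear heads. The layer sizes ([128, 64]) and head outputs come from the description; the tanh activations and random initialization are assumptions for illustration, standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 8, 2          # LunarLander: 8D state, 2D action
HIDDEN = [128, 64]               # shared trunk sizes from the description

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(0, 0.1, (OBS_DIM, HIDDEN[0]))
W2 = rng.normal(0, 0.1, (HIDDEN[0], HIDDEN[1]))
W_mean = rng.normal(0, 0.1, (HIDDEN[1], ACT_DIM))
W_logstd = rng.normal(0, 0.1, (HIDDEN[1], ACT_DIM))
W_value = rng.normal(0, 0.1, (HIDDEN[1], 1))

def forward(obs):
    h = np.tanh(obs @ W1)         # shared layer 1
    h = np.tanh(h @ W2)           # shared layer 2
    mean = h @ W_mean             # actor head: action mean
    log_std = h @ W_logstd        # actor head: log standard deviation
    value = (h @ W_value).item()  # critic head: scalar state-value estimate
    return mean, log_std, value

mean, log_std, value = forward(np.zeros(OBS_DIM))
print(mean.shape, log_std.shape, value)  # (2,) (2,) 0.0
```

The actor samples actions from a Gaussian with that mean and exp(log_std), which is how a single network handles a continuous 2D action space.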

Training metrics:

  • Episodes trained: 5,000
  • Workers/parallel processes: 1
  • Convergence: ~1,000–2,000 episodes
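What distinguishes A2C from REINFORCE is the critic: instead of full-episode returns, the policy gradient is weighted by a one-step advantage estimate, and the value head is trained on the squared TD error. A sketch of those two objectives with toy numbers (entropy bonus and loss coefficients omitted; this is not the repository's compute_objectives.py):

```python
gamma = 0.99
reward, done = 1.0, False
value_s, value_next = 0.5, 0.6      # critic outputs V(s), V(s')
log_prob = -0.3                     # log pi(a|s) from the actor

# One-step TD target and advantage: A(s, a) = r + gamma * V(s') - V(s)
td_target = reward + gamma * value_next * (0.0 if done else 1.0)
advantage = td_target - value_s

actor_loss = -log_prob * advantage  # policy-gradient term
critic_loss = advantage ** 2        # squared TD error for the value head
print(advantage, actor_loss, critic_loss)
```

Bootstrapping from V(s') lets the update happen every few steps rather than once per episode, which typically lowers variance relative to REINFORCE.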

A2C Learning Curve

Setup & Installation

A3C (Kuka)

cd code/A3C
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run training:

python main.py

Run evaluation (with GUI):

python eval.py --checkpoint models/a3c_kuka_model_final.pth --episodes 10 --render

Plot training curves:

python plot_training.py

A2C/REINFORCE (LunarLander)

cd code/A2C
pip install -r requirements.txt

Run training:

python train.py

Run evaluation:

python eval.py --checkpoint checkpoints/lunar_lander_actor.pt --episodes 5

A3C Experiment

A3C Diagram

Results Summary

| Algorithm | Environment | Episodes | Success Rate | Avg Reward |
|-----------|-------------|----------|--------------|------------|
| A3C | Kuka Pick & Place | 10,000 | 31% | 0.310 |
| REINFORCE | LunarLander-v2 | 5,000 | ~80%* | -50 to 0 |
| A2C | LunarLander-v2 | 5,000 | ~85%* | -20 to 0 |

*Success rate estimated from a reward threshold; higher scores indicate better performance.