
RL Path Planning#117

Draft
MinhxNguyen7 wants to merge 69 commits into main from path-planning

Conversation

MinhxNguyen7 (Contributor) commented Feb 7, 2025

Description

  • Implementation of path planning using model-free reinforcement learning.
    • Model-free RL is necessary because the trajectory and collision cost functions are not differentiable w.r.t. the actor model (although this conclusion should be challenged).
  • TD3, PPO, and SAC will be tried and compared.
    • The training methods should be interchangeable when using stable_baselines3 and gymnasium.
  • The system outputs waypoints which can either be passed to ArduPilot or to our own controller.
  • This can be used in the future to solve other problems with RL, e.g., mission planning.

Technical Details

State Space

  • Current velocity, acceleration, and jerk.
  • Destination location relative to the current position.
  • Some representation of obstacles.

Action Space

  • The action space is the position delta and velocity between the current and next waypoint.
  • This can then be added to the current position and velocity to be used as world-frame waypoints.
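
A minimal sketch of that transformation; the flat action layout is an assumption, not something the PR fixes:

```python
import numpy as np

def action_to_waypoint(position, velocity, action):
    """Turn a policy action into an absolute world-frame waypoint.

    `action` is assumed to be a flat vector [position delta (3),
    velocity delta (3)] in world coordinates.
    """
    next_position = position + action[:3]
    next_velocity = velocity + action[3:6]
    return next_position, next_velocity

pos, vel = np.zeros(3), np.array([1.0, 0.0, 0.0])
wp_pos, wp_vel = action_to_waypoint(
    pos, vel, np.array([0.5, 0.0, 0.1, 0.2, 0.0, 0.0]))
```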

Reward Function

The reward function is a weighted sum of the functions below.

Trajectory Cost

  • This is an estimate of the "control effort."
  • We estimate this with the weighted sum of velocity, acceleration, and snap.
  • Since we don't actually know the trajectory that our controller will generate, we generate a smooth trajectory by minimizing snap as a proxy.
  • The min-snap trajectory is piecewise-polynomial, so we can find its derivatives and integrate over their squares (to keep the values positive).
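
The "integrate over the squared derivatives" step can be sketched with numpy's polynomial utilities; a real min-snap trajectory would repeat this per polynomial piece and per axis:

```python
import numpy as np
from numpy.polynomial import Polynomial

def derivative_energy(coeffs, order, t0, t1):
    """Integral of the squared order-th derivative of a polynomial
    trajectory segment over [t0, t1], e.g. order=1 for velocity,
    2 for acceleration, 4 for snap."""
    d = Polynomial(coeffs).deriv(order)
    sq = d * d          # squaring keeps the integrand non-negative
    anti = sq.integ()   # antiderivative of the squared derivative
    return anti(t1) - anti(t0)

# p(t) = t^2 -> velocity 2t, squared 4t^2, integral over [0, 1] = 4/3.
e = derivative_energy([0.0, 0.0, 1.0], order=1, t0=0.0, t1=1.0)
```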

Collision Cost

  • The perception of obstacles can be based on raw particles from particle filtering or aggregated Gaussian tracks.

Raw Particles

  • We can pass a sample of raw particles to the models as part of the state.
  • The collision cost would be the minimum distance of the trajectory to each particle.
  • The function of the distance between the trajectory and a particle w.r.t. time is simple.
    • Since each particle has position, velocity, and maybe acceleration, their motion is polynomial over time.
    • Since the trajectory is also a polynomial over time, the function of the distance between a particle and the planned trajectory can be computed by subtracting the functions, i.e., subtracting the polynomial coefficients.
  • However, finding the minimum of that distance function is somewhat non-trivial.
    • We'll just sample it.
  • This allows us to skip track aggregation, but it increases sensitivity to changes in particle distribution.
    • This can happen if we tweak the algorithm or change the detection model.
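
A sketch of that sampled minimum distance, assuming constant-velocity particles and ascending polynomial coefficients per axis (an acceleration estimate would add one more coefficient):

```python
import numpy as np

def min_distance_sampled(traj_coeffs, particle_pos, particle_vel,
                         t1, n_samples=64):
    """Minimum distance between a polynomial trajectory and a
    constant-velocity particle over [0, t1], found by sampling.

    traj_coeffs: float array of shape (3, k); row i holds the ascending
    polynomial coefficients of axis i of the planned trajectory.
    """
    # The particle's motion is itself a degree-1 polynomial, so
    # subtracting coefficients gives the relative-position polynomial.
    k = traj_coeffs.shape[1]
    rel = traj_coeffs.copy()
    rel[:, 0] -= particle_pos
    rel[:, 1] -= particle_vel
    t = np.linspace(0.0, t1, n_samples)
    powers = t[None, :] ** np.arange(k)[:, None]  # (k, n) Vandermonde-style
    positions = rel @ powers                      # relative position, (3, n)
    return float(np.min(np.linalg.norm(positions, axis=0)))

# Example: stationary trajectory at the origin vs. a particle flying through it.
traj = np.zeros((3, 3))  # zero polynomial on each axis
d = min_distance_sampled(traj, np.array([1.0, 0.0, 0.0]),
                         np.array([-1.0, 0.0, 0.0]), t1=2.0, n_samples=65)
```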

Gaussian Tracks

  • We need to aggregate tracks first, which is highly non-trivial and difficult to parallelize.
  • Training another NN might be an option.
  • This needs more thought.

Distance/Time to Target

  • The distance to the target is trivial to compute, as well as differentiable.
    • Therefore, this can skip the critic.
  • We need some cost to encourage time efficiency.
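
Put together, the overall reward might look like the following; the weights are placeholders, since the PR leaves the actual weighting to tuning:

```python
# Hypothetical weights; the PR leaves the actual weighting to tuning.
W_TRAJ, W_COLLISION, W_DIST, W_TIME = 0.1, 10.0, 1.0, 0.05

def reward(traj_cost, collision_cost, dist_to_target, elapsed_time):
    """Weighted sum of the terms above. Costs enter negatively, and a
    per-second time penalty encourages finishing quickly."""
    return -(W_TRAJ * traj_cost
             + W_COLLISION * collision_cost
             + W_DIST * dist_to_target
             + W_TIME * elapsed_time)

r = reward(traj_cost=2.0, collision_cost=0.0,
           dist_to_target=1.0, elapsed_time=4.0)
```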

Tasks

  • Define algorithm(s) and work breakdown.
    • Actor-Critic setup.
    • Reward functions.
      • Trajectory cost.
      • Collision cost.
      • Distance to target reward.
      • Time efficiency cost/reward.
  • Implement min-snap trajectory generation.
  • Implement reward functions.
    • Trajectory cost.
      • Non-linearity to encode velocity/acceleration/jerk limits (optional).
    • Collision cost.
    • Distance/time to target.
  • Implement training system and models.
    • Define and implement gymnasium.Env (representation of environment, i.e., inputs and outputs).
      • State (self trajectory, simulation of other drones).
        • Particle filtering simulation (optional).
      • State-to-observation transformation.
      • Integration/calculation of reward function.
      • Action-state transition.
    • Define and implement models.
      • Actor.
      • Critic.
      • Feature extractor(s).
        • Network(s) shared between actor and critic to comprehend the observation space.
        • Look into deep learning for 3D point clouds/LiDAR.
    • Training setup.
    • Logging and visualizations.
  • Train and tune.
    • Try DDPG.
    • Try PPO.
    • Try SAC.

Test Plan

  • TBA

Issues

Resources

RL Algorithms

Trajectory Interpolation

Libraries

Point-Cloud Feature Extraction

MinhxNguyen7 changed the title Path planning DDPG Path Planning Feb 12, 2025
EricPedley (Member) commented Feb 12, 2025

Since we don't actually know the trajectory that our controller will generate, we generate a smooth trajectory by minimizing snap as a proxy.

Does whatever we're using to generate the min-snap trajectories also take into account the collision avoidance cost? I don't get how using the control cost for a min-snap trajectory helps us here.

Also, on the note of track association, if we figure it out I don't think we need RL, we can use graph of convex sets: https://underactuated.mit.edu/trajopt.html#example9. And for doing track association it might be worth looking into methods for lidar object detection because that's basically the same problem as track association for the particle filter. IDK how hard you expect doing this with RL to be, but if you wanna consider more options before going ahead it might be worth looking into.

MinhxNguyen7 (Author) commented Feb 12, 2025

Does whatever we're using to generate the min-snap trajectories also take into account the collision avoidance cost?

No, but the polynomial trajectory will be used to compute the collision cost, so as long as the min-snap interpolation roughly represents the real-life path of the drone if given those granular waypoints, it should be fine. I.e., the model will learn to work around the interpolator.

if we figure it out I don't think we need RL, we can use graph of convex sets

I've only skimmed the chapter, but these methods only work for static obstacles, right? If so, it would require us to both do track association and form convex regions for the trajectory to optimize around.

it might be worth looking into methods for lidar object detection because that's basically the same problem as track association for the particle filter

I'll take a look, but I think the most promising approach would be to use a NN. Do note, though, that lidar and particle filtering result in substantially different point distributions.

IDK how hard you expect doing this with RL to be, but if you wanna consider more options before going ahead it might be worth looking into.

I don't anticipate the RL being more difficult than setting up these optimizations, especially considering the fact that they're even further outside of my wheelhouse.

Dat-Bois (Contributor) commented Feb 12, 2025

Looks good to me. I'm a bit worried about computational effort, since we may want some form of MPC (most likely very simple local-horizon planning): when going off track, it may be more optimal to regenerate the trajectory from that point than to maneuver back onto the original track. Maybe we can characterize the vehicle dynamics in the cost functions (i.e., inertia matrix, and hard velocity/acceleration constraints; not sure if you had hard constraints), which could produce a more physically trackable trajectory and avoid MPC.

I'm not super familiar with using RL for trajectory generation so it may be much faster than I expect.

MinhxNguyen7 (Author) commented Feb 13, 2025

worried a bit about computational effort

The training might be a little bit computationally involved, but the actor (which generates the waypoints) will be a relatively small neural network which should be much smaller than YOLO. I wouldn't expect generating a whole trajectory with many points to take more time than one image detection. The plan is to continuously regenerate the trajectory from the current position to the destination.

Maybe we can characterize the vehicle dynamics into the cost functions (ie inertial matrix, velocity and accel hard constraints (not sure if you had hard constraints))

Yes, the "control cost" will be a weighted sum of the velocity and acceleration. I can make the cost non-linear to reflect hard-ish limits, which I think would be better than a cost-cliff.
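
One option for that non-linearity (purely illustrative; the actual shape is open) is a softplus-style penalty that stays near zero below the limit and grows roughly linearly past it:

```python
import numpy as np

def soft_limit_cost(value, limit, sharpness=10.0):
    """Smooth penalty: ~0 below `limit`, ~linear growth above it.
    A differentiable alternative to a hard cost cliff."""
    # softplus((|value| - limit) * sharpness) / sharpness,
    # computed via logaddexp for numerical stability.
    x = (abs(value) - limit) * sharpness
    return float(np.logaddexp(0.0, x)) / sharpness

low = soft_limit_cost(1.0, limit=5.0)    # well under the limit: ~0
high = soft_limit_cost(10.0, limit=5.0)  # 5 over the limit: ~5
```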

EricPedley marked this pull request as draft May 2, 2025 15:43