- On-policy, parallelized RL with the policy update and data collection both on the GPU
- Uses PPO to learn to walk on challenging terrains while following base heading and linear-velocity commands
- Key hyper-parameter: batch size $B = n_{robots} \cdot n_{steps}$ ($n_{steps}$ cannot be made arbitrarily small, since the algorithm needs rewards from multiple time steps to be effective)
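  A minimal sizing sketch of how such a batch could be laid out on the GPU; the environment count, step count, and observation split below are illustrative assumptions, not values from these notes:

  ```python
  import torch

  # Assumed, illustrative values: many parallel robots, few steps each.
  n_robots = 4096          # parallel simulated robots on the GPU
  n_steps = 24             # steps collected per robot before each policy update
  obs_dim = 48 + 108       # proprioceptive states + terrain measurements (assumed split)

  B = n_robots * n_steps   # total transitions per PPO update

  device = "cuda" if torch.cuda.is_available() else "cpu"

  # Rollout storage kept entirely on the GPU, so data collection and the
  # policy update never leave the device.
  obs_buffer = torch.zeros(n_steps, n_robots, obs_dim, device=device)

  # PPO consumes the buffer in shuffled mini-batches of size B / num_mini_batches.
  num_mini_batches = 4
  mini_batch_size = B // num_mini_batches
  ```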
- Game-Inspired Curriculum (a minimal sketch follows after this list):
→ Advance robots that have learned one terrain level to a harder one
→ Start training on the easiest level of every terrain type and update the difficulty based on performance
→ Robots solving the highest level are looped back to a randomly selected level to increase diversity and avoid catastrophic forgetting
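  A minimal sketch of such a curriculum update, assuming a per-robot difficulty level and a distance-based success criterion; the thresholds and function names are assumptions:

  ```python
  import numpy as np

  n_robots, max_level = 4096, 9                 # assumed number of robots and terrain levels
  levels = np.zeros(n_robots, dtype=int)        # every robot starts on the easiest level

  def update_curriculum(levels, distance_walked, commanded_distance):
      """Promote robots that solved their level, demote those that failed."""
      solved = distance_walked > 0.8 * commanded_distance   # assumed success criterion
      failed = distance_walked < 0.4 * commanded_distance   # assumed failure criterion
      levels = levels + solved.astype(int) - failed.astype(int)

      # Robots that beat the hardest level are sent back to a random level,
      # keeping terrain diversity and avoiding catastrophic forgetting.
      graduated = levels > max_level
      levels[graduated] = np.random.randint(0, max_level + 1, size=graduated.sum())
      return np.clip(levels, 0, max_level)
  ```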
- States: base linear and angular velocity $(v, \omega)$, joint positions and velocities, the gravity vector, the policy's previous actions, and 108 terrain distance measurements relative to the robot's base (sampled on a grid around the base)
- Actions: desired joint positions, so there is no dependence on a predefined gait
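  A minimal sketch of assembling this observation vector and mapping policy actions to joint-position targets; the tensor layout, action scaling, and default joint posture are assumptions:

  ```python
  import torch

  def build_observation(base_lin_vel, base_ang_vel, joint_pos, joint_vel,
                        gravity_vec, prev_actions, terrain_measurements):
      # terrain_measurements: 108 distances sampled on a grid around the base
      return torch.cat([base_lin_vel, base_ang_vel, joint_pos, joint_vel,
                        gravity_vec, prev_actions, terrain_measurements], dim=-1)

  def action_to_joint_target(action, default_joint_pos, action_scale=0.5):
      # The policy outputs desired joint positions (offsets from a default
      # posture in this sketch), so no gait structure is imposed on the motion.
      return default_joint_pos + action_scale * action
  ```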
- Sim2Real: randomize friction, add noise to the states, and randomly push the robot during training
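  A minimal sketch of these randomizations, assuming per-environment friction sampling, additive observation noise, and occasional base-velocity impulses; all ranges and probabilities are assumptions:

  ```python
  import torch

  def randomize_friction(num_envs, low=0.5, high=1.25):
      # One friction coefficient per environment, resampled at reset (assumed range).
      return torch.empty(num_envs).uniform_(low, high)

  def add_state_noise(obs, noise_scale=0.05):
      # Additive Gaussian noise on the observed states (assumed scale).
      return obs + noise_scale * torch.randn_like(obs)

  def random_push(base_lin_vel, push_prob=0.01, max_push_vel=1.0):
      # Occasionally add a random velocity impulse to the base to emulate pushes.
      pushed = torch.rand(base_lin_vel.shape[0], device=base_lin_vel.device) < push_prob
      impulse = (torch.rand_like(base_lin_vel) * 2 - 1) * max_push_vel
      return torch.where(pushed.unsqueeze(-1), base_lin_vel + impulse, base_lin_vel)
  ```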


