- On-policy, parallelized RL with the policy update and data collection both on the GPU
- Uses PPO to learn to walk on challenging terrains while following base heading and linear-velocity commands
- Key hyper-parameter: batch size $B = n_{robots} \cdot n_{steps}$ ($n_{steps}$ cannot be made arbitrarily small, since the algorithm needs rewards from multiple time steps to be effective)
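  A minimal sizing sketch of how such a batch could be laid out on the GPU; the environment count, step count, and observation split below are illustrative assumptions, not values from these notes:

  ```python
  import torch

  # Assumed, illustrative values: many parallel robots, few steps each.
  n_robots = 4096          # parallel simulated robots on the GPU
  n_steps = 24             # steps collected per robot before each policy update
  obs_dim = 48 + 108       # proprioceptive states + terrain measurements (assumed split)

  B = n_robots * n_steps   # total transitions per PPO update

  device = "cuda" if torch.cuda.is_available() else "cpu"

  # Rollout storage kept entirely on the GPU, so data collection and the
  # policy update never leave the device.
  obs_buffer = torch.zeros(n_steps, n_robots, obs_dim, device=device)

  # PPO consumes the buffer in shuffled mini-batches of size B / num_mini_batches.
  num_mini_batches = 4
  mini_batch_size = B // num_mini_batches
  ```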
- Game-Inspired Curriculum (a minimal sketch follows after this list):
→ Advance robots that have learned one terrain level to a harder one
→ Start training on the easiest level of every terrain type and update the difficulty based on performance
→ Robots solving the highest level are looped back to a randomly selected level to increase diversity and avoid catastrophic forgetting
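  A minimal sketch of such a curriculum update, assuming a per-robot difficulty level and a distance-based success criterion; the thresholds and function names are assumptions:

  ```python
  import numpy as np

  n_robots, max_level = 4096, 9                 # assumed number of robots and terrain levels
  levels = np.zeros(n_robots, dtype=int)        # every robot starts on the easiest level

  def update_curriculum(levels, distance_walked, commanded_distance):
      """Promote robots that solved their level, demote those that failed."""
      solved = distance_walked > 0.8 * commanded_distance   # assumed success criterion
      failed = distance_walked < 0.4 * commanded_distance   # assumed failure criterion
      levels = levels + solved.astype(int) - failed.astype(int)

      # Robots that beat the hardest level are sent back to a random level,
      # keeping terrain diversity and avoiding catastrophic forgetting.
      graduated = levels > max_level
      levels[graduated] = np.random.randint(0, max_level + 1, size=graduated.sum())
      return np.clip(levels, 0, max_level)
  ```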
- States: base linear and angular velocity $(v, \omega)$, joint positions and velocities, the gravity vector, the policy's previous actions, and 108 terrain distance measurements relative to the robot's base (sampled on a grid around the base)
- Actions: desired joint positions, so there is no dependence on a predefined gait
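  A minimal sketch of assembling this observation vector and mapping policy actions to joint-position targets; the tensor layout, action scaling, and default joint posture are assumptions:

  ```python
  import torch

  def build_observation(base_lin_vel, base_ang_vel, joint_pos, joint_vel,
                        gravity_vec, prev_actions, terrain_measurements):
      # terrain_measurements: 108 distances sampled on a grid around the base
      return torch.cat([base_lin_vel, base_ang_vel, joint_pos, joint_vel,
                        gravity_vec, prev_actions, terrain_measurements], dim=-1)

  def action_to_joint_target(action, default_joint_pos, action_scale=0.5):
      # The policy outputs desired joint positions (offsets from a default
      # posture in this sketch), so no gait structure is imposed on the motion.
      return default_joint_pos + action_scale * action
  ```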
- Sim2Real: randomize friction, add noise to the states, and randomly push the robot during training
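  A minimal sketch of these randomizations, assuming per-environment friction sampling, additive observation noise, and occasional base-velocity impulses; all ranges and probabilities are assumptions:

  ```python
  import torch

  def randomize_friction(num_envs, low=0.5, high=1.25):
      # One friction coefficient per environment, resampled at reset (assumed range).
      return torch.empty(num_envs).uniform_(low, high)

  def add_state_noise(obs, noise_scale=0.05):
      # Additive Gaussian noise on the observed states (assumed scale).
      return obs + noise_scale * torch.randn_like(obs)

  def random_push(base_lin_vel, push_prob=0.01, max_push_vel=1.0):
      # Occasionally add a random velocity impulse to the base to emulate pushes.
      pushed = torch.rand(base_lin_vel.shape[0], device=base_lin_vel.device) < push_prob
      impulse = (torch.rand_like(base_lin_vel) * 2 - 1) * max_push_vel
      return torch.where(pushed.unsqueeze(-1), base_lin_vel + impulse, base_lin_vel)
  ```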


