- On-policy, parallelized RL with policy updates and data collection all on the GPU
- Uses PPO to learn to walk on challenging terrains while following base heading and linear-velocity commands
- Hyper-parameter: batch size $B = n_{\text{robots}} \cdot n_{\text{steps}}$ ($n_{\text{steps}}$ cannot be made arbitrarily low; the algorithm needs rewards from multiple consecutive time steps to be effective) — see the sketch below
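A minimal sketch (not the paper's code; all sizes are illustrative placeholders) of how a batch of size $B$ is assembled from parallel rollouts that stay on the GPU:

```python
import torch

n_robots = 4096      # parallel environments simulated on the GPU (placeholder count)
n_steps = 24         # per-robot rollout length; too small and PPO sees too few
                     # consecutive rewards to estimate advantages reliably
batch_size = n_robots * n_steps          # B = n_robots * n_steps

obs_dim, act_dim = 235, 12               # placeholder dimensions

# Rollout storage lives on the GPU alongside the simulator state,
# so data collection and the policy update never copy data to the CPU.
obs_buf = torch.zeros(n_steps, n_robots, obs_dim, device="cuda")
act_buf = torch.zeros(n_steps, n_robots, act_dim, device="cuda")
rew_buf = torch.zeros(n_steps, n_robots, device="cuda")

# After collecting n_steps transitions per robot, flatten into one PPO batch of size B.
batch_obs = obs_buf.reshape(batch_size, obs_dim)
```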
- Game-Inspired Curriculum (sketched in code after this list):
→ Advance robots that have learned one terrain level to a harder one
→ Start training on the easiest level of every terrain type and update the difficulty based on performance
→ Robots that solve the highest level are looped back to a randomly selected level to increase diversity and avoid catastrophic forgetting
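A rough sketch of the curriculum update described above; the level count and helper names are my paraphrase, not the released implementation:

```python
import torch

n_robots, max_level = 4096, 9   # placeholder values
# every robot starts on the easiest level of its terrain type
terrain_level = torch.zeros(n_robots, dtype=torch.long, device="cuda")

def update_curriculum(terrain_level, solved, failed):
    """solved/failed: boolean tensors (one entry per robot) evaluated at episode reset."""
    # move up on success, down on failure, never below the easiest level
    terrain_level = terrain_level + solved.long() - failed.long()
    terrain_level.clamp_(min=0)
    # robots that beat the hardest level are sent back to a random level
    # to keep terrain diversity and avoid catastrophic forgetting
    graduated = terrain_level > max_level
    terrain_level[graduated] = torch.randint(
        0, max_level + 1, (int(graduated.sum()),), device=terrain_level.device
    )
    return terrain_level
```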
- States: base linear and angular velocities $(v, \omega)$, joint positions and velocities, the gravity vector, the policy's previous actions, and 108 terrain measurements (distances from the base to the ground, sampled on a grid around the base)
- Actions: desired joint positions, so there is no dependence on a predefined gait (interface sketched below)
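An illustrative sketch of the observation/action interface implied by these notes; the dimensions, helper names, and action scale are assumptions, not the paper's code:

```python
import torch

def build_observation(base_lin_vel, base_ang_vel, projected_gravity,
                      joint_pos, joint_vel, prev_actions, terrain_samples):
    # terrain_samples: 108 measurements from a grid around the base
    return torch.cat([base_lin_vel,       # (n_robots, 3)
                      base_ang_vel,       # (n_robots, 3)
                      projected_gravity,  # (n_robots, 3)
                      joint_pos,          # (n_robots, 12)
                      joint_vel,          # (n_robots, 12)
                      prev_actions,       # (n_robots, 12)
                      terrain_samples],   # (n_robots, 108)
                     dim=-1)

# Actions are desired joint positions, so the policy is free to discover
# its own gait instead of modulating a hand-designed one.
def apply_actions(actions, default_joint_pos, action_scale=0.5):
    return default_joint_pos + action_scale * actions   # joint position targets
```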
- Sim2Real: randomize friction, add noise to the observed states, and randomly push the robot during training (sketch below)
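A minimal sketch of the listed sim-to-real randomizations; the ranges, probabilities, and function names are placeholders, not the paper's exact values:

```python
import torch

def randomize_friction(n_robots, low=0.5, high=1.25, device="cuda"):
    # resample a friction coefficient per robot at episode reset
    return torch.empty(n_robots, device=device).uniform_(low, high)

def noisy_observation(obs, noise_scale=0.01):
    # additive noise on the states the policy sees
    return obs + noise_scale * torch.randn_like(obs)

def random_push(base_lin_vel, push_prob=0.02, max_push_vel=1.0):
    # occasionally kick the base velocity to force recovery behaviour
    pushed = torch.rand(base_lin_vel.shape[0], device=base_lin_vel.device) < push_prob
    kick = (torch.rand_like(base_lin_vel) * 2 - 1) * max_push_vel
    return torch.where(pushed.unsqueeze(-1), base_lin_vel + kick, base_lin_vel)
```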