The Pyramid environment
The goal in this environment is to train our agent to get the gold brick on top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top.
The reward function
The reward function is: the agent gets a reward of +2 when it reaches the gold brick, plus a small existential penalty of -0.001 at every step, which pushes it to reach the brick as fast as possible.
In terms of code, the logic looks roughly like this:
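The environment itself is implemented in Unity (C#), so the snippet below is only a minimal Python sketch of that reward logic; the names `GOAL_REWARD`, `STEP_PENALTY`, and `compute_reward` are illustrative, not part of the actual environment code.

```python
# Minimal sketch of the Pyramids reward logic (illustrative Python,
# not the actual Unity C# implementation).
GOAL_REWARD = 2.0       # reward for reaching the gold brick
STEP_PENALTY = -0.001   # existential penalty applied at every step

def compute_reward(reached_gold_brick: bool) -> float:
    """Return the reward for the current step."""
    reward = STEP_PENALTY          # always pay the small per-step penalty
    if reached_gold_brick:
        reward += GOAL_REWARD      # big bonus when the gold brick is reached
    return reward
```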
To train this new agent, which must find the button and then knock over the Pyramid, we’ll use a combination of two types of rewards:
- The extrinsic one given by the environment (the reward function described above).
- But also an intrinsic one called curiosity. This second reward pushes our agent to be curious, in other words, to explore its environment better.
If you want to know more about curiosity, the next section (optional) will explain the basics.
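In practice, ML-Agents treats these as separate reward signals (typically an `extrinsic` and a `curiosity` entry in the trainer configuration), each with its own strength. The snippet below is only a sketch of the weighted-sum idea behind that mixing; the strength values and the `total_reward` helper are illustrative assumptions, not the library's API.

```python
# Sketch: combining the extrinsic reward with the curiosity bonus.
# The strengths below are assumed example values, not official defaults.
EXTRINSIC_STRENGTH = 1.0    # weight of the environment (extrinsic) reward
CURIOSITY_STRENGTH = 0.02   # weight of the intrinsic curiosity bonus

def total_reward(extrinsic_reward: float, curiosity_bonus: float) -> float:
    """Weighted sum of the extrinsic reward and the curiosity bonus."""
    return (EXTRINSIC_STRENGTH * extrinsic_reward
            + CURIOSITY_STRENGTH * curiosity_bonus)
```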
The observation space
In terms of observation, we use 148 raycasts that can each detect objects (switch, bricks, the gold brick, and walls).
We also use a boolean variable indicating the switch state (whether the switch that spawns the Pyramid has been turned on) and a vector that contains the agent’s speed.
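As a rough sketch of how those pieces could be flattened into a single observation vector (the exact layout and ordering inside ML-Agents may differ; `build_observation` and the shapes below are illustrative assumptions):

```python
import numpy as np

N_RAYCASTS = 148  # raycast readings (switch, bricks, gold brick, walls)

def build_observation(raycasts: np.ndarray,
                      switch_on: bool,
                      velocity: np.ndarray) -> np.ndarray:
    """Concatenate the raycast readings, the switch state, and the agent's speed."""
    assert raycasts.shape == (N_RAYCASTS,)
    return np.concatenate([
        raycasts,                       # what the agent "sees"
        np.array([float(switch_on)]),   # has the switch been turned on?
        velocity,                       # agent's speed, e.g. a 3-D vector
    ])
```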
The action space
The action space is discrete with four possible actions: move forward, move backward, rotate left, and rotate right.
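Below is a small illustrative mapping from a discrete action index to the corresponding movement; the index order and the `apply_action` helper are assumptions for the sketch, not the environment's actual code.

```python
# Sketch: mapping a discrete action index to the agent's movement.
# The index order is an assumption; the real mapping lives in the Unity agent.
def apply_action(action: int) -> tuple[float, float]:
    """Return (forward_motion, rotation) for a discrete action index."""
    forward, rotation = 0.0, 0.0
    if action == 0:
        forward = 1.0     # move forward
    elif action == 1:
        forward = -1.0    # move backward
    elif action == 2:
        rotation = -1.0   # rotate left
    elif action == 3:
        rotation = 1.0    # rotate right
    return forward, rotation
```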