Two main approaches for solving RL problems
In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?
The Policy π: the agent’s brain
The Policy π is the brain of our Agent: it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.
This Policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.
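As a rough formalization (using standard notation that this section does not define: γ for the discount rate, R for the reward received at each step), the optimal policy is the one that maximizes the expected discounted return:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \right]
```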
There are two approaches to train our agent to find this optimal policy π*:
- Directly, by teaching the agent to learn which action to take, given the current state: Policy-Based Methods.
- Indirectly, by teaching the agent to learn which states are more valuable and then take the actions that lead to the more valuable states: Value-Based Methods.
Policy-Based Methods
In Policy-Based methods, we learn a policy function directly.
This function defines a mapping from each state to the best corresponding action. Alternatively, it can define a probability distribution over the set of possible actions at that state.
We have two types of policies (see the sketch after this list):
- Deterministic: a policy at a given state will always return the same action.
- Stochastic: outputs a probability distribution over actions.
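To make the distinction concrete, here is a minimal Python sketch. The tiny state-to-action table and the fixed action probabilities are invented for illustration; a learned policy would compute them from the state (for example, with a neural network).

```python
import random

def deterministic_policy(state):
    """Deterministic policy: a given state always maps to the same action."""
    # Hand-written illustrative mapping, not a learned policy.
    state_to_action = {0: "right", 1: "right", 2: "down"}
    return state_to_action[state]

def stochastic_policy(state):
    """Stochastic policy: outputs a probability distribution over actions
    and samples an action from it."""
    # Fixed illustrative distribution; a learned policy would compute
    # these probabilities from the state.
    probs = {"left": 0.1, "right": 0.7, "up": 0.1, "down": 0.1}
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    return probs, action

print(deterministic_policy(0))   # always "right" for state 0
print(stochastic_policy(0))      # the distribution plus one sampled action
```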
If we recap: in policy-based methods, we learn the policy function directly, and it outputs either a single action (deterministic) or a probability distribution over actions (stochastic).
Value-Based Methods
In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.
The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.
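Written out (again assuming the standard notation for rewards R and the discount rate γ, which this section does not introduce), the value of a state s under a policy π is:

```latex
v_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_{t} = s \right]
```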
“Act according to our policy” just means that our policy is “going to the state with the highest value”.
The value function defines a value for each possible state. Thanks to it, at each step our policy selects the state with the biggest value given by the value function: -7, then -6, then -5 (and so on) until it reaches the goal, as illustrated in the sketch below.
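Here is a rough illustration of that behavior. The one-dimensional corridor, its hard-coded values (following the -7, -6, -5, … pattern above), and the transition function are all invented for the example; the point is only the greedy "move to the highest-valued reachable state" rule.

```python
# Hypothetical corridor: the goal is at the right end (state 7).
# State values decrease with distance from the goal, as in the -7, -6, -5 ... example.
state_values = {0: -7, 1: -6, 2: -5, 3: -4, 4: -3, 5: -2, 6: -1, 7: 0}

def neighbors(state):
    """Invented transition function: from each cell you can step left or right."""
    return [s for s in (state - 1, state + 1) if s in state_values]

def greedy_policy(state):
    """Value-based control: move to the reachable state with the highest value."""
    return max(neighbors(state), key=lambda s: state_values[s])

state = 0
path = [state]
while state != 7:
    state = greedy_policy(state)
    path.append(state)

print(path)  # [0, 1, 2, ..., 7]: the agent follows values -7, -6, -5, ... to the goal
```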
If we recap: in value-based methods, we don’t learn the policy directly; instead, we learn a value function, and the policy simply selects the actions that lead to the states with the highest values.