Markov Property (MP)
- The probability of reaching s′ from s only depends on s, not on the history of earlier states (formalized below)
- Foundation of the Bellman equations (mentioned later)
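In symbols, the Markov property says that the next-state distribution conditioned on the whole history equals the distribution conditioned on the current state alone:

$$ P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, s_2, \ldots, s_t) $$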
Markov Decision Processes (MDPs)
![](https://datawhalechina.github.io/easy-rl/chapter2/img/2.2.png)
The mathematical description of reinforcement learning is based on the Markov Decision Process, which can be described as the tuple (S, A, R, P, ρ₀), where:
- S: the set of all possible states
- A: the set of all possible actions
- R: the reward function
- P: the state transition function; P(s′ | s, a) is the probability of reaching s′ after taking action a in state s
- ρ₀: the initial state distribution
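To make the tuple concrete, the sketch below writes a tiny tabular MDP as plain NumPy arrays. The two-state, two-action numbers are invented for illustration and are not from the original text.

```python
import numpy as np

# A minimal tabular MDP (S, A, R, P, rho0) stored as plain arrays.
# Hypothetical two-state, two-action example for illustration only.
states = [0, 1]           # S: set of all possible states
actions = [0, 1]          # A: set of all possible actions

# R[s, a]: reward for taking action a in state s
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# P[s, a, s']: probability of landing in s' after taking a in s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

# rho0[s]: initial state distribution
rho0 = np.array([1.0, 0.0])

assert np.allclose(P.sum(axis=-1), 1.0)  # each P(.|s,a) is a distribution
assert np.isclose(rho0.sum(), 1.0)
```

The same arrays are reused in the Bellman-equation sketches further down.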
Discounted Return
Trajectory
- A trajectory τ is a sequence of states and actions
$$ \tau = (s_0, a_0, s_1, a_1, \cdots) $$
Reward and Return
- The reward is the feedback for a single step, while the return is the cumulative reward over a whole trajectory
- Usually, we apply a discount factor to the return, hence:
$$ R(\tau) = \sum_{t=0}^{\infty}\gamma^t r_t $$
Why do we need a discount factor?
Cash now is better than cash later: immediate rewards are worth more than distant ones, and with γ < 1 the infinite sum stays finite.
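As a quick sketch (the helper name `discounted_return` is mine, not from the text): for a finite trajectory the sum above is simply truncated at the trajectory length, and with γ < 1 later rewards contribute less.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = sum_t gamma^t * r_t for a finite list of rewards."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# The same rewards are worth less the later they arrive:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```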
Value Function
State Value function V^π(s): estimates how good it is for the agent to be in a given state under policy π
$$ V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s] $$
Action Value function Q^π(s, a): estimates how good it is for the agent to be in a given state and take a given action under policy π
$$ Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a] $$
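Both value functions are expectations over trajectories, so they can be estimated by Monte Carlo rollouts: run the policy from the given state many times and average the discounted returns. A minimal sketch, assuming hypothetical `policy(s)` and `step(s, a)` callables that are not defined in the original text:

```python
import numpy as np

def mc_state_value(s0, policy, step, gamma=0.99, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s0) = E_{tau~pi}[R(tau) | s_0 = s0].

    `policy(s)` samples an action, `step(s, a)` returns (s', r, done);
    both are assumed interfaces for this sketch.
    """
    returns = []
    for _ in range(n_episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = step(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return np.mean(returns)
```

Q^π(s, a) can be estimated the same way by fixing the first action to a and following the policy afterwards.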
Bellman Equation
- Estimating value functions is a frequent task in reinforcement learning, and the Bellman equation is the tool for doing it. The basic idea behind the Bellman equation is: Current Value = Current Reward + Future Value
Recursive Relation in Reinforcement Learning
- A fundamental property of value functions used throughout reinforcement learning is that they satisfy particular recursive relationships. The backup operation transfers value information back to a state (or state-action pair) from its successor states (or state-action pairs).
Backup Diagram
As the backup diagram illustrates, the calculation of the state-value function can be decomposed into two steps:
Before the derivation: black dots stand for state-action pairs, white dots stand for states. Lines in B stand for the policy probabilities π(a|s); lines in C stand for the state transition probabilities P(s′|s, a).
In diagram B:
$$ V^\pi(s) = \sum_{a \in A} \pi(a|s) Q^\pi(s,a) $$
In diagram C:
$$ Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') $$
Merging equations B and C, we get one form of the Bellman equation for V^π:
$$ V^\pi(s) = \sum_{a \in A} \pi(a|s) \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \right) $$
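These two backups translate directly into iterative policy evaluation on a tabular MDP: repeatedly replace V(s) by the right-hand side of the merged equation until it stops changing. A sketch assuming the array layout from the earlier MDP example (`P[s, a, s']`, `R[s, a]`) plus a policy matrix `pi[s, a]`; the names are mine, not from the source.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    """Iterate V(s) <- sum_a pi(a|s) (R(s,a) + gamma * sum_s' P(s'|s,a) V(s')).

    P: [S, A, S'] transitions, R: [S, A] rewards, pi: [S, A] policy probabilities.
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V         # diagram C: Q(s,a) from successor values
        V_new = (pi * Q).sum(axis=1)  # diagram B: V(s) as policy-weighted Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```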
Summary
The Bellman equations define a connection between the value of the current state and the values of future states.
The simplified forms of the Bellman equations are given below:
$$ \begin{cases}
V^\pi(s)=\mathbb{E}[r(s,a)+\gamma V^\pi(s')] \\
Q^\pi(s,a)=\mathbb{E}[r(s,a)+\gamma \mathbb{E}[Q^\pi(s',a')]] \\
V^*(s)=\max_{a}\mathbb{E}[r(s,a)+\gamma V^*(s')] \\
Q^*(s,a)=\mathbb{E}[r(s,a)+\gamma \max_{a'}Q^*(s',a')]
\end{cases} $$
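The two optimality equations (the last two rows) turn into value iteration almost verbatim: swap the expectation over the policy for a max over actions. A sketch on the same assumed tabular arrays as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V   # Bellman optimality backup for Q
        V_new = Q.max(axis=1)   # V*(s) = max_a Q*(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            # Greedy policy read off from the (near-)optimal Q values
            return V_new, Q.argmax(axis=1)
        V = V_new
```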