
Understanding the dynamics of the Markov decision process

In this post, I provide an illustrative explanation of the transition probabilities that govern the sequences of (state, action, reward, next state) tuples experienced by a reinforcement learning agent. These probabilities define the dynamics of the Markov decision process (MDP) and may seem intimidating at first (they did to me!), but here I break them down and explain them visually. This note accompanies the first section of Chapter 3 of the textbook by Sutton & Barto and gives an additional visual insight that I was missing when reading this chapter.

We consider a simple 2D grid world environment, where the agent can execute one of the four actions: go up, go down, go right, go left.

From deterministic to stochastic environments

Deterministic environments

In a deterministic reinforcement learning environment, executing an action $a \in \mathcal{A}$ results in a state transition from $s$ to $s'$ with certainty, where $s, s' \in \mathcal{S}$. That is, in our simple 2D grid world with four actions, executing the action “go up” is guaranteed to result in a transition to the state directly above the current state. This is presented in the figure below:
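As a complementary view to the figure, here is how such a deterministic transition could look in code. This is only a minimal sketch under my own assumptions (a 4×4 grid, states as (row, col) tuples, and the agent staying in place when it would step off the grid); the book does not prescribe any particular implementation.

```python
# A minimal sketch of deterministic state transitions, assuming a 4x4 grid
# whose states are (row, col) tuples; the coordinate convention and the
# "stay in place at the edge" behaviour are my own assumptions.
ACTIONS = {
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (0, 1),
}

GRID_SIZE = 4  # assumed grid dimensions


def step_deterministic(state, action):
    """Return the next state with certainty, i.e. p(s' | s, a) = 1."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    next_row = min(max(row + d_row, 0), GRID_SIZE - 1)
    next_col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return (next_row, next_col)


print(step_deterministic((2, 1), "up"))  # always prints (1, 1)
```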

Environments with stochastic state transitions

But reinforcement learning environments don’t have to behave deterministically! The beauty of the MDP is that we can implement stochastic environments. In other words, we can assign probabilities of transitioning to other nearby states, even if the action selected was to “go up”! This can, for example, model the malfunctioning of an electronic robot. The figure below presents an environment with stochastic state transitions. The action “go up” will only be correctly executed with 90% probability. With 4% probability each, the next state, $s'$, will be the one to the right or to the left of the current state, and with 2% probability the next state will be the state below the current state.
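These stochastic transitions can be sketched in code as well. A minimal sketch, using the probabilities listed above (the helper names are mine):

```python
import random

# A sketch of the stochastic state transitions from the figure: the intended
# action "up" is executed correctly with probability 0.9, the agent slips to
# the left or right with probability 0.04 each, and moves down with
# probability 0.02.
TRANSITION_NOISE = {
    "up": [("up", 0.9), ("left", 0.04), ("right", 0.04), ("down", 0.02)],
}


def sample_actual_move(intended_action):
    """Sample the move that actually happens, given the intended action."""
    moves, probs = zip(*TRANSITION_NOISE[intended_action])
    return random.choices(moves, weights=probs, k=1)[0]


# Roughly 90% of the samples should be "up".
samples = [sample_actual_move("up") for _ in range(10_000)]
print(samples.count("up") / len(samples))
```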

Oh, stochastic state transitions, how you remind me of the teenage me… My parents could only predict with a certain probability that I would behave the way they asked me to.

Environments with stochastic state transitions and stochastic rewards

Even more so, reinforcement learning environments don’t have to provide deterministic rewards, $r \in \mathcal{R}$, for state transitions! Suppose that we are indeed transitioning to the state directly above the current state. A deterministic reward for executing this transition would always provide $r = 1$. But in an MDP, we want to model the possibility of the reward system malfunctioning too! And hence, with 80% probability we will receive the reward $r = 1$, but with 15% probability we will get $r = 0$, and with 5% probability our reward will be $r = -1$.
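The stochastic reward for the transition to the state above can be sketched the same way; the values follow the example just given, and the helper names are mine:

```python
import random

# A sketch of the stochastic rewards described above: even when the agent
# does move to the state directly above, the reward itself is random.
REWARD_DISTRIBUTION = [(1, 0.80), (0, 0.15), (-1, 0.05)]


def sample_reward():
    """Sample the reward received for the transition to the state above."""
    rewards, probs = zip(*REWARD_DISTRIBUTION)
    return random.choices(rewards, weights=probs, k=1)[0]


# The expected reward is 0.8 * 1 + 0.15 * 0 + 0.05 * (-1) = 0.75.
print(sum(sample_reward() for _ in range(10_000)) / 10_000)
```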

…but to be fair, my parents also didn’t respond to my teenage behaviors the same way every time. Which is why I occasionally dodged a scolding.

Understanding the probabilities involved

In a deterministic environment,

$$p(s' \mid s, a) = 1.$$

You can read $p(s' \mid s, a)$ as the conditional probability of reaching state $s'$, given that the current state is $s$ and the action selected is $a$.

In an environment with stochastic transition probabilities but with deterministic rewards, we can write

$$p(s' \mid s, a) \in [0, 1].$$

And in the most general case, an environment with stochastic state transitions and stochastic rewards, we can write

$$p(s', r \mid s, a) \in [0, 1],$$

where you can read $p(s', r \mid s, a)$ as the joint conditional probability of reaching state $s'$ and receiving reward $r$, given that the current state is $s$ and the action selected is $a$.
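One way to picture this four-argument dynamics function is as a lookup table that, for the current state and the action “go up”, assigns a probability to every (next state, reward) pair. A minimal sketch with made-up state labels (“above”, “left”, …), combining the transition and reward probabilities from the figures and assuming, purely for brevity, a deterministic reward of 0 for the slip transitions:

```python
# A sketch of the joint dynamics p(s', r | s, a) for one (state, action) pair,
# using the numbers from the figures. The rewards for the slip transitions
# ("left", "right", "below") are assumed deterministic (r = 0) just to keep
# the example small.
DYNAMICS = {
    # (next_state, reward): probability
    ("above", 1):  0.9 * 0.80,
    ("above", 0):  0.9 * 0.15,
    ("above", -1): 0.9 * 0.05,
    ("left", 0):   0.04,
    ("right", 0):  0.04,
    ("below", 0):  0.02,
}


def p(next_state, reward):
    """Joint conditional probability p(s', r | s, a) for the current (s, a)."""
    return DYNAMICS.get((next_state, reward), 0.0)


print(round(p("above", 1), 6))  # 0.9 * 0.80 = 0.72
```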

Some consequences

In the most general case, an environment with stochastic state transitions and stochastic rewards, we have that

$$\sum_{r \in \mathcal{R}} p(s', r \mid s, a) = p(s' \mid s, a).$$

This is easy to show by looking at the third figure where, for example, for the transition from the current state to the state directly above it, we have

$$\sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 0.8 \cdot 0.9 + 0.15 \cdot 0.9 + 0.05 \cdot 0.9 = 0.9 = p(s' \mid s, a),$$

because the probabilities of all possible stochastic rewards associated with a particular state transition have to sum to one (here, $0.8 + 0.15 + 0.05 = 1$).
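Using the reward probabilities from above, this sum can also be checked numerically; a minimal sketch (the variable names are mine):

```python
# A numerical restatement of the sum above: each joint probability is the
# reward probability times the 0.9 transition probability, and summing over
# the rewards recovers p(s' | s, a) = 0.9 for the state directly above.
reward_probs = {1: 0.80, 0: 0.15, -1: 0.05}  # rewards for the "above" transition
p_next_above = 0.9

joint = {r: p_r * p_next_above for r, p_r in reward_probs.items()}  # p(s', r | s, a)
print(round(sum(joint.values()), 6))  # 0.9
```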

Of course, a further consequence is that

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1,$$

which simply extends the earlier equation by also summing over all possible state transitions. The probabilities of all state transitions from any current state also have to sum to one, which is the case in our simple grid world,

$$\sum_{s' \in \mathcal{S}} p(s' \mid s, a) = 0.9 + 0.04 + 0.04 + 0.02 = 1.$$
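Both normalizations can be checked with the grid-world numbers. Again this is only a sketch, under the earlier assumption that the slip transitions carry a deterministic reward of 0:

```python
# Checking both normalizations for our grid world: the transition
# probabilities p(s' | s, a) sum to one over the four possible next states,
# and the joint probabilities p(s', r | s, a) sum to one over next states
# and rewards (the reward is only stochastic for the "above" transition).
transition_probs = {"above": 0.9, "left": 0.04, "right": 0.04, "below": 0.02}
reward_probs_above = {1: 0.80, 0: 0.15, -1: 0.05}

print(round(sum(transition_probs.values()), 6))  # 1.0

joint_total = sum(
    transition_probs["above"] * p_r for p_r in reward_probs_above.values()
)
joint_total += transition_probs["left"] + transition_probs["right"] + transition_probs["below"]
print(round(joint_total, 6))  # 1.0
```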