How to inspect your deep Q-learning?

Training reinforcement learning (RL) agents can be difficult. Even if you coded the RL algorithm correctly, getting it to train well on a specific environment can take some work. This mostly stems from the large number of hyper-parameters that we can tweak, which can alter:

(1) the dynamics of the agent navigating this environment,

(2) the dynamics of gradient descent.

The two are in constant interplay. The way the agent explores the environment during training affects what the agent learns from that experience. And the way that learning progresses affects the way in which the agent navigates the environment in the future.

Hence, it’s useful to understand a couple of key indicators that you can look at during training, which can guide your hyper-parameter choice.

In this post, we’ll use a simple instance of training a deep Q-learning RL agent (DqnAgent from TF-Agents). We’ll use a 6-by-4 grid world environment where the agent (blue) moves towards the target (red) tile and, once it reaches it, receives a +1 reward. Any other transition results in a 0 reward. The state is described by the tuple (agent’s position, target’s position).
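For concreteness, here is a minimal sketch of how such an agent could be assembled with TF-Agents. The `GridWorldEnv` class, the layer sizes, and the hyper-parameter values are illustrative assumptions on my part, not the exact code behind the figures in this post:

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

# GridWorldEnv is a hypothetical py_environment.PyEnvironment implementing
# the 6-by-4 grid world described above; it is not shown here.
train_py_env = GridWorldEnv()
train_env = tf_py_environment.TFPyEnvironment(train_py_env)

# A small Q-network: the state (agent position, target position) in, four Q-values out.
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(64, 64),  # illustrative layer sizes
)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # illustrative value
    td_errors_loss_fn=common.element_wise_squared_loss,
    epsilon_greedy=0.1,  # exploration probability; decayed over training, as discussed below
    gamma=0.9,           # discount factor; illustrative value
)
agent.initialize()
```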

Rewards should be higher with a better policy

The first indication that your policy is improving with training is that, as it’s queried later in the training, it yields higher rewards. To check this, you can decay the exploration probability over training time. For example, you may decay the exploration probability, \(\varepsilon\), from, say, 0.5 at the beginning of training to 0.0 at the end of training, and even let it stay at zero for a couple of final episodes. As \(\varepsilon\) decreases with training time, your learned policy is queried more and more frequently and random actions are taken less and less frequently. As a result, at the end of training, you should see high (or the highest possible) rewards being achieved in the environment.
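As a minimal sketch, a linear decay of this kind can be computed per episode as below; the function name and the exact start/end values are just illustrative, and how the value is fed back to the agent depends on how `epsilon_greedy` is wired into your training loop:

```python
def linear_epsilon(episode, num_episodes, eps_start=0.5, eps_end=0.0, final_greedy_episodes=10):
    """Linearly decay epsilon over training, then hold it at eps_end for the last episodes."""
    decay_episodes = max(num_episodes - final_greedy_episodes, 1)
    if episode >= decay_episodes:
        return eps_end
    return eps_start + (episode / decay_episodes) * (eps_end - eps_start)

# Example: a 500-episode training run.
# linear_epsilon(0, 500)   -> 0.5
# linear_epsilon(245, 500) -> 0.25
# linear_epsilon(495, 500) -> 0.0
```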

Below is the result that you’d like to see as \(\varepsilon \rightarrow 0\) with training time. With more training, the learned policy starts to successfully move the agent towards the target every time, resulting in a +1 reward.

If, on the other hand, average rewards stagnate at a low value even as your policy is queried more and more frequently, it is an indication that something isn’t set up right in your RL pipeline. My first go-to hyper-parameter to tweak would then be the learning rate.

Of course, you also don’t want the rewards to increase and then come back down again.

Q-values should be converging and racing each other

The outputs of the deep Q-network are the Q-values. They are estimates of the value of taking a given action in a particular state. The action selected by the trained policy is the argmax over all Q-values. In this environment, we allow four actions: go up, go down, go left, go right. Hence, we have four Q-values.
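As a sketch of what this looks like in code (reusing the hypothetical `q_net` from the setup above, with an illustrative observation encoding): querying the Q-network for one state yields four numbers, and the greedy action is their argmax. Logging these four numbers for a fixed probe state after every episode is what produces the Q-value curves discussed below.

```python
import numpy as np
import tensorflow as tf

# A fixed probe state: (agent position, target position). The exact encoding
# is an assumption here; use whatever encoding your environment produces.
probe_observation = tf.constant([[1.0, 2.0, 4.0, 2.0]])  # batch of size 1

# TF-Agents networks return (output, network_state); we only need the output.
q_values, _ = q_net(probe_observation)
q_values = q_values.numpy()[0]             # e.g. [Q_up, Q_down, Q_left, Q_right]

greedy_action = int(np.argmax(q_values))   # the action the trained policy would pick
```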

During training, we can inspect a selection of Q-values for fixed state transitions in the environment in order to see if the Q-values behave the way they should. Let’s monitor the Q-values for the following arrangement of the agent (blue) and target (red):

We know that for such an arrangement, the best action to take is to “go right”. Here’s how the Q-values for that state evolve with training time, in a run where the policy solves the task perfectly at the end of training:

There are a couple of items that I want to point out in the figure above.

1. The best action should always result in the maximum Q-value

Once we take the argmax over the Q-values (i.e., we execute the policy in that state) at the end of training (episode 500), the action selected is indeed to go right, because the maximum Q-value for that state is the fourth Q-value, \(Q_4\). Notice that this wasn’t necessarily the case at the beginning of training. If we zoom in on the Q-values in the early episodes, the action “go right” wouldn’t always be the winning one:

2. Q-values should converge

Notice that the Q-values converge. They plateau at the end of training and become much less noisy than at the beginning of training:

3. The maximum Q-values should approach the expected reward value

Observe that the numerical values of the final Q-values are within some ballpark of 1.0, which, not incidentally, is the maximum achievable reward over one full episode in this environment. This is what one should generally expect in deep Q-learning, even though we may not always know a priori what that maximum possible reward (the expected reward) is, especially in stochastic environments.

This is a direct consequence of how the Q-values are updated at each training step (see the Bellman equation, written here for a deterministic environment):

\(Q_{\pi}(s, a) = R(s, a) + \gamma \max_{a'} Q_{\pi}(s', a')\)

In the equation above, you can see that each Q-value is constructed from the instantaneous reward plus the discounted value of the best next action, which unrolls into a sum of discounted future rewards (assuming a deterministic environment). Generally, you can expect that the Q-values will converge to within some ballpark of the maximum possible expected return, which also depends on the discount factor, \(\gamma\).
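To make the ballpark concrete, consider the deterministic grid world above, where the only non-zero reward is the +1 received upon reaching the target. If the agent is \(n\) steps away from the target and follows the optimal path, unrolling the equation above gives

\(Q(s, a^*) = 0 + \gamma \cdot 0 + \dots + \gamma^{\,n-1} \cdot 1 = \gamma^{\,n-1}\)

For example, with \(\gamma = 0.9\) (an assumed value, not necessarily the one used for the figures) and the agent two steps away from the target, the maximum Q-value should settle near 0.9; one step away, near 1.0. The closer \(\gamma\) is to 1 and the shorter the optimal path, the closer the converged Q-values get to the maximum reward.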

4. Q-values should stick together and race each other

The deep Q-learning algorithm relies on taking the argmax over all Q-values to determine the right action at each state of the environment. Assuming that all actions are necessary and taken frequently in the environment, we can expect that the numeric differences between the Q-values shouldn’t be too large. If one Q-value persistently drifts away from the others, then either its action will never be taken (it’s always the smallest Q-value) or it wins in every state (it’s always the largest Q-value). If that’s the case, then we’re not really learning a policy that can switch between actions as needed. (The exception is environments with a clear preference for a subset of actions, where the other actions should rarely be taken.) For a well-trained policy, the magnitudes of the Q-values should change slightly from state to state to allow the appropriate action to be selected in each state. In other words, the Q-values should stick together and always be racing each other across states. The figure below, on the other hand, is an indication that something did not go well during training:

While the argmax over these Q-values will still return the action “go right”, and while all Q-values seem to converge at the end of training, the general trends of these Q-values are different: the Q-values for “go up”, “go down”, and “go left” drifted away from the Q-value for “go right”.

The reason you do not want large differences between the Q-values is that, in nearby states, the selected best action should still be able to change rapidly. For example, if the agent is in any of the four tiles directly adjacent to the target, the best action changes from “go up” to “go down” to “go left” to “go right”, but the state (which is the input to the deep Q-network) does not change that much. The best course of action is for the network to learn to make tiny adjustments in the Q-values that allow the selected action to switch for a tiny difference in the state. Otherwise, if there are huge differences between the Q-values in one state, the network outputs might have a hard time switching to a different action in a similar state, and the policy may fail. In other words, you do not want the deep Q-network to become too stiff.
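One way to keep an eye on this during training is to track the spread between the largest and smallest Q-value for a handful of probe states. The sketch below is a hypothetical helper, not part of TF-Agents; it assumes the `q_net` from earlier and probe observations encoded the same way the environment encodes its observations:

```python
import numpy as np

def q_value_spread(q_net, probe_observations):
    """Return the max-min gap of the Q-values for each probe observation.

    A gap that keeps growing over training, for states where different actions
    should win, hints that the Q-network is becoming too 'stiff'.
    """
    gaps = []
    for obs in probe_observations:          # numpy arrays, one per probe state
        q_values, _ = q_net(obs[None, ...])  # add a batch dimension
        q_values = q_values.numpy()[0]
        gaps.append(float(q_values.max() - q_values.min()))
    return np.array(gaps)
```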

Looking at training loss

As the Q-values converge to fixed values, the loss should converge too:

But there’s a caveat…

Why is the loss so noisy?

If you’re used to training deep neural networks for simple supervised learning tasks (such as regression), you may expect loss functions (e.g., mean-squared-error losses) to converge nicely and become almost flat at the end of training, especially when decaying the learning rate to an appropriately small value.

If you’re now starting to learn RL, you will likely be terrified at the observation that the mean-squared-error loss for the Q-values is really noisy even if the RL training seems to be performing quite well policy-wise! Well, let’s discuss the reasons for this!

Remember what I mentioned about the Q-values needing to race each other? Well, training deep Q-learning is less “stable” than training a supervised deep learning model with known ground truth, because we do not really provide the ground truth for the Q-values! The ground truth gets updated with more and more experience of interacting with the environment. So one may expect a lot of noise, especially at the beginning of training, while the ground truth is still establishing itself.

Moreover, imagine that we have the optimal policy but now add a constant value to all Q-values every time we query the policy. Would that change the verdict of the argmax? No. So the Q-values can shift by a constant and still result in correct policy behavior. But this can really confuse error-based loss functions, since they may be bouncing from one set of Q-values to another, even though the policy still executes the task in the environment perfectly. In theory, the Q-values are guaranteed to converge as long as \(0.0 < \gamma < 1.0\) (see Chapter 2 of Sutton & Barto). In practice, gradient descent handles this well in some cases and makes the Q-values converge to their true values, but it may not in others. If between episodes the Q-values shift by a constant (even slightly), this can instantaneously increase an error-based loss, which is why you may see more noise than what you’re used to seeing in supervised learning.
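Here is a tiny sketch (with made-up numbers) of the argument above: shifting all Q-values by a constant leaves the argmax, and hence the policy’s behavior, unchanged, yet a squared-error comparison against the previous values is suddenly far from zero.

```python
import numpy as np

q_old = np.array([0.62, 0.60, 0.61, 0.81])  # Q-values for [up, down, left, right]
q_new = q_old + 0.15                        # the whole set shifts by a constant

print(np.argmax(q_old), np.argmax(q_new))   # 3 3 -> the greedy action is unchanged
print(np.mean((q_new - q_old) ** 2))        # 0.0225 -> but the squared error is not zero
```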

The orchestration of hyper-parameters

Here’s the quirky bit of RL: the same exploration probability (\(\varepsilon\)) decay and learning rate (\(\alpha\)) decay, spread over different durations of training, can lead to different training outcomes! In other words, there seems to be a right way to orchestrate the decay of \(\varepsilon\) and \(\alpha\) over the duration of training, all other hyper-parameters being equal. Below is a small illustration where I spread the same decay schemes over either 500 or 1000 episodes. The exploration probability decays linearly from 0.1 at the first episode to 0.0 at the last episode (either 500 or 1000). In the same way, the cosine learning rate decay is spread over the episodes, from \(10^{-2}\) at the first episode to \(10^{-5}\) at the last episode.
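For reference, here is a sketch of how such a cosine schedule could be set up with Keras. The number of gradient steps per episode, and the choice to decay per training step rather than per episode, are assumptions on my part:

```python
import tensorflow as tf

num_episodes = 500          # or 1000 for the longer run
steps_per_episode = 20      # as in the environment described above
decay_steps = num_episodes * steps_per_episode

# Cosine decay from 1e-2 down to 1e-5 (alpha is the final lr as a fraction of the initial).
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-2,
    decay_steps=decay_steps,
    alpha=1e-3,  # 1e-2 * 1e-3 = 1e-5
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```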

Take a look at the outcomes of these two training runs, the first of which ran for 500 episodes and the second for 1000 episodes:

This first figure shows us that learning the policy perfectly in this environment is possible with just 500 episodes.

But if all we had ever tried was 1000 episodes, we would have also learned the policy perfectly, yet we would not have known that a much shorter training period could have already accomplished the task! Notice that here, at episode 500, we’re still not solving the task correctly even half of the time, even though \(\varepsilon = 0.05\) at episode 500 (i.e., random actions are taken only 5% of the time, a reasonably small number that a well-trained policy should be able to correct for by taking on-policy actions the remaining 95% of the time).

So there are always two questions: How much training is too much? And how little training can we get away with? This is where training RL becomes a bit of an art! This small 6-by-4 grid world is a very simple task, and I expected the RL agent not to need too many trials to learn it. 500 episodes with 20 steps in the environment per episode was more than enough to learn the perfect policy. A more complicated task would require more training time. With more experience training across environments of varying complexity, you will start to be able to make good guesses about the required training time.