Coding the policy gradient method in PyTorch
The goal of reinforcement learning (RL) is to learn the optimal policy, \(\pi^*\), for behaving (acting) in an environment to accomplish a specific task. Policies are simply functions that return an action to execute in a given state. When a policy is stochastic, it is a distribution over actions conditioned on the state, \(\pi(a | s)\); when it is deterministic, it simply computes \(a = \pi(s)\).
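To make the distinction concrete, here is a minimal PyTorch sketch of the two cases. The network sizes, layer choices, and the Gaussian parameterization are illustrative assumptions of mine, not the specific setup used later in this post:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # illustrative sizes, not from this post

# Deterministic policy: a = pi(s), a plain mapping from state to action
deterministic_pi = nn.Sequential(
    nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim)
)

# Stochastic policy: pi(a | s), a distribution over actions conditioned on the state
class StochasticPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned spread

    def forward(self, s):
        # Return a Gaussian over actions for this state
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

s = torch.randn(state_dim)
a_det = deterministic_pi(s)               # same action every time for this state
a_sto = StochasticPolicy()(s).sample()    # sampled action; repeated calls differ
```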
There are two main approaches to solving RL tasks, or, to be precise, two ways of thinking about obtaining policies…
RL based on consulting a learned optimal value function, \(q^*(a, s)\), to determine the next action is commonly used for discrete action spaces. This approach first estimates the value of being in each state of the environment and executing a specific action in that state. The policy then becomes a function that always selects the action maximizing this value function, \(\pi: a = \text{argmax}_a\, q^*(a, s)\). But we can also skip learning value functions altogether and directly learn the parameters, \(\pmb{\theta}\), of the policy \(\pi(a | s, \pmb{\theta})\).
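For contrast with the policy sketch above, a minimal sketch of the value-based route for a discrete action space could look like the following; the Q-network architecture and sizes are placeholders of my own, not a prescribed design:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 5  # illustrative sizes for a discrete action space

# A network approximating the action values q(a, s); one output per action
q_net = nn.Sequential(
    nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions)
)

s = torch.randn(state_dim)
q_values = q_net(s)                  # estimated value of each discrete action in state s
a = torch.argmax(q_values).item()    # greedy policy: always pick the highest-valued action
```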
Often, to navigate real-world systems we would like to be able to take continuous actions as opposed to discrete actions. The former is a much more powerful approach for practical engineering applications because control variables of real-world systems are often continuous—think of velocity, acceleration, torque, direction, temperature, etc. A single action can then immediately scale to the necessary magnitude, instead of requiring several incremental actions to reach the same magnitude. This concept is visualized in the two figures below, using the example of spatial navigation toward a target located at the maximum of a scalar function. Far fewer actions are required to reach the target if the agent is allowed to move in the continuous space of (angle, step size) rather than in the discrete space of (up, down, right, left, stay). Notice that the RL agent trained in the continuous action space first traverses the low-value region with larger step sizes, and only starts to take finer steps as it approaches the local maximum.
Discrete action space: Action is one of up, down, right, left, or stay. About 200 actions were needed to reach the local maximum in the scalar value.
Continuous action space: Action is a tuple composed of angle and step size. About 15 actions were needed to reach the local maximum in the scalar value.
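As a toy illustration of the difference (the position update and the numbers below are my own assumptions, not taken from the figures), compare how a single action moves the agent in the two cases:

```python
import math

pos = (0.0, 0.0)

# Discrete action space: one of five fixed, unit-length moves
discrete_moves = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0), "stay": (0, 0)}
dx, dy = discrete_moves["right"]
pos_discrete = (pos[0] + dx, pos[1] + dy)           # always a unit step along an axis

# Continuous action space: an (angle, step size) tuple reaches any nearby point in one action
angle, step = 0.3, 2.7                              # e.g. sampled from the learned policy
pos_continuous = (pos[0] + step * math.cos(angle),
                  pos[1] + step * math.sin(angle))  # arbitrary direction and magnitude at once
```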
In this post, I show you how to code the policy gradient method, which allows an RL agent to navigate continuous action spaces by sampling actions from learned probability distributions!
Let’s talk about sigmoid…
…because what I really want to talk about is softmax.
We’ll start by understanding the softmax function.