Control of a bioreactor using reinforcement learning
Zero-dimensional model of a bioreactor
A bioreactor is a dynamical system that can be modeled as a zero-dimensional, perfectly stirred reactor where biomass is produced from a substrate. The substrate is continually added to the bioreactor at the rate \(F\) (in \(L/h\)), and the products of the bioreactor are expelled at the same rate, thus maintaining a constant volume, \(V\), in the reactor.
The system of initial value ordinary differential equations (ODEs) that describes this process can be written as:
\(\begin{equation} \begin{cases} \frac{d X}{d t} = \left( \mu(S) - D \right) X \\ \frac{d S}{d t} = D \left( S_{\text{in}} - S \right) - \frac{\mu(S)}{Y_d} X \end{cases} \end{equation}\)
with the initial conditions
\(\begin{equation} \begin{cases} X(t = 0) = X_0 \\ S(t = 0) = S_0 \end{cases} \end{equation}\)
and where \(X\) is the biomass concentration in \(g/L\), \(S\) is the substrate concentration in \(g/L\), \(S_{\text{in}}\) is the substrate concentration in the feed in \(g/L\), \(D = F / V\) is the dilution rate in \(1/h\), \(\mu(S)\) is the biomass growth rate in \(1/h\), and \(Y_d\) is the yield coefficient specifying how much biomass can be obtained from the unit mass of substrate.
The biomass growth rate is often modeled using the Monod model [1]:
\(\begin{equation} \mu(S) = \mu_{\text{max}} \frac{S}{K_S + S} \end{equation}\)
where \(\mu_{\text{max}}\) is the maximum growth rate in \(1/h\) and \(K_S\) is the half-saturation constant in \(g/L\).
For more realism, we may also model the gradual decay of biomass when substrate is not continually provided. With the current ODE model, \(X\) stays constant once \(D \rightarrow 0\) and \(S \rightarrow 0\); realistically, without fresh substrate the biomass eventually starves. This can be modeled by an additional decay rate, \(k_d\) (in \(1/h\)), in the first ODE:
\(\begin{equation} \frac{d X}{d t} = \left( \mu(S) - D - k_d \right) X \end{equation}\)
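As a sketch, the right-hand side of this ODE system (including the decay term) can be written as a small Python function; the function and argument names are placeholders, and the default parameter values are taken from the table below:

```python
def monod(s, mu_max, k_s):
    """Monod growth rate: mu(S) = mu_max * S / (K_S + S)."""
    return mu_max * s / (k_s + s)

def bioreactor_rhs(x, s, d, mu_max=0.4, k_s=0.1, s_in=10.0, k_d=0.01, y_d=0.5):
    """Right-hand side of the bioreactor ODEs with the decay term k_d.

    x: biomass concentration (g/L), s: substrate concentration (g/L),
    d: dilution rate (1/h). Returns (dX/dt, dS/dt).
    """
    mu = monod(s, mu_max, k_s)
    dx_dt = (mu - d - k_d) * x
    ds_dt = d * (s_in - s) - mu / y_d * x
    return dx_dt, ds_dt
```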
Bioreactor parameters
We are going to use the following parameters for the bioreactor:
| Parameter | Value |
|---|---|
| \(\mu_{\text{max}}\) | \(0.4 \,\, 1/h\) |
| \(K_S\) | \(0.1 \,\, g/L\) |
| \(S_{\text{in}}\) | \(10.0 \,\, g/L\) |
| \(k_d\) | \(0.01 \,\, 1/h\) |
| \(Y_d\) | \(0.5 \,\, gX/gS\) |
| Max. \(X\) | \(10.0 \,\, g/L\) |
| Max. \(S\) | \(20.0 \,\, g/L\) |
Control action
This dynamical system is controlled by setting the dilution rate, \(D\), so as to maximize the rate of biomass expelled from the reactor, \(D X\), also known as the reactor’s productivity. Note that too low a \(D\) starves the biomass of nutrients, hampering its growth and yielding equally small biomass output. But too high a \(D\) can lead to reactor washout, i.e., the complete removal of biomass from the reactor over time.
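The washout condition follows directly from the first ODE: since \(\mu(S) \le \mu_{\text{max}}\) for any \(S\),

\(\begin{equation} \frac{d X}{d t} = \left( \mu(S) - D - k_d \right) X < 0 \quad \text{whenever} \quad D > \mu_{\text{max}} - k_d \end{equation}\)

so for such \(D\) the biomass decays regardless of how much substrate is supplied. Conversely, a nontrivial steady state requires \(\mu(S^*) = D + k_d\), which with the Monod model gives

\(\begin{equation} S^* = \frac{K_S \left( D + k_d \right)}{\mu_{\text{max}} - D - k_d} \end{equation}\)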
Below, we visualize a couple of scenarios starting from initial condition
\(\begin{equation} \begin{cases} X(t = 0) = 1.0 \frac{g}{L} \\ S(t = 0) = 5.0 \frac{g}{L} \end{cases} \end{equation}\)
Good control
First, a good control case where the dilution rate is neither too low nor too high. With just the right amount of substrate continually supplied to the reactor, the biomass concentration settles into a steady state at a relatively high value.
Zero dilution rate
Second, a zero dilution rate causes the biomass to gradually decay due to starvation.
High dilution rate
Third, a too high dilution rate causes washout: at some point in time the biomass is completely removed from the bioreactor, and its production cannot be restored.
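These scenarios can be reproduced with a simple fixed-step integration of the ODEs above. Below is a minimal sketch using forward Euler and the parameter values from the table (a production code would use an adaptive ODE solver):

```python
def simulate(x0, s0, d, t_end=100.0, dt=0.01,
             mu_max=0.4, k_s=0.1, s_in=10.0, k_d=0.01, y_d=0.5):
    """Integrate the bioreactor ODEs with forward Euler at a constant dilution rate d."""
    x, s = x0, s0
    for _ in range(int(t_end / dt)):
        mu = mu_max * s / (k_s + s)
        dx = (mu - d - k_d) * x
        ds = d * (s_in - s) - mu / y_d * x
        x = max(x + dt * dx, 0.0)
        s = max(s + dt * ds, 0.0)
    return x, s

# Good control: a moderate dilution rate reaches a productive steady state.
x_good, _ = simulate(1.0, 5.0, d=0.3)
# Washout: d exceeds mu_max - k_d, so the biomass decays to zero.
x_washout, _ = simulate(1.0, 5.0, d=0.8)
```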
Setting up reinforcement learning training
We are going to use the following training parameters:
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Initial \(\alpha\) | \(1 \cdot 10^{-4}\) |
| Final \(\alpha\) | \(1 \cdot 10^{-5}\) |
| \(\alpha\) decay | Cosine |
| Discount factor, \(\gamma\) | \(0.99\) |
| Policy network architecture | in-64-64-out |
| Activation function | ReLU |
| Number of episodes | 2000 |
| Random seed | 100 |
The state is determined by five values that are inputs to the policy network:
- The current \(X\)
- The current \(S\)
- The difference between the current and previous \(X\), \(\Delta X\)
- The difference between the current and previous \(S\), \(\Delta S\)
- The previous action, \(D_{\text{prev}}\)
The continuous action is the value of \(D\), which we assume is bounded between \(0.0 \,\, 1/h\) and \(1.0 \,\, 1/h\). This is the output of the policy network.
The reward is computed as \(D \cdot X\).
We assume that washout occurs if \(X\) drops below \(10^{-3} \,\, g/L\); it is penalized with an additional reward of \(-10.0\).
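Putting the state, action, reward, and washout definitions together, the environment’s step logic can be sketched as follows. This is a simplified pure-Python version, not the actual TF-Agents environment wrapper; the control-step length `dt_action` and all names are assumptions:

```python
WASHOUT_THRESHOLD = 1.0e-3  # g/L
WASHOUT_PENALTY = -10.0

class BioreactorEnv:
    """Minimal sketch of the control environment (not the TF-Agents wrapper)."""

    def __init__(self, x0, s0, dt_action=1.0):
        self.x, self.s = x0, s0
        self.x_prev, self.s_prev = x0, s0
        self.d_prev = 0.0
        self.dt_action = dt_action  # hours simulated per control step (assumed)

    def observation(self):
        # The five state variables fed to the policy network.
        return [self.x, self.s,
                self.x - self.x_prev, self.s - self.s_prev,
                self.d_prev]

    def step(self, d):
        d = min(max(d, 0.0), 1.0)  # action bounds in 1/h
        self.x_prev, self.s_prev = self.x, self.s
        self._integrate(d)
        self.d_prev = d
        reward = d * self.x  # productivity D * X
        done = self.x < WASHOUT_THRESHOLD
        if done:
            reward += WASHOUT_PENALTY
        return self.observation(), reward, done

    def _integrate(self, d, dt=0.01, mu_max=0.4, k_s=0.1,
                   s_in=10.0, k_d=0.01, y_d=0.5):
        # Forward-Euler integration of the bioreactor ODEs over dt_action.
        for _ in range(int(self.dt_action / dt)):
            mu = mu_max * self.s / (k_s + self.s)
            self.x += dt * (mu - d - k_d) * self.x
            self.s += dt * (d * (s_in - self.s) - mu / y_d * self.x)
            self.s = max(self.s, 0.0)
```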
REINFORCE model for bioreactor control
We are using the REINFORCE [2] agent from TF-Agents [3]:
```python
agent = reinforce_agent.ReinforceAgent(
    time_step_spec=train_env.time_step_spec(),
    action_spec=train_env.action_spec(),
    actor_network=actor_net,
    optimizer=agent_optimizer,
    train_step_counter=train_step_counter,
    gamma=discount_factor,
    normalize_returns=True,
    entropy_regularization=None,
)
```
where the policy network is a fully-connected dense neural network:
```python
actor_net = actor_distribution_network.ActorDistributionNetwork(
    input_tensor_spec=train_env.observation_spec(),
    output_tensor_spec=train_env.action_spec(),
    fc_layer_params=(64, 64),
    activation_fn=tf.keras.activations.relu,
    kernel_initializer=tf.keras.initializers.GlorotUniform(seed=random_seed),
    seed=random_seed,
)
```
Below, we show how the average per-step reward develops during the 2000 training episodes. Note that the theoretical maximum instantaneous reward is \(D_{\text{max}} \cdot X_{\text{max}} = 1 \,\, 1/h \cdot 10.0 \,\, g/L = 10.0 \frac{g}{Lh}\), but in practice this may not correspond to the optimal operating conditions of the bioreactor, as maintaining the highest possible \(D\) may lead to reactor washout. The highest achievable \(D \cdot X\) is determined by the reactor’s parameters. For example, a higher yield may allow for a higher \(D\), since more biomass is produced from the substrate at any instant in time and washout is less likely.
Comparison with a constant action
After training, we compare the RL control with a constant action, \(D = \text{const}\). We find the best possible constant action, i.e., the one that leads to the highest cumulative \(D \cdot X\), which turns out to be at \(D \approx 0.3 \,\, 1/h\). The nonlinear RL control slightly outperforms this constant action for this initial condition:
\(\begin{equation} \begin{cases} X(t = 0) = 1.0 \frac{g}{L} \\ S(t = 0) = 2.0 \frac{g}{L} \end{cases} \end{equation}\)
Note that if the constant action matched the steady-state value of the RL control action, it would lead to a much worse outcome:
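The best constant action can be found with a simple sweep over \(D\), accumulating \(D \cdot X\) over the episode. Below is a sketch reusing the forward-Euler integration; the 50-hour episode length is an assumption:

```python
def cumulative_productivity(d, x0=1.0, s0=2.0, t_end=50.0, dt=0.01,
                            mu_max=0.4, k_s=0.1, s_in=10.0, k_d=0.01, y_d=0.5):
    """Integrate the bioreactor ODEs at constant dilution rate d,
    accumulating the productivity D * X over the episode."""
    x, s, total = x0, s0, 0.0
    for _ in range(int(t_end / dt)):
        mu = mu_max * s / (k_s + s)
        x_new = max(x + dt * (mu - d - k_d) * x, 0.0)
        s_new = max(s + dt * (d * (s_in - s) - mu / y_d * x), 0.0)
        x, s = x_new, s_new
        total += d * x * dt
    return total

# Coarse sweep over constant dilution rates in (0, 1).
candidates = [round(0.05 * i, 2) for i in range(1, 20)]
best_d = max(candidates, key=cumulative_productivity)
```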
Challenging initial condition
Let’s also start the bioreactor at a challenging initial condition that is close to washout:
\(\begin{equation} \begin{cases} X(t = 0) = 0.002 \frac{g}{L} \\ S(t = 0) = 0.0 \frac{g}{L} \end{cases} \end{equation}\)
This time, the best constant action has to be decreased to \(D \approx 0.15 \,\, 1/h\). The RL control first decreases \(D\) in order to prevent washout and starts to increase \(D\) only after a sufficient amount of biomass has accumulated in the bioreactor.
Response to perturbations
Finally, let’s introduce a perturbation in the reactor to see how the RL agent responds: we suddenly drop \(X\) in the reactor by \(3.0 \,\, g/L\). The RL agent chooses to decrease \(D\) for a while after this event, most likely to avoid a potential washout. However, there is a better constant action, \(D \approx 0.3 \,\, 1/h\), that leads to higher cumulative productivity. There is room to improve the RL training 🙂