In the previous post, we explored how to extend Reinforcement Learning (RL) beyond the tabular setting using function approximation. While this allowed us to generalize across states, our experiments also revealed an important limitation: in simple environments like GridWorld, approximate methods can struggle to match the stability and efficiency of tabular approaches. The main reason is that learning a good representation is itself a difficult problem, one that can outweigh the benefits of generalization when the state space is still relatively small.
To truly unlock the power of function approximation, we therefore need to move to environments where tabular methods are no longer viable. This naturally leads us to multi-player games, where the state space grows combinatorially and generalization becomes essential. It also fits neatly into this post series: so far, we have not managed to learn any meaningful behavior in more complex multi-player environments. In this post, we take this step: we consider the classic game of Connect Four and investigate how to learn strong policies using Deep Q-Learning.
From Sarsa to Deep Q-Learning
To tackle this task, we extend our framework along several important dimensions.
First, we move from online updates to a batched training setup. In our earlier implementation of Sarsa, we updated the model after every transition. While faithful to the original algorithm [1], this approach is computationally inefficient: each optimizer step incurs a non-trivial cost, and modern hardware—especially GPUs—is designed to operate on batches with only marginal additional overhead.
To address this, we introduce a replay buffer. Instead of updating immediately, we store transitions as they are encountered—either up to a fixed capacity or, in our case, until one or multiple games have finished. We then perform a batched update over this collected experience. This not only improves computational efficiency but also stabilizes learning by reducing the variance of individual updates.
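To make this concrete, here is a minimal sketch of such a buffer. The names Transition and ReplayBuffer are illustrative and not necessarily the classes used in the repository:

import random
from collections import deque, namedtuple

# One transition: state, action, reward, next state, terminal flag.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # Oldest transitions are dropped once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size):
        # Draw a random batch of past transitions for one batched update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)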
At this point, an important conceptual shift occurs. By sampling from past experience rather than strictly following the current policy, we move away from Sarsa—an on-policy method—towards Q-learning, which is off-policy. While we have not formally reintroduced Q-learning in the function approximation setting here, the extension from the tabular case is largely straightforward. This combination of replay buffers and Q-learning forms the foundation of Deep Q-Networks (DQNs), popularized by DeepMind in their seminal work on Atari games [2].
Finally, we turn to scalability. Reinforcement learning is inherently data-hungry, so increasing throughput is crucial. To this end, we implement a vectorized environment wrapper that allows us to simulate multiple games of Connect Four in parallel. Concretely, a single call to step(a) now processes a batch of actions and advances all environments simultaneously.
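As a rough sketch, such a wrapper can be as simple as looping over a list of environments. Here I assume a simplified reset()/step() interface that returns states, rewards, and done flags; the actual PettingZoo-based implementation is more involved:

import numpy as np

class VectorizedEnv:
    def __init__(self, make_env, num_envs):
        # One independent environment instance per parallel game.
        self.envs = [make_env() for _ in range(num_envs)]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        # `actions` contains one action per environment.
        states, rewards, dones = [], [], []
        for env, a in zip(self.envs, actions):
            s, r, d = env.step(a)
            if d:
                # Restart finished games so the batch always stays full.
                s = env.reset()
            states.append(s)
            rewards.append(r)
            dones.append(d)
        return np.stack(states), np.array(rewards), np.array(dones)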
In practice, however, achieving true parallelism in Python is non-trivial. The Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, which limits the effectiveness of multi-threading for CPU-bound workloads such as environment stepping. We also experimented with multi-processing, but found that the additional overhead (e.g., inter-process communication) largely offset any gains in our setting. For the interested reader, I recommend an earlier post of mine on this topic.
Despite these limitations, the combination of batched updates and environment vectorization yields a substantial improvement in throughput, increasing performance to approximately 50–100 games per second.
Implementation
In this post, I deliberately avoid going into too much detail on the environment vectorization and instead focus on the RL aspects. Partly, this is because the vectorization itself is “just” an implementation detail—but also because, in all honesty, our current setup is not ideal. Much of this is due to limitations imposed by the PettingZoo environment we are using.
In future posts, we will explore different environments and revisit this topic with a stronger emphasis on scalability—a crucial aspect of modern reinforcement learning. For a more detailed discussion of how we structure multi-player environments, manage agents, and maintain an opponent pool, I refer to my earlier post on multi-player RL. The vectorized setup used here is simply an extension of that framework to multiple games running in parallel. As always, the full implementation is available on GitHub.
Revisiting Q-Learning
Let us briefly revisit Q-learning and connect it to our implementation.
The core update rule is given by:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
In contrast to Sarsa, which uses the action actually taken in the next state, Q-learning uses a max operator over all possible next actions. This makes it off-policy, as the update does not depend on the behavior policy used to generate the data. In practice, this often leads to faster propagation of value information, especially in deterministic environments such as board games.
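For comparison, the Sarsa update bootstraps from the action $a'$ that was actually selected in the next state:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$$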
When combined with neural networks, this approach is commonly referred to as Deep Q-Learning. Instead of maintaining a table of values, we train a neural network $Q_\theta(s, a)$ to approximate the action-value function. The update is then implemented as a regression problem, minimizing the difference between the current estimate and a bootstrapped target:

$$L(\theta) = \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2$$
In our implementation, this corresponds directly to the batch_update function. Given a batch of transitions $(s, a, r, s', \text{done})$, we first compute the predicted Q-values for the taken actions:
q = self.q(batch.states, ...)
q_sa = q.gather(1, batch.actions.unsqueeze(1)).squeeze(1)
Next, we construct the target using the maximum Q-value of the next state. Since not all actions are legal in Connect Four, we apply a mask to ensure that only valid moves are considered:
q_next = self.q(batch.next_states, ...)
q_next_masked = q_next.masked_fill(~legal, float("-inf"))
max_next = q_next_masked.max(dim=1).values
Finally, we combine the reward and the discounted next-state value, taking care to handle terminal states correctly:
target = batch.rewards + gamma * (~batch.dones).float() * max_next
The network is then trained by minimizing the Huber loss (a more robust variant of mean squared error):
loss = F.smooth_l1_loss(q_sa, target)
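The remaining step is a standard gradient update. A minimal sketch, assuming an optimizer stored as self.optimizer (not shown in the snippets above):

self.optimizer.zero_grad()  # clear gradients from the previous batch
loss.backward()             # backpropagate the Huber loss
self.optimizer.step()       # update the Q-network parameters

Note that the target is typically computed without tracking gradients (e.g., under torch.no_grad() or by detaching it), so that the loss only propagates through the prediction q_sa.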
This batch-based formulation allows us to efficiently reuse experience collected from multiple parallel games, which is crucial for scaling to more complex environments. At the same time, it highlights a key challenge of Deep Q-Learning: the targets themselves depend on the current network, which can lead to instability during training.
For an additional reference, the official PyTorch tutorial on Deep Q-Learning provides a helpful complementary perspective.
Results
With that in place, let us turn to the results. To put them into perspective, we first recall how the tabular methods performed on this task. After 100,000 steps, most policies were still closely clustered in terms of win rate. In particular, even a random policy achieved roughly 50% win rate, indicating that none of the learned policies had managed to outperform chance in a meaningful way.

In the following experiment, we focus on two agents: our DQN and a random baseline. Due to the previously introduced “zoo” setup, the DQN is not a single fixed policy but a pool of evolving agents. We continuously add new versions and prune weaker ones, which gradually increases the overall strength of the opponent pool.
This has an important implication for interpreting the metrics: the win rate of “DQN vs. DQN” naturally hovers around 50%, since agents of similar strength compete against each other. A more informative signal is therefore the performance of the random policy. As the DQN improves, the random agent should win less frequently.
With that in mind, let us look at the performance curve:

We observe several interesting effects. Most notably, the win rate of the random policy drops significantly faster than in the tabular setting—clear evidence that the DQN is indeed learning the game. However, after around one million steps, the improvement plateaus, with the random policy still winning roughly 20% of games.
To better understand what this means in practice, we can evaluate the learned policy against a human player. In the following example, I take the role of the red player going first:

The result is quite revealing. The agent has clearly learned to play offensively—it actively pursues its own four-in-a-row. However, it struggles with defensive play, failing to anticipate and block simple opponent threats.
This is probably a bit of a disappointment, but we will come back to it. In future posts we will learn how to scale better, learn faster, and beat humans (at many things). Writing this post series about Sutton’s great book has been an amazing journey (and there are still a few posts left), but we have simply outgrown the very general framework we started with, which was designed to showcase all the algorithms in [1], both tabular and approximate solution methods. Thus, specialization is the way to go, and in the future we will do exactly that: writing highly efficient, custom-tailored methods for different problems.
Conclusion
In this post, we moved from tabular Sarsa to Deep Q-Learning, introducing replay buffers, batched updates, and function approximation. We applied this to Connect Four, a multi-player game we previously failed to solve with tabular methods, with a clear result: our agent is no longer stuck at chance level—it learns, improves, and consistently outperforms a random policy.
But just as importantly, we also see the limits.
Even after extensive training, the agent plateaus and still exhibits clear weaknesses—most notably in defensive play. This is not just a matter of “more training.” In multi-player settings, the problem itself becomes harder: opponents evolve, the environment is no longer stationary, and learning targets keep shifting.
This is where the real challenge begins.
Up to this point, our framework (loosely following [1]) has prioritized generality and clarity. But to go further, that is no longer enough. Performance requires specialization.
In the next posts, we will first continue following [1], and then focus on exactly that: building faster, more stable, and more scalable systems, pushing beyond simple baselines towards agents that can truly compete.
Other Posts in this Series
References