
Parallelization in Reinforcement Learning

2 Sep. 2020

Implementation matters in Reinforcement Learning (RL), and so does parallelization. Almost every RL framework, ranging from simple[1] to complicated[2], includes code for parallelization. However, it is often unclear which parts of RL are parallelized and which are not until you dig into the source code. In this post, I will discuss common parallelization practices in different parts of RL. Hopefully this post can serve as a summary and offer some help when you choose an RL framework.

[Figure: several Env instances wrapped into a VecEnv that returns a BatchObs]

First, let us talk about parallelizing environment steps. This is perhaps the easiest one to implement, and it works regardless of the policy or algorithm. Usually it is done by implementing vectorized environments, where each environment lives in its own process. Speedup is gained when the environment is slow to step/reset; otherwise it may be more efficient to simply step the environments in a for-loop. Note that synchronization is needed in order to return a batch of observations, so this approach is not well suited to environments with heterogeneous workloads (where different environments can take orders of magnitude more or less time to simulate)[3]. In that case, synchronization causes delays due to stragglers. Asynchronous parallelization can avoid such delays, but then the batch size of the returned observations is no longer fixed. A timeout argument can be used to trade off between asynchrony and batch size.
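
To make the synchronous flavor concrete, below is a minimal sketch of a process-based vectorized environment. It assumes Gym-style environments and the default fork start method on Linux (the lambdas would not pickle under spawn); the class and function names are illustrative, not taken from any particular framework.

```python
import multiprocessing as mp

import gym
import numpy as np


def _worker(remote, make_env):
    """Run one environment in its own process and serve step/reset requests."""
    env = make_env()
    while True:
        cmd, data = remote.recv()
        if cmd == "step":
            obs, reward, done, info = env.step(data)
            if done:  # auto-reset so every slot always returns a fresh observation
                obs = env.reset()
            remote.send((obs, reward, done, info))
        elif cmd == "reset":
            remote.send(env.reset())
        elif cmd == "close":
            env.close()
            remote.close()
            break


class SyncVecEnv:
    """Step N environments in parallel processes; block until all N results arrive."""

    def __init__(self, make_env_fns):
        self.remotes, worker_remotes = zip(*[mp.Pipe() for _ in make_env_fns])
        self.procs = [
            mp.Process(target=_worker, args=(wr, fn), daemon=True)
            for wr, fn in zip(worker_remotes, make_env_fns)
        ]
        for p in self.procs:
            p.start()

    def reset(self):
        for remote in self.remotes:
            remote.send(("reset", None))
        return np.stack([remote.recv() for remote in self.remotes])

    def step(self, actions):
        for remote, action in zip(self.remotes, actions):
            remote.send(("step", action))
        # Synchronization point: one slow environment (straggler) delays the whole batch.
        obs, rewards, dones, infos = zip(*[remote.recv() for remote in self.remotes])
        return np.stack(obs), np.array(rewards), np.array(dones), infos

    def close(self):
        for remote in self.remotes:
            remote.send(("close", None))
        for p in self.procs:
            p.join()


if __name__ == "__main__":
    vec_env = SyncVecEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
    batch_obs = vec_env.reset()                                   # shape (4, 4)
    batch_obs, rewards, dones, _ = vec_env.step(np.zeros(4, dtype=np.int64))
    vec_env.close()
```

An asynchronous variant would poll the pipes and return whichever results are ready (optionally bounded by a timeout), which is where the variable batch size mentioned above comes from.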

[Figure: individual Obs stacked into a BatchObs and fed to the Network]

One reason to use vectorized environments is that we can feed a batch of observations to the neural network at once and gain speedup from batched inference. This kind of parallelization is often implemented implicitly (inside parallelized environment steps or parallelized agent steps), though in my opinion it is clearer to treat it separately. First, batching observations applies to both the actor and the learner, while parallelizing environments focuses on experience collection. Second, how you batch observations does not depend on how the environments are parallelized. For example, you can call env.step() sequentially and still batch the collected observations together, or you can combine several BatchObs into a larger batch.
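
As a small illustration of this separation, the sketch below steps the environments sequentially in a plain loop yet still performs a single batched forward pass. PyTorch and Gym are assumed purely for illustration; any framework that accepts batched inputs works the same way.

```python
import gym
import numpy as np
import torch
import torch.nn as nn

envs = [gym.make("CartPole-v1") for _ in range(8)]
obs_batch = np.stack([env.reset() for env in envs])               # shape (8, 4)

# Toy policy network; CartPole has 4 observation dims and 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

with torch.no_grad():
    logits = policy(torch.as_tensor(obs_batch, dtype=torch.float32))
    actions = torch.distributions.Categorical(logits=logits).sample().numpy()

# The envs are stepped one by one here, yet inference above was still batched:
# batching observations and parallelizing env steps are orthogonal choices.
next_obs_batch = np.stack([env.step(int(a))[0] for env, a in zip(envs, actions)])
```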

[Figure: multiple Agents exchanging data with a central Buffer / Replay / Parameter Server]

Next, let us look at parallelizing agents. Here "agent" is a general word that may refer to an "actor" (which collects trajectories) or a "worker" (which handles both experience collection and learning). Parallelization in this part of RL is the most actively researched and benchmarked. In the "actor" setting, multiple (distributed) actors collect data while interacting with the environment and send it to a buffer (a replay buffer if using off-policy algorithms). Actors can act synchronously, i.e. waiting for all actors to finish before proceeding to the next stage, as in Batched A2C[4] and PPO[5], or asynchronously, as in IMPALA[6] and Ape-X[7]. To update weights, the actors synchronously or asynchronously retrieve the latest weights from the learner before each round of experience collection. In the "worker" setting, multiple (distributed) workers collect data and learn locally, then either asynchronously send local gradients to a parameter server that updates the global parameters (as in A3C[8]) or reduce the gradients across workers to perform a synchronous update (as in DD-PPO[9] and MPI-based multiprocessing[10, 11]).
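
The toy sketch below mimics the asynchronous "actor" setup: several actor processes push transitions into a central queue while a learner consumes them and bumps a shared weight version. Everything here (the queue standing in for the buffer, the counter standing in for weights) is a simplified stand-in for illustration, not any framework's actual implementation; in the "worker" setting you would instead average or all-reduce gradients across processes before applying an update.

```python
import multiprocessing as mp
import random


def actor(actor_id, buffer_queue, weight_version):
    """Collect (fake) transitions under the current weights and push them asynchronously."""
    for step in range(100):
        version = weight_version.value                 # "pull the latest weights" (just a counter here)
        transition = (actor_id, step, version, random.random())
        buffer_queue.put(transition)                   # no waiting for other actors


def learner(buffer_queue, weight_version, batch_size=32, updates=10):
    """Consume transitions in batches, pretend to update, and publish new weights."""
    for _ in range(updates):
        batch = [buffer_queue.get() for _ in range(batch_size)]
        # ... compute gradients from `batch` and update parameters here ...
        with weight_version.get_lock():
            weight_version.value += 1                  # new weights become visible to the actors


if __name__ == "__main__":
    buffer_queue = mp.Queue()
    weight_version = mp.Value("i", 0)
    actors = [mp.Process(target=actor, args=(i, buffer_queue, weight_version)) for i in range(4)]
    learner_proc = mp.Process(target=learner, args=(buffer_queue, weight_version))
    for p in actors + [learner_proc]:
        p.start()
    for p in actors + [learner_proc]:
        p.join()
```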

[Table: parallelization modes supported by RLlib, rlpyt, Baselines, and Tianshou, across env steps (sync), env steps (async), batch obs, actor steps (sync), actor steps (async), worker steps (sync), and worker steps (async)]

As far as I have searched, RLlib is the framework that supports the most parallelization modes (maybe that is why it is so complicated). The table above gives a clearer comparison of several RL frameworks. In particular, RLlib provides a great scaling guide that is worth reading even if you are using other frameworks.
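
For reference, the scaling knobs discussed above roughly map onto RLlib's configuration as sketched below. The key names follow the RLlib API around the time of writing and may have changed since, so treat this as an illustrative sketch rather than definitive usage.

```python
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",
        "num_workers": 4,            # parallel rollout workers collecting experience (actor steps)
        "num_envs_per_worker": 8,    # vectorized envs inside each worker (env steps / batch obs)
    },
    stop={"timesteps_total": 100_000},
)
```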