Implementation matters in Reinforcement Learning (RL), and so does parallelization.
Almost every RL framework, from simple to complicated, includes code for parallelization. However, it is often unclear which parts of RL are parallelized and which are not until you dig into the source code. In this post, I will discuss common parallelization practices in different parts of RL. Hopefully this post can serve as a summary and offer some help when you choose an RL framework.
[Figure: several Env instances wrapped by a VecEnv, which returns a BatchObs]
First, let us talk about parallelizing environment steps.
This is perhaps the easiest kind of parallelization to implement, and it is independent of the policy or algorithm.
Usually it is done by implementing vectorized environments, where each environment lives in a separate process.
Speedup is gained when the environment is slow to step/reset; otherwise it may be more efficient to simply step the environments in a for-loop.
Note that synchronization is needed in order to return a batch of observations.
This makes it a poor fit for environments with heterogeneous workloads (where different environments can take orders of magnitude more or less time to simulate), since synchronization will cause delays due to stragglers.
Asynchronous parallelization avoids such delays, but the batch size of the returned observations is no longer fixed.
A timeout argument can be used to trade off between asynchrony and batch size.
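To make this concrete, here is a minimal sketch of a synchronous vectorized environment built directly on multiprocessing pipes. Everything in it (ToyEnv, _worker, SubprocVecEnv) is a made-up name for illustration rather than any framework's actual implementation; real vectorized environments add seeding, auto-reset, shared-memory observation buffers, and error handling. An asynchronous variant would poll the pipes with a timeout and return results only from the environments that have finished.

```python
import multiprocessing as mp

import numpy as np


class ToyEnv:
    """Trivial stand-in environment; any object with reset()/step() would do."""

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.random.randn(4).astype(np.float32)
        reward, done = 1.0, self.t >= 10
        return obs, reward, done, {}


def _worker(pipe, env_fn):
    """Child process loop: executes commands received over the pipe."""
    env = env_fn()
    while True:
        cmd, data = pipe.recv()
        if cmd == "reset":
            pipe.send(env.reset())
        elif cmd == "step":
            pipe.send(env.step(data))
        elif cmd == "close":
            pipe.close()
            break


class SubprocVecEnv:
    """Synchronous vectorized env: one process per env, stepped in lockstep."""

    def __init__(self, env_fns):
        self.pipes, self.procs = [], []
        for fn in env_fns:
            parent, child = mp.Pipe()
            proc = mp.Process(target=_worker, args=(child, fn), daemon=True)
            proc.start()
            self.pipes.append(parent)
            self.procs.append(proc)

    def reset(self):
        for pipe in self.pipes:
            pipe.send(("reset", None))
        return np.stack([pipe.recv() for pipe in self.pipes])

    def step(self, actions):
        for pipe, action in zip(self.pipes, actions):
            pipe.send(("step", action))
        # Waiting on every pipe is the synchronization point: the slowest env
        # (the straggler) determines how long this call takes.
        obs, rewards, dones, infos = zip(*[pipe.recv() for pipe in self.pipes])
        return np.stack(obs), np.array(rewards), np.array(dones), list(infos)

    def close(self):
        for pipe in self.pipes:
            pipe.send(("close", None))
        for proc in self.procs:
            proc.join()


if __name__ == "__main__":
    venv = SubprocVecEnv([ToyEnv for _ in range(4)])
    batch_obs = venv.reset()                        # shape (4, 4): one row per env
    batch_obs, rew, done, _ = venv.step(np.zeros(4))
    venv.close()
```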
[Figure: individual Obs combined into a BatchObs and fed to the Network]
One reason to use vectorized environments is that we can feed a batch of observations to the neural network at once, which gives a speedup. This kind of parallelization is often implemented implicitly (inside the code that parallelizes environment steps or agent steps), though in my opinion it is clearer to treat it separately.
First, batching observations applies to both the actor and the learner, while parallelizing environments focuses on experience collection. Second, how you batch observations does not depend on how the environments are parallelized.
For example, you can call env.step() sequentially and still batch the collected observations together, or combine several BatchObs into a larger batch.
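As a small illustration (the two-layer policy and the observation shape are placeholders, not taken from any framework), observations collected one at a time can simply be stacked and pushed through the network in a single forward pass:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical policy network mapping a 4-dim observation to 2 action logits.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

# Observations gathered one by one, e.g. from sequential env.step() calls.
observations = [np.random.randn(4).astype(np.float32) for _ in range(8)]

# Stack them into one batch and run a single forward pass instead of eight.
batch_obs = torch.as_tensor(np.stack(observations))                    # shape (8, 4)
with torch.no_grad():
    logits = policy(batch_obs)                                         # shape (8, 2)
    actions = torch.distributions.Categorical(logits=logits).sample()  # shape (8,)
```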
[Figure: multiple Agents exchanging data with a Buffer / Replay / Parameter Server]
Next, let us look at parallelizing agents. Here "agent" is a general word that may refer to an "actor" (which collects trajectories) or a "worker" (which handles both experience collection and learning). Parallelization in this part of RL is the most actively researched and benchmarked. In the "actor" context, multiple (distributed) actors collect data while interacting with the environment and send it to a buffer (a replay buffer in the case of off-policy algorithms). Actors can act synchronously, i.e. waiting for all actors to finish before proceeding to the next stage, as in Batched A2C and PPO, or asynchronously, as in IMPALA and Ape-X. To update weights, the actors synchronously/asynchronously retrieve the latest weights from the learner before each round of experience collection.
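A rough sketch of the actor pattern is given below, with a multiprocessing queue standing in for the buffer. The actor body (fake environment, random policy) and all names are hypothetical, and weight synchronization with the learner is only hinted at in a comment.

```python
import multiprocessing as mp
import queue

import numpy as np


def actor(actor_id, transition_queue, num_steps=100):
    """Hypothetical actor: interacts with its own env copy and asynchronously
    pushes transitions into a shared queue feeding the (replay) buffer."""
    obs = np.zeros(4, dtype=np.float32)                    # stand-in for env.reset()
    for _ in range(num_steps):
        action = np.random.randint(2)                      # stand-in for policy(obs)
        next_obs = np.random.randn(4).astype(np.float32)   # stand-in for env.step()
        reward, done = 1.0, False
        transition_queue.put((actor_id, obs, action, reward, next_obs, done))
        obs = next_obs
        # A real actor would also periodically pull the latest weights
        # from the learner here.


if __name__ == "__main__":
    transition_queue = mp.Queue()
    actors = [mp.Process(target=actor, args=(i, transition_queue)) for i in range(4)]
    for proc in actors:
        proc.start()

    # Learner side: drain transitions into a buffer; updating the policy and
    # broadcasting new weights back to the actors is omitted.
    buffer = []
    while len(buffer) < 4 * 100:
        try:
            buffer.append(transition_queue.get(timeout=1.0))
        except queue.Empty:
            break

    for proc in actors:
        proc.join()
```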
In the "worker" context, multiple (distributed) workers collect data and learn locally, then either asynchronously send their local gradients to a parameter server that updates the global parameters (as in A3C), or reduce the gradients across workers to perform a synchronous update (as in DD-PPO and MPI-based multiprocessing).
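For the worker pattern with a synchronous update, the sketch below is a hypothetical, stripped-down version of DD-PPO-style gradient averaging (not any framework's actual code): each worker computes a gradient on its local batch, all-reduces it with torch.distributed, and then applies the same averaged update.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def worker(rank, world_size):
    """Hypothetical worker: local gradient + all-reduce = synchronous update."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)                 # identical initial weights on every worker
    policy = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

    # Each worker computes a loss on its own (here: randomly different) local batch.
    torch.manual_seed(1234 + rank)
    local_obs = torch.randn(32, 4)
    loss = policy(local_obs).pow(2).mean()
    loss.backward()

    # Average gradients across workers, then every worker takes the same step.
    for param in policy.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)   # two workers on one machine
```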
[Table: parallelization modes supported by RLlib, rlpyt, Baselines, and Tianshou, compared across env steps (sync), env steps (async), batch obs, actor steps (sync), actor steps (async), worker steps (sync), and worker steps (async)]
As far as I have searched, RLlib is the framework that supports the most parallelization modes (maybe that is why it is so complicated). The table above gives a clearer comparison between several RL frameworks. In particular, RLlib provides a great scaling guide that is worth reading even if you are using other frameworks.