Releases: DLR-RM/stable-baselines3
Bug fixes, better image support and last release before v1.0
Breaking Changes:
- `evaluate_policy` now returns rewards/episode lengths from a `Monitor` wrapper if one is present,
  which makes it possible to return the unnormalized reward, in the case of Atari games for instance
  (a usage sketch follows the code example below).
- Renamed `common.vec_env.is_wrapped` to `common.vec_env.is_vecenv_wrapped` to avoid confusion
  with the new `is_wrapped()` helper
- Renamed `_get_data()` to `_get_constructor_parameters()` for policies (this affects independent saving/loading of policies)
- Removed `n_episodes_rollout` and merged it with `train_freq`, which now accepts a tuple `(frequency, unit)`:
- `replay_buffer` in `collect_rollout` is no longer optional
```python
from stable_baselines3 import SAC

# `env` is any Gym environment created beforehand
# SB3 < 0.11.0
# model = SAC("MlpPolicy", env, n_episodes_rollout=1, train_freq=-1)
# SB3 >= 0.11.0:
model = SAC("MlpPolicy", env, train_freq=(1, "episode"))
```
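A minimal usage sketch (not part of the release notes) of `evaluate_policy` picking up episode statistics from a `Monitor` wrapper; the `Pendulum-v0` env id and the number of evaluation episodes are placeholders.

```python
import gym

from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env = Monitor(gym.make("Pendulum-v0"))
model = SAC("MlpPolicy", env, train_freq=(1, "episode"))
# Rewards and episode lengths now come from the Monitor wrapper when one is present
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
```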
New Features:
- Add support for `VecFrameStack` to stack on first or last observation dimension, along with
  automatic check for image spaces.
- `VecFrameStack` now has a `channels_order` argument to tell if observations should be stacked
  on the first or last observation dimension (originally always stacked on last); see the sketch after this list.
- Added `common.env_util.is_wrapped` and `common.env_util.unwrap_wrapper` functions for checking/unwrapping
  an environment for specific wrapper.
- Added `env_is_wrapped()` method for `VecEnv` to check if its environments are wrapped
  with given Gym wrappers.
- Added `monitor_kwargs` parameter to `make_vec_env` and `make_atari_env`
- Wrap the environments automatically with a `Monitor` wrapper when possible.
- `EvalCallback` now logs the success rate when available (`is_success` must be present in the info dict)
- Added new wrappers to log images and matplotlib figures to tensorboard. (@zampanteymedio)
- Add support for text records to `Logger`. (@lorenz-h)
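A minimal sketch (not from the release notes) of the new `channels_order` argument and the `env_is_wrapped()` helper; the `CartPole-v1` env id and the stack size are placeholders.

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import VecFrameStack

# make_vec_env wraps each environment with a Monitor wrapper when possible
vec_env = make_vec_env("CartPole-v1", n_envs=4)
# channels_order="first" would stack image observations on the channel-first axis;
# "last" reproduces the previous behaviour
stacked_env = VecFrameStack(vec_env, n_stack=4, channels_order="last")
# Check whether the underlying envs are wrapped with a given Gym wrapper
print(vec_env.env_is_wrapped(Monitor))
```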
Bug Fixes:
- Fixed bug where code added `VecTranspose` on channel-first image environments (thanks @qxcv)
- Fixed `DQN` predict method when using single `gym.Env` with `deterministic=False`
- Fixed a bug where the argument order of `explained_variance()` in `ppo.py` and `a2c.py` was incorrect (@thisray)
- Fixed bug where full `HerReplayBuffer` leads to an index error. (@megan-klaiber)
- Fixed bug where replay buffer could not be saved if it was too big (> 4 Gb) for python<3.8 (thanks @hn2)
- Added informative `PPO` construction error in edge-case scenario where `n_steps * n_envs = 1` (size of rollout buffer),
  which otherwise causes downstream breaking errors in training (@decodyng)
- Fixed discrete observation space support when using multiple envs with A2C/PPO (thanks @ardabbour)
- Fixed a bug for TD3 delayed update (the update was off-by-one and not delayed when `train_freq=1`)
- Fixed numpy warning (replaced `np.bool` with `bool`)
- Fixed a bug where `VecNormalize` was not normalizing the terminal observation
- Fixed a bug where `VecTranspose` was not transposing the terminal observation
- Fixed a bug where the terminal observation stored in the replay buffer was not the right one for off-policy algorithms
- Fixed a bug where `action_noise` was not used when using `HER` (thanks @ShangqunYu)
- Fixed a bug where `train_freq` was not properly converted when loading a saved model
Others:
- Add more issue templates
- Add signatures to callable type annotations (@ernestum)
- Improve error message in `NatureCNN`
- Added checks for supported action spaces to improve clarity of error messages for the user
- Renamed variables in the `train()` method of `SAC`, `TD3` and `DQN` to match SB3-Contrib.
- Updated docker base image to Ubuntu 18.04
- Set tensorboard min version to 2.2.0 (earlier versions are apparently not working with PyTorch)
- Added warning for `PPO` when `n_steps * n_envs` is not a multiple of `batch_size` (last mini-batch truncated) (@decodyng)
- Removed some warnings in the tests
Documentation:
- Updated algorithm table
- Minor docstring improvements regarding rollout (@stheid)
- Fix migration doc for `A2C` (epsilon parameter)
- Fix `clip_range` docstring
- Fix duplicated parameter in `EvalCallback` docstring (thanks @tfederico)
- Added example of learning rate schedule
- Added SUMO-RL as example project (@LucasAlegre)
- Fix docstring of classes in atari_wrappers.py which were inside the constructor (@LucasAlegre)
- Added SB3-Contrib page
- Fix bug in the example code of DQN (@AptX395)
- Add example on how to access the tensorboard summary writer directly. (@lorenz-h)
- Updated migration guide
- Updated custom policy doc (separate policy architecture recommended)
- Added a note about OpenCV headless version
- Corrected typo on documentation (@mschweizer)
- Provide the environment when loading the model in the examples (@lorepieri8)
HER with online and offline sampling, bug fixes for features extraction
Breaking Changes
- Warning: Renamed `common.cmd_util` to `common.env_util` for clarity (affects `make_vec_env` and `make_atari_env` functions)
New Features
- Allow custom actor/critic network architectures using `net_arch=dict(qf=[400, 300], pi=[64, 64])`
  for off-policy algorithms (SAC, TD3, DDPG); see the sketch after this list.
- Added Hindsight Experience Replay `HER`. (@megan-klaiber)
- `VecNormalize` now supports `gym.spaces.Dict` observation spaces
- Support logging videos to Tensorboard (@SwamyDev)
- Added `share_features_extractor` argument to `SAC` and `TD3` policies
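A minimal sketch (not part of the release notes) of the new off-policy `net_arch` dict format; the `Pendulum-v0` env id and layer sizes are placeholders.

```python
from stable_baselines3 import SAC

# Separate layer sizes for the critic (qf) and the actor (pi)
model = SAC(
    "MlpPolicy",
    "Pendulum-v0",
    policy_kwargs=dict(net_arch=dict(qf=[400, 300], pi=[64, 64])),
)
```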
Bug Fixes
- Fix GAE computation for on-policy algorithms (off-by-one for the last value) (thanks @Wovchena)
- Fixed potential issue when loading a different environment
- Fix ignoring the exclude parameter when recording logs using json, csv or log as logging format (@SwamyDev)
- Make `make_vec_env` support the `env_kwargs` argument when using an env ID str (@ManifoldFR)
- Fix model creation initializing CUDA even when `device="cpu"` is provided
- Fix `check_env` not checking if the env has a Dict action space before calling `_check_nan` (@wmmc88)
- Update the check for spaces unsupported by Stable Baselines 3 to include checks on the action space (@wmmc88)
- Fixed feature extractor bug for target network where the same net was shared instead of being separate.
  This bug affects `SAC`, `DDPG` and `TD3` when using `CnnPolicy` (or custom feature extractor)
- Fixed a bug when passing an environment when loading a saved model with a `CnnPolicy`: the passed env was not wrapped properly
  (the bug was introduced when implementing `HER` so it should not be present in previous versions)
Others
- Improved typing coverage
- Improved error messages for unsupported spaces
- Added `.vscode` to the gitignore
Documentation
Bug fixes, get/set parameters and improved docs
Breaking Changes:
- Removed `device` keyword argument of policies; use `policy.to(device)` instead. (@qxcv)
- Renamed `BaseClass.get_torch_variables` -> `BaseClass._get_torch_save_params` and
  `BaseClass.excluded_save_params` -> `BaseClass._excluded_save_params`
- Renamed saved items `tensors` to `pytorch_variables` for clarity
- `make_atari_env`, `make_vec_env` and `set_random_seed` must be imported with (and not directly from `stable_baselines3.common`):
```python
from stable_baselines3.common.cmd_util import make_atari_env, make_vec_env
from stable_baselines3.common.utils import set_random_seed
```
New Features:
- Added `unwrap_vec_wrapper()` to `common.vec_env` to extract `VecEnvWrapper` if needed
- Added `StopTrainingOnMaxEpisodes` to callback collection (@xicocaio)
- Added `device` keyword argument to `BaseAlgorithm.load()` (@liorcohen5)
- Callbacks have access to rollout collection locals as in SB2. (@partiallytyped)
- Added `get_parameters` and `set_parameters` for accessing/setting parameters of the agent; see the sketch after this list.
- Added actor/critic loss logging for TD3. (@mloo3)
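A minimal sketch (not from the release notes) combining several of the additions above: the `StopTrainingOnMaxEpisodes` callback, the new `get_parameters`/`set_parameters` helpers, and the `device` argument of `load()`; the env id, episode budget and file name are placeholders.

```python
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

model = A2C("MlpPolicy", "CartPole-v1")
# Stop training after a fixed number of episodes (counted over all envs)
callback = StopTrainingOnMaxEpisodes(max_episodes=10, verbose=1)
model.learn(total_timesteps=10_000, callback=callback)

# Copy the agent's parameters into a dict and load them back
params = model.get_parameters()
model.set_parameters(params)

# The device can now be chosen when loading a saved model
model.save("a2c_cartpole")
loaded_model = A2C.load("a2c_cartpole", device="cpu")
```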
Bug Fixes:
- Fixed a bug where the environment was reset twice when using `evaluate_policy`
- Fix logging of `clip_fraction` in PPO (@diditforlulz273)
- Fixed a bug where cuda support was wrongly checked when passing the GPU index, e.g., `device="cuda:0"` (@liorcohen5)
- Fixed a bug when the random seed was not properly set on cuda when passing the GPU index
Others:
- Improve typing coverage of the `VecEnv`
- Fix type annotation of `make_vec_env` (@ManifoldFR)
- Removed `AlreadySteppingError` and `NotSteppingError` that were not used
- Fixed typos in SAC and TD3
- Reorganized functions for clarity in `BaseClass` (save/load functions close to each other, private functions at top)
- Clarified docstrings on what is saved and loaded to/from files
- Simplified `save_to_zip_file` function by removing duplicate code
- Store library version along with the saved models
- DQN loss is now logged
Documentation:
- Added `StopTrainingOnMaxEpisodes` details and example (@xicocaio)
- Updated custom policy section (added custom feature extractor example)
- Re-enable `sphinx_autodoc_typehints`
- Updated doc style for type hints and removed duplicated type hints
Added DQN and DDPG, bug fixes and performance matching for Atari games
Breaking Changes:
- `AtariWrapper` and other Atari wrappers were updated to match SB2 ones
- `save_replay_buffer` now receives as argument the file path instead of the folder path (@tirafesi)
- Refactored `Critic` class for `TD3` and `SAC`, it is now called `ContinuousCritic`
  and has an additional parameter `n_critics`
- `SAC` and `TD3` now accept an arbitrary number of critics (e.g. `policy_kwargs=dict(n_critics=3)`)
  instead of only 2 previously; see the sketch after this list.
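A minimal sketch (not from the release notes) of the new `n_critics` parameter; the `Pendulum-v0` env id is a placeholder.

```python
from stable_baselines3 import SAC

# Use an ensemble of 3 critics instead of the default 2
model = SAC("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(n_critics=3))
```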
New Features:
- Added `DQN` algorithm (@Artemis-Skade)
- Buffer dtype is now set according to action and observation spaces for `ReplayBuffer`
- Added warning when allocation of a buffer may exceed the available memory of the system
  when `psutil` is available
- Saving models now automatically creates the necessary folders and raises appropriate warnings (@partiallytyped)
- Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@partiallytyped)
- Added `DDPG` algorithm as a special case of `TD3` (see the sketch after this list).
- Introduced `BaseModel` abstract parent for `BasePolicy`, which critics inherit from.
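A minimal sketch (not from the release notes) instantiating the two newly added algorithms; the env ids and the save path are placeholders.

```python
from stable_baselines3 import DDPG, DQN

dqn_model = DQN("MlpPolicy", "CartPole-v1", verbose=1)
ddpg_model = DDPG("MlpPolicy", "Pendulum-v0", verbose=1)

# Saving now creates missing folders automatically
dqn_model.save("logs/dqn_cartpole")
```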
Bug Fixes:
- Fixed a bug in the `close()` method of `SubprocVecEnv`, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
- Fix target for updating q values in SAC: the entropy term was not conditioned by terminal states
- Use `cloudpickle.load` instead of `pickle.load` in `CloudpickleWrapper`. (@shwang)
- Fixed a bug with orthogonal initialization when `bias=False` in custom policy (@rk37)
- Fixed approximate entropy calculation in PPO and A2C. (@AndyShih12)
- Fixed DQN target network sharing feature extractor with the main network.
- Fixed storing correct `dones` in on-policy algorithm rollout collection. (@AndyShih12)
- Fixed number of filters in final convolutional layer in NatureCNN to match original implementation.
Others:
- Refactored off-policy algorithms to share the same `.learn()` method
- Split the `collect_rollout()` method for off-policy algorithms
- Added `_on_step()` for off-policy base class
- Optimized replay buffer size by removing the need of `next_observations` numpy array
- Optimized polyak updates (1.5-1.95 speedup) through inplace operations (@partiallytyped)
- Switch to `black` codestyle and added `make format`, `make check-codestyle` and `commit-checks`
- Ignored errors from newer pytype version
- Added a check when using `gSDE`
- Removed codacy dependency from Dockerfile
- Added `common.sb2_compat.RMSpropTFLike` optimizer, which corresponds closer to the implementation of
  RMSprop from Tensorflow; see the sketch after this list.
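A minimal sketch (not from the release notes) of plugging the `RMSpropTFLike` optimizer into a policy via `policy_kwargs`; the env id, the `eps` value and the exact import path shown are assumptions for illustration.

```python
from stable_baselines3 import A2C
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike

# Use the TensorFlow-like RMSprop variant instead of PyTorch's default RMSprop
model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5)),
)
```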
Documentation:
- Updated notebook links
- Fixed a typo in the section of Enjoy a Trained Agent, in RL Baselines3 Zoo README. (@blurLake)
- Added Unity reacher to the projects page (@koulakis)
- Added PyBullet colab notebook
- Fixed typo in PPO example code (@joeljosephjin)
- Fixed typo in custom policy doc (@RaphaelWag)
Hotfix for PPO/A2C + gSDE, internal refactoring and bug fixes
Breaking Changes:
- `render()` method of `VecEnvs` now only accepts one argument: `mode`
- Created new file common/torch_layers.py, similar to SB refactoring
  - Contains all PyTorch network layer definitions and feature extractors: `MlpExtractor`, `create_mlp`, `NatureCNN`
- Renamed `BaseRLModel` to `BaseAlgorithm` (along with offpolicy and onpolicy variants)
- Moved on-policy and off-policy base algorithms to `common/on_policy_algorithm.py` and `common/off_policy_algorithm.py`, respectively.
- Moved `PPOPolicy` to `ActorCriticPolicy` in common/policies.py
- Moved `PPO` (algorithm class) into `OnPolicyAlgorithm` (`common/on_policy_algorithm.py`), to be shared with A2C
- Moved following functions from `BaseAlgorithm`:
  - `_load_from_file` to `load_from_zip_file` (save_util.py)
  - `_save_to_file_zip` to `save_to_zip_file` (save_util.py)
  - `safe_mean` to `safe_mean` (utils.py)
  - `check_env` to `check_for_correct_spaces` (utils.py. Renamed to avoid confusion with environment checker tools)
- Moved static function `_is_vectorized_observation` from common/policies.py to common/utils.py under name `is_vectorized_observation`.
- Removed `{save,load}_running_average` functions of `VecNormalize` in favor of `load/save`.
- Removed `use_gae` parameter from `RolloutBuffer.compute_returns_and_advantage`.
Bug Fixes:
- Fixed `render()` method for `VecEnvs`
- Fixed `seed()` method for `SubprocVecEnv`
- Fixed loading on GPU for testing when using gSDE and `deterministic=False`
- Fixed `register_policy` to allow re-registering same policy for same sub-class (i.e. assign same value to same key).
- Fixed a bug where the gradient was passed when using `gSDE` with `PPO`/`A2C`; this does not affect `SAC`
Others:
- Re-enable unsafe `fork` start method in the tests (was causing a deadlock with tensorflow)
- Added a test for seeding `SubprocVecEnv` and rendering
- Fixed reference in NatureCNN (pointed to older version with different network architecture)
- Fixed comments saying "CxWxH" instead of "CxHxW" (same style as in torch docs / commonly used)
- Added some further comments on register/getting policies ("MlpPolicy", "CnnPolicy").
- Renamed `progress` (value from 1 at the start of training to 0 at the end) to `progress_remaining`;
  see the sketch after this list.
- Added `policies.py` files for A2C/PPO, which define MlpPolicy/CnnPolicy (renamed ActorCriticPolicies).
- Added some missing tests for `VecNormalize`, `VecCheckNan` and `PPO`.
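A minimal sketch (not from the release notes) of a callable learning-rate schedule, which receives the renamed `progress_remaining` value; the env id and base learning rate are placeholders.

```python
from stable_baselines3 import PPO

def linear_schedule(progress_remaining: float) -> float:
    # progress_remaining goes from 1 (start of training) to 0 (end of training)
    return 3e-4 * progress_remaining

model = PPO("MlpPolicy", "CartPole-v1", learning_rate=linear_schedule)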
Documentation:
- Added a paragraph on "MlpPolicy"/"CnnPolicy" and policy naming scheme under "Developer Guide"
- Fixed second-level listing in changelog
Tensorboard support, refactored logger
Breaking Changes:
- Remove State-Dependent Exploration (SDE) support for `TD3`
- Methods were renamed in the logger (see the sketch after this list):
  - `logkv` -> `record`, `writekvs` -> `write`,
  - `writeseq` -> `write_sequence`, `logkvs` -> `record_dict`,
  - `dumpkvs` -> `dump`, `getkvs` -> `get_log_dict`,
  - `logkv_mean` -> `record_mean`,
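A minimal sketch (not from the release notes, and assuming the module-level logger helpers of this era) of the renamed logging calls:

```python
from stable_baselines3.common import logger

# Formerly logger.logkv(...) and logger.dumpkvs()
logger.record("train/custom_metric", 42)
logger.dump(step=0)
```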
New Features:
- Added env checker (Sync with Stable Baselines)
- Added `VecCheckNan` and `VecVideoRecorder` (Sync with Stable Baselines)
- Added determinism tests
- Added `cmd_util` and `atari_wrappers`
- Added support for `MultiDiscrete` and `MultiBinary` observation spaces (@rolandgvc)
- Added `MultiCategorical` and `Bernoulli` distributions for PPO/A2C (@rolandgvc)
- Added support for logging to tensorboard (@rolandgvc); see the sketch after this list.
- Added `VectorizedActionNoise` for continuous vectorized environments (@partiallytyped)
- Log evaluation in the `EvalCallback` using the logger
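A minimal sketch (not from the release notes) of the new tensorboard logging support; the env id and log directory are placeholders.

```python
from stable_baselines3 import A2C

# Logs are written under the given directory and can be viewed with:
#   tensorboard --logdir ./a2c_tensorboard/
model = A2C("MlpPolicy", "CartPole-v1", tensorboard_log="./a2c_tensorboard/")
model.learn(total_timesteps=10_000)
```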
Bug Fixes:
- Fixed a bug that prevented a model trained on cpu from being loaded on gpu
- Fixed version number that had a new line included
- Fixed weird seg fault in docker image due to FakeImageEnv by reducing screen size
- Fixed `sde_sample_freq` that was not taken into account for SAC
- Pass logger module to `BaseCallback`, otherwise callbacks cannot write to the one used by the algorithms
Others:
- Renamed to Stable-Baselines3
- Added Dockerfile
- Sync `VecEnvs` with Stable-Baselines
- Update requirement: `gym>=0.17`
- Added `.readthedoc.yml` file
- Added `flake8` and `make lint` command
- Added Github workflow
- Added warning when passing both `train_freq` and `n_episodes_rollout` to Off-Policy Algorithms
Documentation:
- Added most documentation (adapted from Stable-Baselines)
- Added link to CONTRIBUTING.md in the README (@kinalmehta)
- Added gSDE project and updated docstrings accordingly
- Fix `TD3` example code block