Clips actions to large limits before applying them to the environment #984
Slightly related issue: #673
source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_rl_env_cfg.py
source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env_cfg.py
Should we also change the gym.spaces range to these bounds?
In my opinion, the RL libraries should take care of this, not Isaac Lab.
@Toni-SM I agree. We should move this to the environment wrappers (similar to what we do for RL-Games). As for the action/obs space design of the environments, I think it is better handled as its own separate change. The fix in this MR is still critical for continuous learning tasks, because otherwise users get NaNs from the simulation due to the policy feedback loop: large actions enter the observations, which leads to even larger action predictions, which eventually makes the simulation unstable. So I'd prefer that we don't block this fix itself.
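To make the wrapper idea concrete, here is a minimal sketch of clipping at the wrapper level. It is not the RL-Games wrapper from the repository: `ClipActionWrapper` and `EchoEnv` are hypothetical names, and plain Python lists stand in for the batched action tensors a real environment would use.

```python
from typing import Sequence


class ClipActionWrapper:
    """Hypothetical wrapper that clamps actions before forwarding them to the wrapped env."""

    def __init__(self, env, low: float, high: float):
        if low > high:
            raise ValueError("low must not exceed high")
        self.env = env
        self.low = low
        self.high = high

    def step(self, actions: Sequence[float]):
        # Clamp every action component into [low, high] before the env sees it.
        clipped = [min(max(a, self.low), self.high) for a in actions]
        return self.env.step(clipped)


class EchoEnv:
    """Stand-in environment: returns the applied actions as the next observation."""

    def step(self, actions):
        return list(actions)


env = ClipActionWrapper(EchoEnv(), low=-100.0, high=100.0)
obs = env.step([0.5, 250.0, -1e6])
print(obs)  # [0.5, 100.0, -100.0]
```

Keeping the clamp in a wrapper leaves the environment itself untouched, so each RL library integration can choose its own bounds.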
@@ -127,3 +127,9 @@ class DirectRLEnvCfg:

    Please refer to the :class:`omni.isaac.lab.utils.noise.NoiseModel` class for more details.
    """

    action_bounds: list[float] = [-100, 100]
Just curious: where does [-100, 100] come from? I wonder if it would be best to leave this user-specified.
The limit of 100 comes from our internal codebase, where it was carried over from legged gym.
I considered making it None or Inf by default, but then users would need to set this value consciously, and I think most people who run into training stability issues will not think of it.
We could set it to None and add a FAQ entry to the docs?
I think we can set it to -inf, inf.
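A quick illustration of what that default would mean, as a sketch only: the `clip` helper and the two bound tuples are hypothetical and not part of the PR. With infinite bounds the clamp is a no-op, so clipping becomes opt-in.

```python
import math


def clip(value: float, bounds: tuple[float, float]) -> float:
    """Clamp a scalar action into the closed interval given by bounds."""
    low, high = bounds
    return min(max(value, low), high)


unbounded = (-math.inf, math.inf)  # proposed default: leaves actions unchanged
legacy = (-100.0, 100.0)  # the limits originally proposed in this PR

print(clip(1e9, unbounded))  # 1000000000.0
print(clip(1e9, legacy))  # 100.0
```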
…_env_cfg.py Co-authored-by: Mayank Mittal <[email protected]> Signed-off-by: renezurbruegg <[email protected]>
…ased_env_cfg.py Co-authored-by: Mayank Mittal <[email protected]> Signed-off-by: renezurbruegg <[email protected]>
@renezurbruegg Would you be able to help move the changes to the wrappers?
This will impose the "arbitrary" bounds of [-100, 100] on any user who picks up this PR, which could lead to unexpected behaviour. How should this be addressed? In my opinion there are three options:
I personally prefer option (3).
Please note that the current implementation conflicts with #1117 for the direct workflow.
Can these changes be integrated directly into #1117, then?
@renezurbruegg, as I commented previously, in my opinion the RL libraries should take care of this, not Isaac Lab. For example, with skrl you can set a model parameter for this. However, if the target library cannot take care of it, a solution could be the option number 3 you mentioned (which will not prevent the training from throwing an exception after a certain time of execution), or clipping the action directly in the task implementation for critical cases.
Description
Currently, the actions from the policy are applied directly to the environment and are also often fed back to the policy, with the last action used as an observation.
This can lead to instability during training, since applying a large action can start a self-reinforcing feedback loop.
More specifically, applying a very large action leads to a large last_action observation, which often results in a large error in the critic, which in turn can cause even larger actions to be sampled in the future.
This PR aims to fix this by clipping the actions to (large) hard limits before applying them to the environment. This prevents the actions from growing without bound and, in my case, greatly improves training stability.
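The effect can be reproduced with a toy model, sketched below under strong simplifying assumptions: a scalar action, and a stand-in "policy" that simply multiplies the last_action observation by a fixed gain. The `rollout` function and its parameters are hypothetical and are not code from this PR.

```python
from typing import Optional


def rollout(gain: float, steps: int, bound: Optional[float] = None) -> float:
    """Iterate a toy loop in which the next action is `gain` times the last action."""
    last_action = 1.0
    for _ in range(steps):
        # Toy "policy": reacts to the last_action observation by amplifying it.
        action = gain * last_action
        if bound is not None:
            # Hard clip before the action is applied to the environment.
            action = min(max(action, -bound), bound)
        # The applied action is fed back as the next observation.
        last_action = action
    return last_action


print(rollout(gain=10.0, steps=400))  # inf
print(rollout(gain=10.0, steps=400, bound=100.0))  # 100.0
```

Without the clip the value overflows to infinity, which is the kind of magnitude that eventually produces NaNs in a simulator; with the clip it saturates at the bound.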
Type of change
TODO
Checklist
pre-commit checks with ./isaaclab.sh --format
config/extension.toml file
CONTRIBUTORS.md or my name already exists there