PPO discrete action for gym.Env #81

pavelxx1 · 2023-06-15T05:57:33Z

pavelxx1
Jun 15, 2023

Hi @Toni-SM I'm new in RL
but , do u have any example of code for PPO discrete action space?
Thx

Toni-SM · 2023-06-15T08:00:20Z

Toni-SM
Jun 15, 2023
Maintainer

Sure. The key idea is to replace the gaussian-based policy (for continuous action spaces) by a categorical-based policy (for discrete action spaces).

In the .zip file you will find two examples: for the OpenAI Gym and Farama Gymnasium environment interfaces.
ppo_cartpole_examples.zip

Note that maximum possible total reward varies between the different CartPole environment versions:

CartPole-v0: 200
CartPole-v1: 500

0 replies

pavelxx1 · 2023-06-15T11:09:17Z

pavelxx1
Jun 15, 2023
Author

Thx a lot! I will use your code as starting point of my research
And can u give example of code for agent deterministic testing after success training

0 replies

pavelxx1 · 2023-06-15T12:15:28Z

pavelxx1
Jun 15, 2023
Author

I wrote some eval code but is this right way?

import gym

import torch.nn as nn
import torch.nn.functional as F

# Import the skrl components to build the RL system
from skrl.models.torch import Model, CategoricalMixin
from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
from skrl.trainers.torch import SequentialTrainer
from skrl.envs.torch import wrap_env



class Policy(CategoricalMixin, Model):
    def __init__(self, observation_space, action_space, device, unnormalized_log_prob=True):
        Model.__init__(self, observation_space, action_space, device)
        CategoricalMixin.__init__(self, unnormalized_log_prob)

        self.linear_layer_1 = nn.Linear(self.num_observations, 64)
        self.linear_layer_2 = nn.Linear(64, 64)
        self.output_layer = nn.Linear(64, self.num_actions)

    def compute(self, inputs, role):
        x = F.relu(self.linear_layer_1(inputs["states"]))
        x = F.relu(self.linear_layer_2(x))
        return self.output_layer(x), {}


env = wrap_env(TestEnv())
device = env.device


models_ppo = {}
models_ppo["policy"] = Policy(env.observation_space, env.action_space, device)

cfg_ppo = PPO_DEFAULT_CONFIG.copy()
cfg_ppo["random_timesteps"] = 0  
cfg_ppo["experiment"]["checkpoint_interval"] = 0

agent_ppo = PPO(models=models_ppo,
                memory=None,
                cfg=cfg_ppo,
                observation_space=env.observation_space,
                action_space=env.action_space,
                device=device)

agent_ppo.load("./rl-ckpt/23-06-15_14-52-30-759890_PPO/checkpoints/agent_100000.pt")

# Configure and instantiate the RL trainer
cfg_trainer = {"timesteps": 1000, "headless": True, "disable_progressbar": True}
trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=agent_ppo)

# start training
trainer.eval()

-----------------------------------------------------------------
[skrl:INFO] Environment class: gym.core.Env
[skrl:INFO] Environment wrapper: Gym
[skrl:WARNING] Cannot load the value module. The agent doesn't have such an instance
[skrl:WARNING] Cannot load the optimizer module. The agent doesn't have such an instance
[skrl:WARNING] Cannot load the state_preprocessor module. The agent doesn't have such an instance
[skrl:WARNING] Cannot load the value_preprocessor module. The agent doesn't have such an instance_

0 replies

Toni-SM · 2023-06-15T14:16:44Z

Toni-SM
Jun 15, 2023
Maintainer

Hi @pavelxx1

Yes, the code for evaluation looks good.
The warnings are related to some components (Value, optimizer and preprocesors) that are only required during training. So, just ignore them.

One think, I typically load the agent checkpoint after instantiate the trainer, to make sure agent initialization (that occur when the trainer is instantiated) is done.

agent_ppo = PPO(models=models_ppo,
                memory=None,
                cfg=cfg_ppo,
                observation_space=env.observation_space,
                action_space=env.action_space,
                device=device)


# Configure and instantiate the RL trainer
cfg_trainer = {"timesteps": 1000, "headless": True, "disable_progressbar": True}
trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=agent_ppo)

# load checkpoint
agent_ppo.load("./rl-ckpt/23-06-15_14-52-30-759890_PPO/checkpoints/agent_100000.pt")

# start evaluation
trainer.eval()

0 replies

Toni-SM · 2023-06-15T16:29:44Z

Toni-SM
Jun 15, 2023
Maintainer

Hi @pavelxx1

Looking at your previous question:

And can u give example of code for agent deterministic testing after success training

Categorical agents will act stochastic even during evaluation.
For Gaussian models: it is possible that they act deterministic, a modification of one line in the code is necessary.

5 replies

pavelxx1 Jun 16, 2023
Author

@Toni-SM , you mean I need change CategoricalMixin to GaussianMixin in Policy class?

pavelxx1 Jun 19, 2023
Author

@Toni-SM , can u answer? thx

Toni-SM Jun 19, 2023
Maintainer

Hi @pavelxx1

For Gaussian models to act deterministically (during evaluation) it is necessary to change the return of the agent's act method. This can be achieved by adjusting a line.

But better, to avoid modifying the source code, you can overwrite the .act(...) method in the user-defined model (in main script) to skip the Gaussian distribution step and return directly the deterministic value (mean_actions).

For example:

import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin


# define the model
class MLP(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device,
                 clip_actions=False, clip_log_std=True, min_log_std=-20, max_log_std=2, reduction="sum"):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std, reduction)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, 32),
                                 nn.ReLU(),
                                 nn.Linear(32, self.num_actions),
                                 nn.Tanh())
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def act(self, inputs, role):
        mean_actions, log_std, outputs = self.compute(inputs, role)
        return mean_actions, None, outputs

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std_parameter, {}

pavelxx1 Jun 22, 2023
Author

@Toni-SM is this valid code for deterministic actions? thx

class Policy(CategoricalMixin, Model):
    def __init__(self, observation_space, action_space, device, unnormalized_log_prob=True):
        Model.__init__(self, observation_space, action_space, device)
        CategoricalMixin.__init__(self, unnormalized_log_prob)

        self.linear_layer_1 = nn.Linear(self.num_observations, 128)
        self.linear_layer_2 = nn.Linear(128, 64)
        self.linear_layer_3 = nn.Linear(64, 32)
        self.output_layer = nn.Linear(32, self.num_actions)

    def act(self, inputs, role):
        mean_actions, outputs = self.compute(inputs, role)
        argmax_actions = mean_actions.argmax(dim=-1)
        return argmax_actions, None, outputs      

    def compute(self, inputs, role):
        x = F.relu(self.linear_layer_1(inputs["states"]))
        x = F.relu(self.linear_layer_2(x))
        x = F.relu(self.linear_layer_3(x))
        return self.output_layer(x), {}

Toni-SM Jun 23, 2023
Maintainer

Hi @pavelxx1

Yes, for categorical models it is a valid way to generate deterministic actions. Also, it is necessary to unsqueeze the action.
A recommendation would be to rename the variable to avoid confusion.

def act(self, inputs, role):
    logits, outputs = self.compute(inputs, role)
    actions = logits.argmax(dim=-1).unsqueeze(-1)
    return actions, None, outputs

pavelxx1 · 2023-06-19T13:10:08Z

pavelxx1
Jun 19, 2023
Author

You maked my day , thx 🙏 😊

0 replies

pavelxx1 · 2023-07-04T22:44:07Z

pavelxx1
Jul 4, 2023
Author

@Toni-SM , Hi
I have an issue when set cfg_ppo["entropy_loss_scale"] = 0.05 when use CategoricalMixin

\venv\lib\site-packages\skrl\agents\torch\ppo\ppo.py in _update(self, timestep, timesteps)
    460                 # compute entropy loss
    461                 if self._entropy_loss_scale:
--> 462                     entropy_loss = -self._entropy_loss_scale * self.policy.get_entropy(role="policy").mean()
    463                 else:
    464                     entropy_loss = 0

\venv\lib\site-packages\torch\nn\modules\module.py in __getattr__(self, name)
   1612             if name in modules:
   1613                 return modules[name]
-> 1614         raise AttributeError("'{}' object has no attribute '{}'".format(
   1615             type(self).__name__, name))
   1616 

AttributeError: 'Policy' object has no attribute 'get_entropy'

3 replies

Toni-SM Jul 5, 2023
Maintainer

Hi @pavelxx1

That issue is already fixed for the next release.

For the time being you can work around this problem by adding the following method to the Categorical mixin model (in skrl/models/torch/categorical.py file)

    def get_entropy(self, role: str = "") -> torch.Tensor:
        """Compute and return the entropy of the model

        :return: Entropy of the model
        :rtype: torch.Tensor
        :param role: Role play by the model (default: ``""``)
        :type role: str, optional

        Example::

            >>> entropy = model.get_entropy()
            >>> print(entropy.shape)
            torch.Size([4096, 1])
        """
        distribution = self._c_distribution[role] if role in self._c_distribution else self._c_distribution[""]
        if distribution is None:
            return torch.tensor(0.0, device=self.device)
        return distribution.entropy().to(self.device)

pavelxx1 Jul 11, 2023
Author

That issue is already fixed for the next release.
@Toni-SM when can we expect a new skrl-release?

Toni-SM Jul 12, 2023
Maintainer

Hopefully by the end of next week.
Next release will be a major release (with JAX support, multi-agent and new documentation) :)

PPO discrete action for gym.Env #81

Uh oh!

Uh oh!

pavelxx1 Jun 15, 2023

Replies: 7 comments · 8 replies

Uh oh!

Uh oh!

Toni-SM Jun 15, 2023 Maintainer

Uh oh!

Uh oh!

pavelxx1 Jun 15, 2023 Author

Uh oh!

Uh oh!

pavelxx1 Jun 15, 2023 Author

Uh oh!

Uh oh!

Toni-SM Jun 15, 2023 Maintainer

Uh oh!

Uh oh!

Toni-SM Jun 15, 2023 Maintainer

Uh oh!

Uh oh!

pavelxx1 Jun 16, 2023 Author

Uh oh!

pavelxx1 Jun 19, 2023 Author

Uh oh!

Toni-SM Jun 19, 2023 Maintainer

Uh oh!

Uh oh!

pavelxx1 Jun 22, 2023 Author

Uh oh!

Toni-SM Jun 23, 2023 Maintainer

Uh oh!

pavelxx1 Jun 19, 2023 Author

Uh oh!

Uh oh!

pavelxx1 Jul 4, 2023 Author

Uh oh!

Toni-SM Jul 5, 2023 Maintainer

Uh oh!

pavelxx1 Jul 11, 2023 Author

Uh oh!

Toni-SM Jul 12, 2023 Maintainer

pavelxx1
Jun 15, 2023

Replies: 7 comments 8 replies

Toni-SM
Jun 15, 2023
Maintainer

pavelxx1
Jun 15, 2023
Author

pavelxx1
Jun 15, 2023
Author

Toni-SM
Jun 15, 2023
Maintainer

Toni-SM
Jun 15, 2023
Maintainer

pavelxx1 Jun 16, 2023
Author

pavelxx1 Jun 19, 2023
Author

Toni-SM Jun 19, 2023
Maintainer

pavelxx1 Jun 22, 2023
Author

Toni-SM Jun 23, 2023
Maintainer

pavelxx1
Jun 19, 2023
Author

pavelxx1
Jul 4, 2023
Author

Toni-SM Jul 5, 2023
Maintainer

pavelxx1 Jul 11, 2023
Author

Toni-SM Jul 12, 2023
Maintainer