minhphd committed on
Commit
ce3feed
·
verified ·
1 Parent(s): 46c27fd

Upload 30 files

Browse files
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ gif/pacman_imagine.gif filter=lfs diff=lfs merge=lfs -text
37
+ gif/pacman.gif filter=lfs diff=lfs merge=lfs -text
38
+ gif/quadruped.gif filter=lfs diff=lfs merge=lfs -text
39
+ gif/walker_imagine.gif filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,75 @@
1
+ # PyDreamerV1: A clean PyTorch implementation of Hafner et al.'s Dreamer
2
+ <div align="center">
3
+ <img src="./gif/boxing.gif" alt="Actual run in " width="200px" height="200px"/>
4
+ <img src="./gif/quadruped.gif" alt="Actual run in " width="200px" height="200px"/>
5
+ <img src="./gif/walker.gif" alt="Actual run in " width="200px" height="200px"/>
6
+ </div>
7
+ <div align="center">
8
+ <img src="./gif/boxing_imagine.gif" alt="Imagination in " width="200px" height="200px"/>
9
+ <img src="./gif/quadruped_imagine.gif" alt="Imagination in " width="200px" height="200px"/>
10
+ <img src="./gif/walker_imagine.gif" alt="Imagination in " width="200px" height="200px"/>
11
+ </div>
12
+
13
+
14
+
15
+ This repository offers a comprehensive implementation of the Dreamer algorithm, as presented in the groundbreaking work by Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination." Our implementation is dedicated to faithfully reproducing the innovative approach of learning and planning within a learned latent space, enabling agents to efficiently master complex behaviors through imagination alone.
16
+
17
+ ## Why Dreamer?
18
+
19
+ Dreamer stands at the forefront of model-based reinforcement learning by introducing an efficient method for learning behaviors directly from high-dimensional sensory inputs. It leverages a latent dynamics model to 'imagine' future states and rewards, enabling it to plan and execute actions that maximize long-term rewards purely from simulated experience. This approach significantly improves sample efficiency over traditional model-free methods and opens new avenues for learning complex and nuanced behaviors in simulated environments. However, the official code is widely regarded as complex and difficult to understand, and only a handful of Dreamer reimplementations have been able to reproduce its results.
20
+
21
+ ## Implementation Highlights
22
+
23
+ - **Modular Design**: My implementation of the Recurrent State Space Model (RSSM) is broken down into cleanly separated modules for the transition, representation, and recurrent models (see the minimal sketch after this list). This not only facilitates a deeper understanding of the underlying mechanics but also allows for easy customization and extension.
24
+
25
+ - **True to the Source**: By closely adhering to the methodology detailed in the original DreamerV1 paper, the code captures the essence of latent-space learning and imagination-driven planning. From the incorporation of exploration noise to the TD(λ) return calculation, every element is designed to replicate the paper's results as closely as possible. The hyperparameters are identical to those reported in the paper.
26
+
27
+ - **Detailed Training Insights**: The training loop is organized to mirror the paper's outline, and comprehensive comments documenting hidden implementation details accompany the code, serving as a valuable resource for both learning and further research.
28
+
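+ The snippet below is a minimal sketch (not part of the training loop) of how the RSSM modules in `utils/models.py` compose during a single latent step. The sizes follow the DM Control configs in `configs/dm_control/`; the action size of 6 is only an illustrative placeholder.
+
+ ```python
+ import torch
+ import utils.models as models
+
+ stoch, deter, hidden, embed, action_size = 30, 200, 300, 1024, 6
+
+ rssm = models.RSSM(stoch, embed, deter, hidden, action_size)
+ encoder = models.ConvEncoder(input_shape=(3, 64, 64))
+
+ obs = torch.zeros((1, 3, 64, 64))        # one 64x64 RGB observation
+ posterior = torch.zeros((1, stoch))      # stochastic state s_{t-1}
+ deterministic = torch.zeros((1, deter))  # deterministic state h_{t-1}
+ action = torch.zeros((1, action_size))   # previous action a_{t-1}
+
+ deterministic = rssm.recurrent(posterior, action, deterministic)         # recurrent model: h_t
+ prior_dist, prior = rssm.transition(deterministic)                       # transition model: imagined s_t
+ post_dist, posterior = rssm.representation(encoder(obs), deterministic)  # representation model: observed s_t
+ ```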
29
+ ## Getting Started
30
+
31
+ 1. **Clone the Repository**: Get the code by cloning this repository to your local machine.
32
+ ```
33
+ git clone https://github.com/minhphd/PyDreamerV1
34
+ ```
35
+
36
+ 2. **Install Dependencies**: Ensure you have all necessary dependencies by running:
37
+ ```
38
+ pip3 install -r requirements.txt
39
+ ```
40
+
41
+ 3. **Run the Training**: Kickstart the training process with a simple command:
42
+ ```
43
+ python main.py --config <Path to config file>
44
+ ```
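+
+ For example, using one of the configs included in this repository:
+ ```
+ python main.py --config configs/dm_control/Walker.yml
+ ```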
45
+
46
+ 4. **Visualize Results**: Utilize TensorBoard to observe training progress and visualize the agent's performance in real time. Weights & Biases (wandb) is also supported: set `enable` to `True` under the `wandb` section of the config file and fill in your account information.
47
+ ```
48
+ tensorboard --logdir=runs
49
+ ```
50
+ **Optional: Visualize imagined sequences**: Use the saved models to visualize the agent's predictions of the environment dynamics. Create a `config` folder inside the run's logging directory and copy the training config file into it (see the expected layout below).
51
+ ```
52
+ python imagine.py --runpath <Path to run directory>
53
+ ```
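+
+ For reference, a run's logging directory is expected to look roughly like this (the models are saved automatically at checkpoints; the `config` folder is the one you create by hand):
+ ```
+ runs/<experiment_run>/
+ ├── config/
+ │   └── <training config>.yml
+ └── models/
+     ├── rssm_model
+     ├── encoder
+     ├── decoder
+     ├── actor
+     └── critic
+ ```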
54
+
55
+ ## Citation
56
+ This implementation was made possible thanks to these papers.
57
+ ```bibtex
58
+ @article{hafner2019dream,
59
+ title={Dream to Control: Learning Behaviors by Latent Imagination},
60
+ author={Hafner, Danijar and Lillicrap, Timothy and Norouzi, Mohammad and Ba, Jimmy},
61
+ journal={arXiv preprint arXiv:1912.01603},
62
+ year={2019}
63
+ }
64
+ @misc{1801.00690,
65
+ title = {DeepMind Control Suite},
66
+ author = {Yuval Tassa and Yotam Doron and Alistair Muldal and Tom Erez and Yazhe Li and Diego de Las Casas and David Budden and Abbas Abdolmaleki and Josh Merel and Andrew Lefrancq and Timothy Lillicrap and Martin Riedmiller},
67
+ journal = {arXiv preprint arXiv:1801.00690},
68
+ year = {2018},
69
+ }
70
+
71
+ ```
72
+
73
+ ## Contributions
74
+
75
+ Contributions are welcome! Whether it's extending functionality, improving efficiency, or correcting bugs, your input helps make this project better for everyone.
algos/.DS_Store ADDED
Binary file (6.15 kB). View file
 
algos/dreamer.py ADDED
@@ -0,0 +1,489 @@
1
+ """
2
+ Author: Minh Pham-Dinh
3
+ Created: Jan 27th, 2024
4
+ Last Modified: Feb 10th, 2024
5
6
+
7
+ Description:
8
+ main Dreamer file.
9
+
10
+ The implementation is based on:
11
+ Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination," 2019.
12
+ [Online]. Available: https://arxiv.org/abs/1912.01603
13
+ """
14
+
15
+ # Standard Library Imports
16
+ import os
17
+ import numpy as np
18
+ import yaml
19
+ from tqdm import tqdm
20
+ import wandb
21
+
22
+ # Machine Learning and Data Processing Imports
23
+ import torch
24
+ import torch.nn as nn
25
+ import torch.optim as optim
26
+
27
+ # Custom Utility Imports
28
+ import utils.models as models
29
+ from utils.buffer import ReplayBuffer
30
+ from utils.utils import td_lambda, log_metrics
31
+
32
+
33
+ class Dreamer:
34
+ def __init__(self, config, logpath, env, writer = None, wandb_writer=None):
35
+ self.config = config
36
+ self.device = torch.device(self.config.device)
37
+ self.env = env
38
+ self.obs_size = env.observation_space.shape
39
+ self.action_size = env.action_space.n if self.config.env.discrete else env.action_space.shape[0]
40
+ self.epsilon = self.config.main.epsilon_start
41
+ self.env_step = 0
42
+ self.logpath = logpath
43
+
44
+ # Set random seed for reproducibility
45
+ np.random.seed(self.config.seed)
46
+ torch.manual_seed(self.config.seed)
47
+
48
+ #dynamic networks initialized
49
+ self.rssm = models.RSSM(self.config.main.stochastic_size,
50
+ self.config.main.embedded_obs_size,
51
+ self.config.main.deterministic_size,
52
+ self.config.main.hidden_units,
53
+ self.action_size).to(self.device)
54
+
55
+ self.reward = models.RewardNet(self.config.main.stochastic_size + self.config.main.deterministic_size,
56
+ self.config.main.hidden_units).to(self.device)
57
+
58
+ if self.config.main.continue_loss:
59
+ self.cont_net = models.ContinuoNet(self.config.main.stochastic_size + self.config.main.deterministic_size,
60
+ self.config.main.hidden_units).to(self.device)
61
+
62
+ self.encoder = models.ConvEncoder(input_shape=self.obs_size).to(self.device)
63
+ self.decoder = models.ConvDecoder(self.config.main.stochastic_size,
64
+ self.config.main.deterministic_size,
65
+ out_shape=self.obs_size).to(self.device)
66
+ self.dyna_parameters = (
67
+ list(self.rssm.parameters())
68
+ + list(self.reward.parameters())
69
+ + list(self.encoder.parameters())
70
+ + list(self.decoder.parameters())
71
+ )
72
+
73
+ if self.config.main.continue_loss:
74
+ self.dyna_parameters += list(self.cont_net.parameters())
75
+
76
+ #behavior networks initialized
77
+ self.actor = models.Actor(self.config.main.stochastic_size + self.config.main.deterministic_size,
78
+ self.config.main.hidden_units,
79
+ self.action_size,
80
+ self.config.env.discrete).to(self.device)
81
+ self.critic = models.Critic(self.config.main.stochastic_size + self.config.main.deterministic_size,
82
+ self.config.main.hidden_units).to(self.device)
83
+
84
+ #optimizers
85
+ self.dyna_optimizer = optim.Adam(self.dyna_parameters, lr=self.config.main.dyna_model_lr)
86
+ self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=self.config.main.actor_lr)
87
+ self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=self.config.main.critic_lr)
88
+ self.gradient_step = 0
89
+
90
+ #buffer
91
+ self.buffer = ReplayBuffer(self.config.main.buffer_capacity, self.obs_size, (self.action_size, ))
92
+
93
+ #tracking stuff
94
+ self.wandb_writer = wandb_writer
95
+ self.writer = writer
96
+
97
+
98
+ def update_epsilon(self):
99
+ """In use for decaying epsilon in discrete env
100
+
101
+ Returns:
102
+ _type_: _description_
103
+ """
104
+ eps_start = self.config.main.epsilon_start
105
+ eps_end = self.config.main.epsilon_end
106
+ decay_steps = self.config.main.eps_decay_steps
107
+ decay_rate = (eps_start - eps_end) / (decay_steps)
108
+ self.epsilon = max(eps_end, eps_start - decay_rate*self.gradient_step)
109
+
110
+
111
+ def train(self):
112
+ """main training loop, implementation follow closely with the loop from the official paper
113
+
114
+ Returns:
115
+ _type_: _description_
116
+ """
117
+
118
+ #prefill dataset
119
+ ep = 0
120
+ obs, _ = self.env.reset()
121
+ while ep < self.config.main.data_init_ep:
122
+ action = self.env.action_space.sample()
123
+ if self.config.env.discrete:
124
+ actions = np.zeros((self.action_size, ))
125
+ actions[action] = 1.0
126
+ else:
127
+ actions = action
128
+
129
+ next_obs, reward, termination, truncation, info = self.env.step(action)
130
+
131
+ self.buffer.add(obs, actions, reward, termination or truncation)
132
+ obs = next_obs
133
+
134
+ if "episode" in info:
135
+ obs, _ = self.env.reset()
136
+ ep += 1
137
+ print(ep)
138
+ if 'video_path' in info and self.wandb_writer:
139
+ self.wandb_writer.log({'performance/videos': wandb.Video(info['video_path'], format='webm')})
140
+
141
+ #main train loop
142
+ for _ in tqdm(range(self.config.main.total_iter)):
143
+ #save model if reached checkpoint
144
+ if _ % self.config.main.save_freq == 0:
145
+
146
+ #check if models folder exist
147
+ directory = self.logpath + 'models/'
148
+ os.makedirs(directory, exist_ok=True)
149
+
150
+ #save models
151
+ torch.save(self.rssm, self.logpath + 'models/rssm_model')
152
+ torch.save(self.encoder, self.logpath + 'models/encoder')
153
+ torch.save(self.decoder, self.logpath + 'models/decoder')
154
+ torch.save(self.actor, self.logpath + 'models/actor')
155
+ torch.save(self.critic, self.logpath + 'models/critic')
156
+
157
+ #run eval if reach eval checkpoint
158
+ if _ % self.config.main.eval_freq == 0:
159
+ eval_score = self.data_collection(self.config.main.eval_eps, eval=True)
160
+ metrics = {'performance/evaluation score': eval_score}
161
+ log_metrics(metrics, self.env_step, self.writer, self.wandb_writer)
162
+
163
+ #training step
164
+ for c in tqdm(range(self.config.main.collect_iter)):
165
+ #draw data
166
+ batch = self.buffer.sample(self.config.main.batch_size, self.config.main.seq_len, self.device)
167
+
168
+ #dynamic learning
169
+ post, deter = self.dynamic_learning(batch)
170
+
171
+ #behavioral learning
172
+ self.behavioral_learning(post, deter)
173
+
174
+ #update step
175
+ self.gradient_step += 1
176
+ self.update_epsilon()
177
+
178
+ # collect more data with exploration noise
179
+ self.data_collection(self.config.main.data_interact_ep)
180
+
181
+
182
+
183
+ def dynamic_learning(self, batch):
184
+ """Learning the dynamic model. In this method, we sequentially pass data in the RSSM to
185
+ learn the model
186
+
187
+ Args:
188
+ batch (addict.Dict): batches of data
189
+ """
190
+
191
+ '''
192
+ We unpack the batch. A batch contains:
193
+ - b_obs (batch_size, seq_len, *obs.shape): batches of observation
194
+ - b_a (batch_size, seq_len, 1): batches of action
195
+ - b_r (batch_size, seq_len, 1): batches of rewards
196
+ - b_d (batch_size, seq_len, 1): batches of termination signal
197
+ '''
198
+ b_obs = batch.obs
199
+ b_a = batch.actions
200
+ b_r = batch.rewards
201
+ b_d = batch.dones
202
+
203
+ batch_size, seq_len, _ = b_r.shape
204
+ eb_obs = self.encoder(b_obs)
205
+
206
+ #initialized stochastic states (posterior) and deterministic states to first pass into the recurrent model
207
+ posterior = torch.zeros((batch_size, self.config.main.stochastic_size)).to(self.device)
208
+ deterministic = torch.zeros((batch_size, self.config.main.deterministic_size)).to(self.device)
209
+
210
+ #initialized memory storing of sequential gradients data
211
+ posteriors = torch.zeros((batch_size, seq_len-1, self.config.main.stochastic_size)).to(self.device)
212
+ priors = torch.zeros((batch_size, seq_len-1, self.config.main.stochastic_size)).to(self.device)
213
+ deterministics = torch.zeros((batch_size, seq_len-1, self.config.main.deterministic_size)).to(self.device)
214
+
215
+ posterior_means = torch.zeros_like(posteriors).to(self.device)
216
+ posterior_stds = torch.zeros_like(posteriors).to(self.device)
217
+ prior_means = torch.zeros_like(priors).to(self.device)
218
+ prior_stds = torch.zeros_like(priors).to(self.device)
219
+
220
+ #start passing data through the dynamic model
221
+ for t in (range(1, seq_len)):
222
+ deterministic = self.rssm.recurrent(posterior, b_a[:, t-1, :], deterministic)
223
+ prior_dist, prior = self.rssm.transition(deterministic)
224
+
225
+ #detail: observation is shifted 1 timestep ahead (the action is associated with the next state)
226
+ posterior_dist, posterior = self.rssm.representation(eb_obs[:, t, :], deterministic)
227
+
228
+ '''
229
+ store recurrent data
230
+ data are shifted 1 timestep ahead. Start from the second timestep or t=1
231
+ '''
232
+ posteriors[:, t-1, :] = posterior
233
+ posterior_means[:, t-1, :] = posterior_dist.mean
234
+ posterior_stds[:, t-1, :] = posterior_dist.scale
235
+
236
+ priors[:, t-1, :] = prior
237
+ prior_means[:, t-1, :] = prior_dist.mean
238
+ prior_stds[:, t-1, :] = prior_dist.scale
239
+
240
+ deterministics[:, t-1, :] = deterministic
241
+
242
+ #we start optimizing model with the provided data
243
+
244
+ '''
245
+ Reconstruction loss. This loss helps the model learn to encode pixel observations.
246
+ '''
247
+ mps_flatten = False
248
+ if self.device == torch.device("mps"):
249
+ mps_flatten = True
250
+
251
+ reconstruct_dist = self.decoder(posteriors, deterministics, mps_flatten)
252
+ target = b_obs[:, 1:]
253
+ if mps_flatten:
254
+ target = target.reshape(-1, *self.obs_size)
255
+ reconstruct_loss = reconstruct_dist.log_prob(target).mean()
256
+
257
+ #reward loss
258
+ rewards = self.reward(posteriors, deterministics)
259
+ rewards_dist = torch.distributions.Normal(rewards, 1)
260
+ rewards_dist = torch.distributions.Independent(rewards_dist, 1)
261
+ rewards_loss = rewards_dist.log_prob(b_r[:, 1:]).mean()
262
+
263
+ '''
264
+ Continuity loss. This loss term helps predict the probability of an episode terminating at a particular state
265
+ '''
266
+ if self.config.main.continue_loss:
267
+ # calculate the log prob manually via binary cross-entropy with logits, since Bernoulli log_prob does not handle non-binary float targets
268
+ # this follows Hafner's official Dreamer code closely
269
+ cont_logits, _ = self.cont_net(posteriors, deterministics)
270
+ cont_target = (1 - b_d[:, 1:]) * self.config.main.discount
271
+ continue_loss = torch.nn.functional.binary_cross_entropy_with_logits(cont_logits, cont_target)
272
+ else:
273
+ continue_loss = torch.zeros((1)).to(self.device)
274
+
275
+ '''
276
+ KL loss. Matches the distributions of the transition and representation models. This ensures an accurate transition model for use in the imagination process
277
+ '''
278
+ priors_dist = torch.distributions.Independent(
279
+ torch.distributions.Normal(prior_means, prior_stds), 1
280
+ )
281
+ posteriors_dist = torch.distributions.Independent(
282
+ torch.distributions.Normal(posterior_means, posterior_stds), 1
283
+ )
284
+ kl_loss = torch.max(
285
+ torch.mean(torch.distributions.kl.kl_divergence(posteriors_dist, priors_dist)),
286
+ torch.tensor(self.config.main.free_nats).to(self.device)
287
+ )
288
+
289
+ total_loss = self.config.main.kl_divergence_scale * kl_loss - reconstruct_loss - rewards_loss + continue_loss
290
+
291
+ self.dyna_optimizer.zero_grad()
292
+ total_loss.backward()
293
+ nn.utils.clip_grad_norm_(
294
+ self.dyna_parameters,
295
+ self.config.main.clip_grad,
296
+ norm_type=self.config.main.grad_norm_type,
297
+ )
298
+ self.dyna_optimizer.step()
299
+
300
+ #tensorboard logging
301
+ metrics = {
302
+ 'Dynamic_model/KL': kl_loss.item(),
303
+ 'Dynamic_model/Reconstruction': reconstruct_loss.item(),
304
+ 'Dynamic_model/Reward': rewards_loss.item(),
305
+ 'Dynamic_model/Continue': continue_loss.item(),
306
+ 'Dynamic_model/Total': total_loss.item()
307
+ }
308
+
309
+ log_metrics(metrics, self.gradient_step, self.writer, self.wandb_writer)
310
+
311
+ return posteriors.detach(), deterministics.detach()
312
+
313
+
314
+ def behavioral_learning(self, state, deterministics):
315
+ """Learning behavioral through latent imagination
316
+
317
+ Args:
318
+ self (_type_): _description_
319
+ state (batch_size, seq_len-1, stoch_state_size): starting point state
320
+ deterministics (batch_size, seq_len-1, stoch_state_size)
321
+ """
322
+
323
+ #flatten the batches --> new size (batch_size * (seq_len-1), *)
324
+ state = state.reshape(-1, self.config.main.stochastic_size)
325
+ deterministics = deterministics.reshape(-1, self.config.main.deterministic_size)
326
+
327
+ batch_size, stochastic_size = state.shape
328
+ _, deterministics_size = deterministics.shape
329
+
330
+ #initialized trajectories
331
+ state_trajectories = torch.zeros((batch_size, self.config.main.horizon, stochastic_size)).to(self.device)
332
+ deterministics_trajectories = torch.zeros((batch_size, self.config.main.horizon, deterministics_size)).to(self.device)
333
+
334
+ #imagine trajectories
335
+ for t in range(self.config.main.horizon):
336
+ # do not include the starting state
337
+ action = self.actor(state, deterministics)
338
+ deterministics = self.rssm.recurrent(state, action, deterministics)
339
+ _, state = self.rssm.transition(deterministics)
340
+ state_trajectories[:, t, :] = state
341
+ deterministics_trajectories[:, t, :] = deterministics
342
+
343
+ '''
344
+ After imagining, we have both the state trajectories and deterministic trajectories, which can be used to create latent states.
345
+ - state_trajectories (N, HORIZON_LEN)
346
+ - deterministic_trajectories (N, HORIZON_LEN)
347
+ '''
348
+
349
+ #actor update
350
+
351
+ #compute rewards for each trajectory
352
+ rewards = self.reward(state_trajectories, deterministics_trajectories)
353
+ rewards_dist = torch.distributions.Normal(rewards, 1)
354
+ rewards_dist = torch.distributions.Independent(rewards_dist, 1)
355
+ rewards = rewards_dist.mode
356
+
357
+ if self.config.main.continue_loss:
358
+ _, conts_dist = self.cont_net(state_trajectories, deterministics_trajectories)
359
+ continues = conts_dist.mean
360
+ else:
361
+ continues = self.config.main.discount * torch.ones_like(rewards)
362
+
363
+ values = self.critic(state_trajectories, deterministics_trajectories).mode
364
+
365
+ #calculate trajectories returns
366
+ #returns should have shape (N, HORIZON_LEN - 1, 1) (last values are ignored due to nature of bootstrapping)
367
+ returns = td_lambda(
368
+ rewards,
369
+ continues,
370
+ values,
371
+ self.config.main.lambda_,
372
+ self.device
373
+ )
374
+
375
+ #cumulative product for discount
376
+ discount = torch.cumprod(torch.cat((
377
+ torch.ones_like(continues[:, :1]).to(self.device),
378
+ continues[:, :-2]
379
+ ), 1), 1).detach()
380
+
381
+ # actor optimizing
382
+ actor_loss = -(discount * returns).mean()
383
+
384
+ self.actor_optimizer.zero_grad()
385
+ actor_loss.backward()
386
+ nn.utils.clip_grad_norm_(
387
+ self.actor.parameters(),
388
+ self.config.main.clip_grad,
389
+ norm_type=self.config.main.grad_norm_type,
390
+ )
391
+ self.actor_optimizer.step()
392
+
393
+
394
+ # critic optimizing
395
+ values_dist = self.critic(state_trajectories[:, :-1].detach(), deterministics_trajectories[:, :-1].detach())
396
+
397
+ critic_loss = -(discount.squeeze() * values_dist.log_prob(returns.detach())).mean()
398
+
399
+ self.critic_optimizer.zero_grad()
400
+ critic_loss.backward()
401
+ nn.utils.clip_grad_norm_(
402
+ self.critic.parameters(),
403
+ self.config.main.clip_grad,
404
+ norm_type=self.config.main.grad_norm_type,
405
+ )
406
+ self.critic_optimizer.step()
407
+
408
+ metrics = {
409
+ 'Behavioral_model/Actor': actor_loss.item(),
410
+ 'Behavioral_model/Critic': critic_loss.item()
411
+ }
412
+
413
+ log_metrics(metrics, self.gradient_step, self.writer, self.wandb_writer)
414
+
415
+
416
+ @torch.no_grad()
417
+ def data_collection(self, num_episodes, eval=False):
418
+ """data collection method. Roll out agent a number of episodes and collect data
419
+ If eval=True. The agent is set for evaluation mode with no exploration noise and data collection
420
+
421
+ Args:
422
+ num_episodes (int): number of episodes
423
+ eval (bool): Evaluation mode. Defaults to False.
424
+ random (bool): Random mode. Defaults to False.
425
+
426
+ Returns:
427
+ average_score: average score over number of rollout episodes
428
+ """
429
+ score = 0
430
+ ep = 0
431
+ obs, _ = self.env.reset()
432
+ #initialized all zeros
433
+ posterior = torch.zeros((1, self.config.main.stochastic_size)).to(self.device)
434
+ deterministic = torch.zeros((1, self.config.main.deterministic_size)).to(self.device)
435
+ action = torch.zeros((1, self.action_size)).to(self.device)
436
+
437
+ while ep < num_episodes:
438
+ embed_obs = self.encoder(torch.from_numpy(obs).to(self.device, dtype=torch.float)) #(1, embed_obs_sz)
439
+ deterministic = self.rssm.recurrent(posterior, action, deterministic)
440
+ _, posterior = self.rssm.representation(embed_obs, deterministic)
441
+ actor_out = self.actor(posterior, deterministic)
442
+
443
+ #detail: add exploration noise if not in evaluation mode
444
+ if not eval:
445
+ actions = actor_out.cpu().numpy()
446
+ if self.config.env.discrete:
447
+ if np.random.rand() < self.epsilon:
448
+ action = self.env.action_space.sample()
449
+ else:
450
+ action = np.argmax(actions)
451
+ else:
452
+ mean_noise = self.config.main.mean_noise
453
+ std_noise = self.config.main.std_noise
454
+
455
+ normal_dist = torch.distributions.Normal(actor_out + mean_noise, std_noise)
456
+ sampled_action = normal_dist.sample().cpu().numpy()
457
+ actions = np.clip(sampled_action, a_min=-1, a_max=1)
458
+ action = actions[0]
459
+ else:
460
+ actions = actor_out.cpu().numpy()
461
+ if self.config.env.discrete:
462
+ action = np.argmax(actions)
463
+ else:
464
+ actions = np.clip(actions, a_min=-1, a_max=1)
465
+ action = actions[0]
466
+
467
+ next_obs, reward, termination, truncation, info = self.env.step(action)
468
+
469
+ if not eval:
470
+ self.buffer.add(obs, actions, reward, termination | truncation)
471
+ self.env_step += self.config.env.action_repeat
472
+ obs = next_obs
473
+
474
+ action = actor_out
475
+ if "episode" in info:
476
+ cur_score = info["episode"]["r"][0]
477
+ score += cur_score
478
+ obs, _ = self.env.reset()
479
+ ep += 1
480
+
481
+ if 'video_path' in info and self.wandb_writer:
482
+ self.wandb_writer.log({'performance/videos': wandb.Video(info['video_path'], format='webm')})
483
+ log_metrics({'performance/training score': cur_score}, self.env_step, self.writer, self.wandb_writer)
484
+
485
+ posterior = torch.zeros((1, self.config.main.stochastic_size)).to(self.device)
486
+ deterministic = torch.zeros((1, self.config.main.deterministic_size)).to(self.device)
487
+ action = torch.zeros((1, self.action_size)).to(self.device)
488
+
489
+ return score/num_episodes
bash/setup.sh ADDED
@@ -0,0 +1,32 @@
1
+ #!/bin/bash
2
+
3
+ # Update and Upgrade the System
4
+ echo "Updating and upgrading the system..."
5
+ sudo apt-get update -y
6
+ sudo apt-get upgrade -y
7
+
8
+ # Install dependencies for Gymnasium
9
+ echo "Installing dependencies for Gymnasium..."
10
+
11
+ # Development tools
12
+ sudo apt-get install -y build-essential
13
+
14
+ # Python 3 and pip
15
+ sudo apt-get install -y python3 python3-pip
16
+ sudo apt-get install -y python3-opencv
17
+
18
+ # System libraries
19
+ sudo apt-get install -y libglew-dev libjpeg-dev libboost-all-dev libglu1-mesa-dev freeglut3-dev mesa-common-dev
20
+
21
+ # SWIG for interface generation
22
+ sudo apt-get install -y swig
23
+
24
+ # Gymnasium and additional dependencies via pip
25
+ echo "Installing requirements.txt"
26
+ pip3 install -r requirements.txt
27
+ sudo apt-get install -y xvfb
28
+ Xvfb :99 -screen 0 1024x768x24 &
29
+
30
+ export DISPLAY=:99
31
+
32
+ echo "Setup complete!"
configs/.DS_Store ADDED
Binary file (6.15 kB). View file
 
configs/dm_control/Cart-pole.yml ADDED
@@ -0,0 +1,64 @@
1
+ device: "cuda"
2
+ experiment_name: Cart-pole
3
+ seed: 0
4
+
5
+ env:
6
+ env_id: cartpole
7
+ task: balance
8
+ discrete: False
9
+ new_obs_size: [64, 64]
10
+ norm_obs: True
11
+
12
+ tensorboard:
13
+ enable: False
14
+ log_dir: "./runs/"
15
+ log_frequency: 1 # Log every 1000 steps
16
+
17
+ wandb:
18
+ enable: True
19
+ project: "dreamer"
20
+ entity: "phdminh01"
21
+ log_frequency: 1
22
+
23
+ video_recording:
24
+ enable: True
25
+ record_frequency: 50 #episodes
26
+
27
+ main:
28
+ continue_loss: False
29
+ continue_scale_factor: 10
30
+ total_iter: 2000
31
+ save_freq: 20
32
+ collect_iter: 100
33
+ data_interact_ep: 1
34
+ # data_init_ep: 1
35
+ data_init_ep: 5
36
+ horizon: 15
37
+ batch_size: 50
38
+ seq_len: 50
39
+ eval_eps: 3
40
+ eval_freq: 5
41
+
42
+ kl_divergence_scale : 1
43
+ free_nats : 3
44
+ discount : 0.99
45
+ lambda_ : 0.95
46
+
47
+ actor_lr : 8.0e-5
48
+ critic_lr : 8.0e-5
49
+ dyna_model_lr : 6.0e-4
50
+ grad_norm_type : 2
51
+ clip_grad : 100
52
+
53
+ hidden_units: 300
54
+ deterministic_size : 200
55
+ stochastic_size : 30
56
+ embedded_obs_size : 1024
57
+ buffer_capacity : 500000
58
+
59
+ epsilon_start: 0.4
60
+ epsilon_end: 0.1
61
+ eps_decay_steps: 200000
62
+
63
+ mean_noise: 0
64
+ std_noise: 0.3
configs/dm_control/Quadruped.yml ADDED
@@ -0,0 +1,64 @@
1
+ device: "cuda"
2
+ experiment_name: Quadruped
3
+ seed: 0
4
+
5
+ env:
6
+ env_id: quadruped
7
+ task: walk
8
+ discrete: False
9
+ new_obs_size: [64, 64]
10
+ norm_obs: True
11
+
12
+ tensorboard:
13
+ enable: False
14
+ log_dir: "./runs/"
15
+ log_frequency: 1 # Log every 1000 steps
16
+
17
+ wandb:
18
+ enable: True
19
+ project: "dreamer"
20
+ entity: "phdminh01"
21
+ log_frequency: 1
22
+
23
+ video_recording:
24
+ enable: True
25
+ record_frequency: 50 #episodes
26
+
27
+ main:
28
+ continue_loss: False
29
+ continue_scale_factor: 10
30
+ total_iter: 2000
31
+ save_freq: 20
32
+ collect_iter: 100
33
+ data_interact_ep: 1
34
+ # data_init_ep: 1
35
+ data_init_ep: 5
36
+ horizon: 15
37
+ batch_size: 50
38
+ seq_len: 50
39
+ eval_eps: 3
40
+ eval_freq: 5
41
+
42
+ kl_divergence_scale : 1
43
+ free_nats : 3
44
+ discount : 0.99
45
+ lambda_ : 0.95
46
+
47
+ actor_lr : 8.0e-5
48
+ critic_lr : 8.0e-5
49
+ dyna_model_lr : 6.0e-4
50
+ grad_norm_type : 2
51
+ clip_grad : 100
52
+
53
+ hidden_units: 300
54
+ deterministic_size : 200
55
+ stochastic_size : 30
56
+ embedded_obs_size : 1024
57
+ buffer_capacity : 500000
58
+
59
+ epsilon_start: 0.4
60
+ epsilon_end: 0.1
61
+ eps_decay_steps: 200000
62
+
63
+ mean_noise: 0
64
+ std_noise: 0.3
configs/dm_control/Walker.yml ADDED
@@ -0,0 +1,65 @@
1
+ device: "mps"
2
+ experiment_name: Walker
3
+ seed: 0
4
+
5
+ env:
6
+ env_id: walker
7
+ task: walk
8
+ new_obs_size: [64, 64]
9
+ action_repeat: 2
10
+ time_limit: 1000
11
+
12
+ tensorboard:
13
+ enable: False
14
+ log_dir: "./runs/"
15
+ log_frequency: 1 # Log every 1000 steps
16
+
17
+ wandb:
18
+ enable: False
19
+ project: "dreamer"
20
+ entity: "phdminh01"
21
+ log_frequency: 1
22
+
23
+ video_recording:
24
+ enable: False
25
+ record_frequency: 100 #episodes
26
+
27
+ main:
28
+ continue_loss: False
29
+ continue_scale_factor: 10
30
+ total_iter: 2000
31
+ save_freq: 20
32
+ collect_iter: 100
33
+ data_interact_ep: 1
34
+ # data_init_ep: 1
35
+ data_init_ep: 5
36
+ horizon: 15
37
+ batch_size: 50
38
+ seq_len: 50
39
+ eval_eps: 3
40
+ eval_freq: 5
41
+
42
+ kl_divergence_scale : 1
43
+ free_nats : 3
44
+ discount : 0.99
45
+ lambda_ : 0.95
46
+
47
+ use_continue_flag : True
48
+ actor_lr : 8.0e-5
49
+ critic_lr : 8.0e-5
50
+ dyna_model_lr : 6.0e-4
51
+ grad_norm_type : 2
52
+ clip_grad : 100
53
+
54
+ hidden_units: 300
55
+ deterministic_size : 200
56
+ stochastic_size : 30
57
+ embedded_obs_size : 1024
58
+ buffer_capacity : 500000
59
+
60
+ epsilon_start: 0.4
61
+ epsilon_end: 0.1
62
+ eps_decay_steps: 200000
63
+
64
+ mean_noise: 0
65
+ std_noise: 0.3
configs/gymnasium/Boxing-v5.yml ADDED
@@ -0,0 +1,67 @@
1
+ device: "mps"
2
+ experiment_name: Boxing-v5-new
3
+ seed: 0
4
+
5
+ env:
6
+ env_id: ALE/Boxing-v5
7
+ channel_first: True
8
+ discrete: True
9
+ resize_obs: True
10
+ new_obs_size: [64, 64]
11
+ norm_obs: True
12
+
13
+ tensorboard:
14
+ enable: True
15
+ log_dir: "./runs/"
16
+ log_frequency: 1 # Log every 1000 steps
17
+
18
+ wandb:
19
+ enable: False
20
+ project: "dreamer"
21
+ entity: "phdminh01"
22
+ log_frequency: 1
23
+
24
+ video_recording:
25
+ enable: True
26
+ record_frequency: 100 #episodes
27
+ save_path: "./runs/"
28
+
29
+ main:
30
+ continue_loss: False
31
+ continue_scale_factor: 10
32
+ total_iter: 2000
33
+ save_freq: 20
34
+ collect_iter: 100
35
+ data_interact_ep: 1
36
+ data_init_ep: 1
37
+ # data_init_ep: 5
38
+ horizon: 10
39
+ batch_size: 50
40
+ seq_len: 50
41
+ eval_eps: 3
42
+ eval_freq: 5
43
+
44
+ kl_divergence_scale : 1
45
+ free_nats : 3
46
+ discount : 0.99
47
+ lambda_ : 0.95
48
+
49
+ use_continue_flag : True
50
+ actor_lr : 8.0e-5
51
+ critic_lr : 8.0e-5
52
+ dyna_model_lr : 6.0e-4
53
+ grad_norm_type : 2
54
+ clip_grad : 100
55
+
56
+ hidden_units: 400
57
+ deterministic_size : 600
58
+ stochastic_size : 600
59
+ embedded_obs_size : 1024
60
+ buffer_capacity : 500000
61
+
62
+ epsilon_start: 0.4
63
+ epsilon_end: 0.1
64
+ eps_decay_steps: 200000
65
+
66
+ mean_noise: 0
67
+ std_noise: 0.3
configs/gymnasium/Pacman-v5.yml ADDED
@@ -0,0 +1,66 @@
1
+ device: "cuda"
2
+ experiment_name: Pacman-v5
3
+ seed: 0
4
+
5
+ env:
6
+ env_id: ALE/MsPacman-v5
7
+ channel_first: True
8
+ discrete: True
9
+ resize_obs: True
10
+ new_obs_size: [64, 64]
11
+ norm_obs: True
12
+
13
+ tensorboard:
14
+ enable: False
15
+ log_dir: "./runs/"
16
+ log_frequency: 1 # Log every 1000 steps
17
+
18
+ wandb:
19
+ enable: True
20
+ project: "dreamer"
21
+ entity: "phdminh01"
22
+ log_frequency: 1
23
+
24
+ video_recording:
25
+ enable: True
26
+ record_frequency: 100 #episodes
27
+ save_path: "./runs/"
28
+
29
+ main:
30
+ continue_loss: True
31
+ continue_scale_factor: 10
32
+ total_iter: 2000
33
+ save_freq: 20
34
+ collect_iter: 100
35
+ data_interact_ep: 1
36
+ # data_init_ep: 1
37
+ data_init_ep: 5
38
+ horizon: 15
39
+ batch_size: 50
40
+ seq_len: 50
41
+ eval_eps: 3
42
+ eval_freq: 5
43
+
44
+ kl_divergence_scale : 1
45
+ free_nats : 3
46
+ discount : 0.99
47
+ lambda_ : 0.95
48
+
49
+ actor_lr : 8.0e-5
50
+ critic_lr : 8.0e-5
51
+ dyna_model_lr : 6.0e-4
52
+ grad_norm_type : 2
53
+ clip_grad : 100
54
+
55
+ hidden_units: 400
56
+ deterministic_size : 600
57
+ stochastic_size : 600
58
+ embedded_obs_size : 1024
59
+ buffer_capacity : 500000
60
+
61
+ epsilon_start: 0.4
62
+ epsilon_end: 0.1
63
+ eps_decay_steps: 200000
64
+
65
+ mean_noise: 0
66
+ std_noise: 0.3
configs/gymnasium/ant_v4.yml ADDED
@@ -0,0 +1,65 @@
1
+ device: "mps"
2
+ experiment_name: Ant-v4
3
+ seed: 0
4
+
5
+ gymnasium:
6
+ env_id: Ant-v4
7
+ channel_first: True
8
+ pixels: True
9
+ discrete: False
10
+ resize_obs: True
11
+ new_obs_size: [64, 64]
12
+ norm_obs: True
13
+
14
+ tensorboard:
15
+ enable: False
16
+ log_dir: "./runs/"
17
+ log_frequency: 1 # Log every 1000 steps
18
+
19
+ wandb:
20
+ enable: False
21
+ project: "dreamer"
22
+ entity: "phdminh01"
23
+ log_frequency: 1
24
+
25
+ video_recording:
26
+ enable: False
27
+ record_frequency: 100 #episodes
28
+
29
+ main:
30
+ total_iter: 2000
31
+ save_freq: 100
32
+ collect_iter: 100
33
+ data_interact_ep: 1
34
+ # data_init_ep: 1
35
+ data_init_ep: 5
36
+ horizon: 15
37
+ batch_size: 50
38
+ seq_len: 50
39
+ eval_eps: 3
40
+ eval_freq: 5
41
+
42
+ kl_divergence_scale : 1
43
+ free_nats : 3
44
+ discount : 0.99
45
+ lambda_ : 0.95
46
+
47
+ use_continue_flag : True
48
+ actor_lr : 3.0e-4
49
+ critic_lr : 3.0e-4
50
+ dyna_model_lr : 6.0e-4
51
+ grad_norm_type : 2
52
+ clip_grad : 100
53
+
54
+ hidden_units: 400
55
+ deterministic_size : 600
56
+ stochastic_size : 600
57
+ embedded_obs_size : 1024
58
+ buffer_capacity : 500000
59
+
60
+ epsilon_start: 0.4
61
+ epsilon_end: 0.1
62
+ eps_decay_steps: 200000
63
+
64
+ mean_noise: 0
65
+ std_noise: 0.3
configs/gymnasium/car_racing_config.yml ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ device: "cuda"
2
+ experiment_name: CarRacing-v2
3
+ seed: 0
4
+
5
+ gymnasium:
6
+ env_id: CarRacing-v2
7
+ channel_first: True
8
+ discrete: False
9
+ resize_obs: True
10
+ new_obs_size: [64, 64]
11
+ norm_obs: True
12
+
13
+ tensorboard:
14
+ enable: True
15
+ log_dir: "./runs/"
16
+ log_frequency: 1 # Log every 1000 steps
17
+
18
+ wandb:
19
+ enable: True
20
+ project: "dreamer"
21
+ entity: "phdminh01"
22
+ log_frequency: 1
23
+
24
+ video_recording:
25
+ enable: True
26
+ record_frequency: 100 #episodes
27
+
28
+ main:
29
+ total_iter: 2000
30
+ save_freq: 100
31
+ collect_iter: 100
32
+ data_interact_ep: 1
33
+ # data_init_ep: 1
34
+ data_init_ep: 5
35
+ horizon: 15
36
+ batch_size: 50
37
+ seq_len: 50
38
+ eval_eps: 3
39
+ eval_freq: 5
40
+
41
+ kl_divergence_scale : 1
42
+ free_nats : 3
43
+ discount : 0.99
44
+ lambda_ : 0.95
45
+
46
+ use_continue_flag : True
47
+ actor_lr : 3.0e-4
48
+ critic_lr : 3.0e-4
49
+ dyna_model_lr : 6.0e-4
50
+ grad_norm_type : 2
51
+ clip_grad : 100
52
+
53
+ hidden_units: 400
54
+ deterministic_size : 600
55
+ stochastic_size : 600
56
+ embedded_obs_size : 1024
57
+ buffer_capacity : 500000
58
+
59
+ epsilon_start: 0.4
60
+ epsilon_end: 0.1
61
+ eps_decay_steps: 200000
62
+
63
+ mean_noise: 0
64
+ std_noise: 0.3
gif/boxing.gif ADDED
gif/boxing_imagine.gif ADDED
gif/pacman.gif ADDED

Git LFS Details

  • SHA256: 0d138c415095804b0e565c041226613c3d6c9a82c06d00a7a5349ca5e94d557a
  • Pointer size: 132 Bytes
  • Size of remote file: 1.6 MB
gif/pacman_imagine.gif ADDED

Git LFS Details

  • SHA256: 98462a09ec13b588685cde9f0bd69bfac4ab6d295e82d24c55e013a9e760f7a4
  • Pointer size: 132 Bytes
  • Size of remote file: 1.11 MB
gif/quadruped.gif ADDED

Git LFS Details

  • SHA256: 9d9fed8476008b30fc517c8fbbf43c795a719be6b5df439bc62f7146efb44bb9
  • Pointer size: 132 Bytes
  • Size of remote file: 1.79 MB
gif/quadruped_imagine.gif ADDED
gif/walker_imagine.gif ADDED

Git LFS Details

  • SHA256: 91c7d7e0faad5d120829e4fced2baa65d7d37367e5b05b24c77caf9d1db48c9a
  • Pointer size: 132 Bytes
  • Size of remote file: 1 MB
imagine.py ADDED
@@ -0,0 +1,88 @@
1
+ """
2
+ Author: Minh Pham-Dinh
3
+ Created: Feb 4th, 2024
4
+ Last Modified: Feb 6th, 2024
5
6
+
7
+ Description:
8
+ Imagination file. Run this file to generate dream sequences
9
+ """
10
+
11
+ import sys
12
+ import argparse
13
+ from utils.wrappers import DMCtoGymWrapper, AtariPreprocess
14
+ from addict import Dict
15
+ import yaml
16
+ import gymnasium as gym
17
+ import torch
18
+ from tqdm import tqdm
19
+ import numpy as np
20
+ import glob
21
+
22
+ parser = argparse.ArgumentParser(description='Process configuration file path.')
23
+ parser.add_argument('--runpath', type=str, help='Path to the run file.', required=True)
24
+ parser.add_argument('--horizon', type=int, help='number of imagination steps.', default=15)
25
+
26
+ # Parse the arguments
27
+ args = parser.parse_args()
28
+
29
+ # Load the configuration file specified by the command line argument
30
+ run_path = args.runpath
31
+ HORIZON = args.horizon
32
+
33
+ config_files = glob.glob(run_path + '/config/*.yml')
34
+
35
+ if len(config_files) != 1:
36
+ print('there should only be 1 config file in config directory')
37
+
38
+ with open(config_files[0], 'r') as file:
39
+ config = Dict(yaml.load(file, Loader=yaml.FullLoader))
40
+
41
+ env_id = config.env.env_id
42
+
43
+ if 'ALE' in config.env.env_id:
44
+ env = gym.make(env_id, render_mode='rgb_array')
45
+ env = AtariPreprocess(env, config.env.new_obs_size,
46
+ False)
47
+ else:
48
+ task = config.env.task
49
+ env = DMCtoGymWrapper(env_id, task,
50
+ resize=config.env.new_obs_size,
51
+ record=False)
52
+
53
+ print("start imagining")
54
+
55
+ encode = torch.load(run_path + '/models/encoder', map_location=torch.device('cpu') )
56
+ decoder = torch.load(run_path + '/models/decoder', map_location=torch.device('cpu') )
57
+ rssm = torch.load(run_path + '/models/rssm_model', map_location=torch.device('cpu') )
58
+ actor = torch.load(run_path + '/models/actor', map_location=torch.device('cpu'))
59
+
60
+ obs, _ = env.reset()
61
+
62
+ for i in range(100):
63
+ obs, _, _, _, _ = env.step(env.action_space.sample())
64
+
65
+ posterior = torch.zeros((1, config.main.stochastic_size))
66
+ deterministic = torch.zeros((1, config.main.deterministic_size))
67
+ e_obs = encode(torch.from_numpy(obs).to(dtype=torch.float))
68
+
69
+ _, posterior = rssm.representation(e_obs, deterministic)
70
+
71
+ from PIL import Image
72
+
73
+ frames = []
74
+
75
+ for i in tqdm(range(200)):
76
+ actions = actor(posterior, deterministic)
77
+ deterministic = rssm.recurrent(posterior, actions, deterministic)
78
+ dist, posterior = rssm.transition(deterministic)
79
+ d_obs = decoder(posterior, deterministic)
80
+ d_obs = d_obs.mean.squeeze().detach().numpy()
81
+ obs = ((d_obs.transpose([1,2,0]) + 0.5) * 255).clip(0, 255).astype(np.uint8)
82
+ img = Image.fromarray(obs, "RGB")
83
+ frames.append(img)
84
+
85
+ print("saving gif")
86
+ frame_one = frames[0]
87
+ frame_one.save(run_path + "/imagine.gif", format="GIF", append_images=frames, save_all=True, duration=30, loop=0)
88
+ print("finished")
requirements.txt ADDED
@@ -0,0 +1,143 @@
1
+ absl-py==2.1.0
2
+ addict==2.4.0
3
+ ale-py==0.8.1
4
+ anyio==4.2.0
5
+ appnope==0.1.3
6
+ argon2-cffi==23.1.0
7
+ argon2-cffi-bindings==21.2.0
8
+ arrow==1.3.0
9
+ async-lru==2.0.4
10
+ attrdict==2.0.1
11
+ attrs==23.2.0
12
+ AutoROM==0.4.2
13
+ AutoROM.accept-rom-license==0.6.1
14
+ Babel==2.14.0
15
+ beautifulsoup4==4.12.3
16
+ bleach==6.1.0
17
+ box2d-py==2.3.5
18
+ cachetools==5.3.2
19
+ certifi==2023.11.17
20
+ cffi==1.16.0
21
+ charset-normalizer==3.3.2
22
+ click==8.1.7
23
+ cloudpickle==3.0.0
24
+ comm==0.2.1
25
+ contourpy==1.2.0
26
+ cycler==0.12.1
27
+ debugpy==1.8.0
28
+ decorator==4.4.2
29
+ defusedxml==0.7.1
30
+ dm-tree==0.1.8
31
+ etils==1.6.0
32
+ Farama-Notifications==0.0.4
33
+ fastjsonschema==2.19.1
34
+ filelock==3.13.1
35
+ fonttools==4.47.2
36
+ fqdn==1.5.1
37
+ fsspec==2023.12.2
38
+ gast==0.5.4
39
+ glfw==2.6.5
40
+ google-auth==2.26.2
41
+ google-auth-oauthlib==1.2.0
42
+ grpcio==1.60.0
43
+ gymnasium==0.29.1
44
+ idna==3.6
45
+ imageio==2.33.1
46
+ imageio-ffmpeg==0.4.9
47
+ importlib-resources==6.1.1
48
+ ipykernel==6.29.0
49
+ ipywidgets==8.1.1
50
+ isoduration==20.11.0
51
+ Jinja2==3.1.3
52
+ json5==0.9.14
53
+ jsonpointer==2.4
54
+ jsonschema==4.21.0
55
+ jsonschema-specifications==2023.12.1
56
+ jupyter==1.0.0
57
+ jupyter-console==6.6.3
58
+ jupyter-events==0.9.0
59
+ jupyter-lsp==2.2.2
60
+ jupyter_client==8.6.0
61
+ jupyter_core==5.7.1
62
+ jupyter_server==2.12.5
63
+ jupyter_server_terminals==0.5.1
64
+ jupyterlab==4.0.10
65
+ jupyterlab-widgets==3.0.9
66
+ jupyterlab_pygments==0.3.0
67
+ jupyterlab_server==2.25.2
68
+ kiwisolver==1.4.5
69
+ lz4==4.3.3
70
+ Markdown==3.5.2
71
+ MarkupSafe==2.1.3
72
+ matplotlib==3.8.2
73
+ mistune==3.0.2
74
+ moviepy==1.0.3
75
+ mpmath==1.3.0
76
+ mujoco==3.1.1
77
+ nbclient==0.9.0
78
+ nbconvert==7.14.2
79
+ nbformat==5.9.2
80
+ nest-asyncio==1.5.9
81
+ networkx==3.2.1
82
+ notebook==7.0.6
83
+ notebook_shim==0.2.3
84
+ numpy==1.26.3
85
+ oauthlib==3.2.2
86
+ opencv-python==4.9.0.80
87
+ overrides==7.4.0
88
+ packaging==23.2
89
+ pandocfilters==1.5.1
90
+ pillow==10.2.0
91
+ platformdirs==4.1.0
92
+ proglog==0.1.10
93
+ prometheus-client==0.19.0
94
+ protobuf==4.23.4
95
+ psutil==5.9.7
96
+ pyasn1==0.5.1
97
+ pyasn1-modules==0.3.0
98
+ pycparser==2.21
99
+ pygame==2.5.2
100
+ PyOpenGL==3.1.7
101
+ pyparsing==3.1.1
102
+ python-dateutil==2.8.2
103
+ python-json-logger==2.0.7
104
+ PyYAML==6.0.1
105
+ pyzmq==25.1.2
106
+ qtconsole==5.5.1
107
+ QtPy==2.4.1
108
+ referencing==0.32.1
109
+ requests==2.31.0
110
+ requests-oauthlib==1.3.1
111
+ rfc3339-validator==0.1.4
112
+ rfc3986-validator==0.1.1
113
+ rpds-py==0.17.1
114
+ rsa==4.9
115
+ Send2Trash==1.8.2
116
+ Shimmy==0.2.1
117
+ sniffio==1.3.0
118
+ soupsieve==2.5
119
+ swig==4.1.1.post1
120
+ sympy==1.12
121
+ tensorboard==2.15.1
122
+ tensorboard-data-server==0.7.2
123
+ tensorflow-probability==0.23.0
124
+ terminado==0.18.0
125
+ tinycss2==1.2.1
126
+ tomli==2.0.1
127
+ torch==2.1.2
128
+ torchaudio==2.1.2
129
+ torchvision==0.16.2
130
+ wandb
131
+ tornado==6.4
132
+ tqdm==4.66.1
133
+ types-python-dateutil==2.8.19.20240106
134
+ typing_extensions==4.9.0
135
+ uri-template==1.3.0
136
+ urllib3==2.1.0
137
+ webcolors==1.13
138
+ webencodings==0.5.1
139
+ websocket-client==1.7.0
140
+ Werkzeug==3.0.1
141
+ widgetsnbextension==4.0.9
142
+ zipp==3.17.0
143
+ dm_control
utils/.DS_Store ADDED
Binary file (6.15 kB). View file
 
utils/__pycache__/buffer.cpython-310.pyc ADDED
Binary file (5.29 kB). View file
 
utils/__pycache__/models.cpython-310.pyc ADDED
Binary file (11.9 kB). View file
 
utils/__pycache__/utils.cpython-310.pyc ADDED
Binary file (3.53 kB). View file
 
utils/__pycache__/wrappers.cpython-310.pyc ADDED
Binary file (10.1 kB). View file
 
utils/buffer.py ADDED
@@ -0,0 +1,139 @@
1
+ """
2
+ Author: Minh Pham-Dinh
3
+ Created: Jan 26th, 2024
4
+ Last Modified: Feb 5th, 2024
5
6
+
7
+ Description:
8
+ File containing the ReplayBuffer that will be used in Dreamer.
9
+
10
+ The implementation is based on:
11
+ Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination," 2019.
12
+ [Online]. Available: https://arxiv.org/abs/1912.01603
13
+ """
14
+
15
+ import numpy as np
16
+ from gymnasium import Env
17
+ import torch
18
+ from addict import Dict
19
+
20
+ class ReplayBuffer:
21
+ def __init__(self, capacity, obs_size, action_size):
22
+
23
+ # store the observation and action shapes
24
+ self.obs_size = obs_size
25
+ self.action_size = action_size
26
+
27
+ # from SimpleDreamer implementation, saving memory
28
+ state_type = np.uint8 if len(self.obs_size) < 3 else np.float32
29
+
30
+ self.observation = np.zeros((capacity, ) + self.obs_size, dtype=state_type)
31
+
32
+ self.actions = np.zeros((capacity, ) + self.action_size, dtype=np.float32)
33
+ self.rewards = np.zeros((capacity, 1), dtype=np.float32)
34
+ self.dones = np.zeros((capacity, 1), dtype=np.float32)
35
+
36
+ self.pointer = 0
37
+ self.full = False
38
+
39
+ print(f'''
40
+ -----------initialized memory----------
41
+
42
+ obs_buffer_shape: {self.observation.shape}
43
+ actions_buffer_shape: {self.actions.shape}
44
+ rewards_buffer_shape: {self.rewards.shape}
45
+ dones_buffer_shape: {self.dones.shape}
46
+
47
+ ----------------------------------------
48
+ ''')
49
+
50
+ def add(self, obs, action, reward, done):
51
+ """Add method for buffer
52
+
53
+ Args:
54
+ obs (np.array): current observation
55
+ action (np.array): action taken
56
+ reward (float): reward received after action
58
+ done (bool): boolean value of termination or truncation
59
+ """
60
+ self.observation[self.pointer] = obs
61
+ self.actions[self.pointer] = action
62
+ self.rewards[self.pointer] = reward
63
+ self.dones[self.pointer] = done
64
+ self.pointer = (self.pointer + 1) % self.observation.shape[0]
65
+ if self.pointer == 0:
66
+ self.full = True
67
+
68
+ def sample(self, batch_size, seq_len, device):
69
+ """
70
+ Samples batches of experiences of fixed sequence length from the replay buffer,
71
+ taking into account the circular nature of the buffer to avoid crossing the
72
+ "end" of the buffer when it is full.
73
+
74
+ This method ensures that sampled sequences are continuous and do not wrap around
75
+ the end of the buffer, maintaining the temporal integrity of experiences. This is
76
+ particularly important when the buffer is full, and the pointer marks the boundary
77
+ between the newest and oldest data in the buffer.
78
+
79
+ Args:
80
+ batch_size (int): The number of sequences to sample.
81
+ seq_len (int): The length of each sequence to sample.
82
+ device (torch.device): The device on which the sampled data will be loaded.
83
+
84
+ Raises:
85
+ Exception: If there is not enough data in the buffer to sample a full sequence.
86
+
87
+ Returns:
88
+ Dict: A dictionary containing the sampled sequences of observations, actions,
89
+ rewards, and dones. Each item in the dictionary is a tensor of shape
90
+ (batch_size, seq_len, feature_dimension), except for 'dones' which is of shape
91
+ (batch_size, seq_len, 1).
92
+
93
+ Notes:
94
+ - The method handles different scenarios based on the buffer's state (full or not)
95
+ and the pointer's position to ensure valid sequence sampling without wrapping.
96
+ - When the buffer is not full, sequences can start from index 0 up to the
97
+ index where `seq_len` sequences can fit without surpassing the current pointer.
98
+ - When the buffer is full, the method ensures sequences do not start in a way
99
+ that would cause them to wrap around past the pointer, effectively crossing
100
+ the boundary between the newest and oldest data.
101
+ - This approach guarantees the sampled sequences respect the temporal order
102
+ and continuity necessary for algorithms that rely on sequences of experiences.
103
+ """
104
+
105
+ # Ensure there's enough data to sample
106
+ if self.pointer < seq_len and not self.full:
107
+ raise Exception('not enough data to sample')
108
+
109
+ # detail: handling different cases for circular sampling
110
+ if self.full:
111
+ if self.pointer - seq_len < 0:
112
+ valid_range = np.arange(self.pointer, self.observation.shape[0] - (self.pointer - seq_len) + 1)
113
+ else:
114
+ range_1 = np.arange(0, self.pointer - seq_len + 1)
115
+ range_2 = np.arange(self.pointer, self.observation.shape[0])
116
+ valid_range = np.concatenate((range_1, range_2), -1)
117
+ else:
118
+ valid_range = np.arange(0, self.pointer-seq_len+1)
119
+
120
+ start_index = np.random.choice(valid_range, (batch_size, 1))
121
+
122
+ seq_len = np.arange(seq_len)
123
+ sample_idcs = (start_index + seq_len) % self.observation.shape[0]
124
+
125
+ batch = Dict()
126
+
127
+ batch.obs = torch.from_numpy(self.observation[sample_idcs]).to(device)
128
+ batch.actions = torch.from_numpy(self.actions[sample_idcs]).to(device)
129
+ batch.rewards = torch.from_numpy(self.rewards[sample_idcs]).to(device)
130
+ batch.dones = torch.from_numpy(self.dones[sample_idcs]).to(device)
131
+
132
+ return batch
133
+
134
+ def clear(self, ):
135
+ self.pointer = 0
136
+ self.full = False
137
+
138
+ def __len__(self, ):
139
+ return self.pointer
utils/models.py ADDED
@@ -0,0 +1,440 @@
1
+ """
2
+ Author: Minh Pham-Dinh
3
+ Created: Jan 26th, 2024
4
+ Last Modified: Feb 10th, 2024
5
6
+
7
+ Description:
8
+ File containing all models that will be used in Dreamer.
9
+
10
+ The implementation is based on:
11
+ Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination," 2019.
12
+ [Online]. Available: https://arxiv.org/abs/1912.01603
13
+ """
14
+
15
+ import torch
16
+ import torch.nn as nn
17
+ import torch.nn.functional as F
18
+ import numpy as np
19
+
20
+ def initialize_weights(m):
21
+ if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
22
+ nn.init.kaiming_uniform_(m.weight.data, nonlinearity="relu")
23
+ nn.init.constant_(m.bias.data, 0)
24
+ elif isinstance(m, nn.Linear):
25
+ nn.init.kaiming_uniform_(m.weight.data)
26
+ nn.init.constant_(m.bias.data, 0)
27
+
28
+
29
+ class RSSM(nn.Module):
30
+ """Reccurent State Space Model (RSSM)
31
+ The main model that we will use to learn the latent dynamic of the environment
32
+ """
33
+ def __init__(self, stochastic_size, obs_embed_size, deterministic_size, hidden_size, action_size, activation=nn.ELU):
34
+ super().__init__()
35
+ self.stochastic_size = stochastic_size
36
+ self.action_size = action_size
37
+ self.deterministic_size = deterministic_size
38
+ self.obs_embed_size = obs_embed_size
39
+ self.action_size = action_size
40
+
41
+ # recurrent
42
+ self.recurrent_linear = nn.Sequential(
43
+ nn.Linear(stochastic_size + action_size, hidden_size),
44
+ activation(),
45
+ )
46
+ self.gru_cell = nn.GRUCell(hidden_size, deterministic_size)
47
+
48
+ # representation model, for calculating posterior
49
+ self.representatio_model = nn.Sequential(
50
+ nn.Linear(deterministic_size + obs_embed_size, hidden_size),
51
+ activation(),
52
+ nn.Linear(hidden_size, stochastic_size*2)
53
+ )
54
+
55
+ # transition model, for calculating the prior; used for imagining trajectories
56
+ self.transition_model = nn.Sequential(
57
+ nn.Linear(deterministic_size, hidden_size),
58
+ activation(),
59
+ nn.Linear(hidden_size, stochastic_size*2)
60
+ )
61
+
62
+
63
+
64
+ def recurrent(self, stoch_state, action, deterministic):
65
+ """The recurrent model, calculate the deterministic state given the stochastic state
66
+ the action, and the prior deterministic
67
+
68
+ Args:
69
+ a_t-1 (batch_size, action_size): action at time step, cannot be None.
70
+ s_t-1 (batch_size, stoch_size): stochastic state at time step. Defaults to None.
71
+ h_t-1 (batch_size, deterministic_size): deterministic at timestep. Defaults to None.
72
+
73
+ Returns:
74
+ h_t: deterministic at next time step
75
+ """
76
+
77
+ # initialize some sizes
78
+ x = torch.cat((action, stoch_state), -1)
79
+ out = self.recurrent_linear(x)
80
+ out = self.gru_cell(out, deterministic)
81
+ return out
82
+
83
+
84
+ def representation(self, embed_obs, deterministic):
85
+ """Calculate the distribution p of the stochastic state.
86
+
87
+ Args:
88
+ o_t (batch_size, embeded_obs_size): embedded observation (encoded)
89
+ h_t (batch_size, deterministic_size): deterministic state
90
+
91
+ Returns:
92
+ s_t posterior_distribution: distribution of stochastic states
93
+ s_t posterior: sampled stochastic states
94
+ """
95
+ x = torch.cat((embed_obs, deterministic), -1)
96
+ out = self.representatio_model(x)
97
+ mean, std = torch.chunk(out, 2, -1)
98
+ std = F.softplus(std) + 0.1
99
+
100
+ post_dist = torch.distributions.Normal(mean, std)
101
+ post = post_dist.rsample()
102
+
103
+ return post_dist, post
104
+
105
+
106
+ def transition(self, deterministic):
107
+ """Calculate the distribution q of the stochastic state.
108
+
109
+ Args:
110
+ h_t (batch_size, deterministic_size): deterministic state
111
+
112
+ Returns:
113
+ s_t prior_distribution: distribution of stochastic states
114
+ s_t prior: sampled stochastic states
115
+ """
116
+ out = self.transition_model(deterministic)
117
+ mean, std = torch.chunk(out, 2, -1)
118
+ std = F.softplus(std) + 0.1
119
+
120
+ prior_dist = torch.distributions.Normal(mean, std)
121
+ prior = prior_dist.rsample()
122
+ return prior_dist, prior
123
+
124
+
125
+ class ConvEncoder(nn.Module):
+     """Convolutional encoder that maps pixel observations to a flat embedding."""
+     def __init__(self, depth=32, input_shape=(3, 64, 64), activation=nn.ReLU):
+         super().__init__()
+         self.depth = depth
+         self.input_shape = input_shape
+         self.conv_layer = nn.Sequential(
+             nn.Conv2d(
+                 in_channels=input_shape[0],
+                 out_channels=depth * 1,
+                 kernel_size=4,
+                 stride=2,
+                 padding="valid"
+             ),
+             activation(),
+             nn.Conv2d(
+                 in_channels=depth * 1,
+                 out_channels=depth * 2,
+                 kernel_size=4,
+                 stride=2,
+                 padding="valid"
+             ),
+             activation(),
+             nn.Conv2d(
+                 in_channels=depth * 2,
+                 out_channels=depth * 4,
+                 kernel_size=4,
+                 stride=2,
+                 padding="valid"
+             ),
+             activation(),
+             nn.Conv2d(
+                 in_channels=depth * 4,
+                 out_channels=depth * 8,
+                 kernel_size=4,
+                 stride=2,
+                 padding="valid"
+             ),
+             activation()
+         )
+         self.conv_layer.apply(initialize_weights)
+
+     def forward(self, x):
+         # separate the (possibly multi-dimensional) batch shape from the image shape
+         batch_shape = x.shape[:-len(self.input_shape)]
+         if not batch_shape:
+             batch_shape = (1, )
+
+         x = x.reshape(-1, *self.input_shape)
+
+         out = self.conv_layer(x)
+
+         # flatten the feature maps, restoring the original batch shape
+         return out.reshape(*batch_shape, -1)
+
+
+ class ConvDecoder(nn.Module):
+     """Decode the latent state back into a pixel observation.
+     Referred to as the observation model in the Dreamer paper.
+     """
+     def __init__(self, stochastic_size, deterministic_size, depth=32, out_shape=(3, 64, 64), activation=nn.ReLU):
+         super().__init__()
+         self.out_shape = out_shape
+         self.net = nn.Sequential(
+             nn.Linear(deterministic_size + stochastic_size, depth * 32),
+             # reshape the latent vector into a 1x1 feature map for the transposed convolutions
+             nn.Unflatten(1, (depth * 32, 1)),
+             nn.Unflatten(2, (1, 1)),
+             nn.ConvTranspose2d(
+                 depth * 32,
+                 depth * 4,
+                 kernel_size=5,
+                 stride=2,
+             ),
+             activation(),
+             nn.ConvTranspose2d(
+                 depth * 4,
+                 depth * 2,
+                 kernel_size=5,
+                 stride=2,
+             ),
+             activation(),
+             nn.ConvTranspose2d(
+                 depth * 2,
+                 depth * 1,
+                 kernel_size=5 + 1,
+                 stride=2,
+             ),
+             activation(),
+             nn.ConvTranspose2d(
+                 depth * 1,
+                 out_shape[0],
+                 kernel_size=5 + 1,
+                 stride=2,
+             ),
+         )
+         self.net.apply(initialize_weights)
+
+     def forward(self, posterior, deterministic, mps_flatten=False):
+         """Concatenate the stochastic state (posterior) and the deterministic state into the latent
+         state, then output a distribution over the reconstructed pixel observation.
+
+         Args:
+             posterior (batch_sz, stochastic_size): stochastic state s_t (the posterior)
+             deterministic (batch_sz, deterministic_size): deterministic state h_t
+             mps_flatten (bool): whether to flatten the batch dimensions of the output. Needed on
+                 Apple MPS devices, which only support tensors with up to 4 dimensions.
+
+         Returns:
+             o'_t: distribution over the reconstructed observation
+         """
+         x = torch.cat((posterior, deterministic), -1)
+         batch_shape = x.shape[:-1]
+         if not batch_shape:
+             batch_shape = (1, )
+
+         x = x.reshape(-1, x.shape[-1])
+
+         if mps_flatten:
+             batch_shape = (-1, )
+
+         mean = self.net(x).reshape(*batch_shape, *self.out_shape)
+
+         # unit-variance Normal per pixel; the image dimensions form a single event
+         dist = torch.distributions.Normal(mean, 1)
+         return torch.distributions.Independent(dist, len(self.out_shape))
+
+
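Because the decoder returns a distribution rather than raw pixels, the image-reconstruction term of the world-model loss reduces to a log-likelihood. A minimal sketch with made-up sizes, assuming `decoder` is an instance of the class above:

```python
import torch

# Made-up sizes; `decoder` is assumed to be a ConvDecoder with matching
# stochastic_size=30 and deterministic_size=200.
batch, stoch, deter = 4, 30, 200
posterior = torch.zeros(batch, stoch)
deterministic = torch.zeros(batch, deter)
obs = torch.zeros(batch, 3, 64, 64)

obs_dist = decoder(posterior, deterministic)          # Independent Normal over (3, 64, 64)
reconstruction_loss = -obs_dist.log_prob(obs).mean()  # one scalar log-prob per batch element
```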
+ class RewardNet(nn.Module):
+     """Reward prediction model. It concatenates the stochastic state and the deterministic state
+     into the latent state and outputs a reward prediction.
+     """
+     def __init__(self, input_size, hidden_size, activation=nn.ELU):
+         super().__init__()
+
+         self.net = nn.Sequential(
+             nn.Linear(input_size, hidden_size),
+             activation(),
+             nn.Linear(hidden_size, 1)
+         )
+
+     def forward(self, stoch_state, deterministic):
+         """Concatenate the stochastic state and the deterministic state into the latent state and
+         output the predicted reward.
+
+         Args:
+             stoch_state (batch_sz, stochastic_size): stochastic state s_t (the posterior)
+             deterministic (batch_sz, deterministic_size): deterministic state h_t
+
+         Returns:
+             r_t: predicted rewards
+         """
+         x = torch.cat((stoch_state, deterministic), -1)
+         batch_shape = x.shape[:-1]
+         if not batch_shape:
+             batch_shape = (1, )
+
+         x = x.reshape(-1, x.shape[-1])
+
+         return self.net(x).reshape(*batch_shape, 1)
+
+
+ class ContinuoNet(nn.Module):
+     """Continuation prediction model. It concatenates the stochastic state and the deterministic
+     state into the latent state and predicts whether the termination state has been reached.
+     """
+     def __init__(self, input_size, hidden_size, activation=nn.ELU):
+         super().__init__()
+
+         self.net = nn.Sequential(
+             nn.Linear(input_size, hidden_size),
+             activation(),
+             nn.Linear(hidden_size, hidden_size),
+             activation(),
+             nn.Linear(hidden_size, 1)
+         )
+
+     def forward(self, stoch_state, deterministic):
+         """Concatenate the stochastic state and the deterministic state into the latent state and
+         output the termination prediction.
+
+         Args:
+             stoch_state (batch_sz, stochastic_size): stochastic state s_t (the posterior)
+             deterministic (batch_sz, deterministic_size): deterministic state h_t
+
+         Returns:
+             logits and the Bernoulli distribution over episode termination
+         """
+         x = torch.cat((stoch_state, deterministic), -1)
+         batch_shape = x.shape[:-1]
+         if not batch_shape:
+             batch_shape = (1, )
+
+         x = x.reshape(-1, x.shape[-1])
+
+         x = self.net(x).reshape(*batch_shape, 1)
+         return x, torch.distributions.Independent(torch.distributions.Bernoulli(logits=x), 1)
+
+
+ class Actor(nn.Module):
+     """Actor network."""
+     def __init__(self,
+                  latent_size,
+                  hidden_size,
+                  action_size,
+                  discrete=True,
+                  activation=nn.ELU,
+                  min_std=1e-4,
+                  init_std=5,
+                  mean_scale=5):
+
+         super().__init__()
+         self.latent_size = latent_size
+         self.hidden_size = hidden_size
+         self.action_size = (action_size if discrete else action_size * 2)
+         self.discrete = discrete
+         self.min_std = min_std
+         self.init_std = init_std
+         self.mean_scale = mean_scale
+
+         self.net = nn.Sequential(
+             nn.Linear(latent_size, hidden_size),
+             activation(),
+             nn.Linear(hidden_size, self.action_size)
+         )
+
+     def forward(self, stoch_state, deterministic):
+         """Concatenate the stochastic state and the deterministic state into the latent state and
+         use it to predict an action.
+
+         Args:
+             stoch_state (batch_sz, stochastic_size): stochastic state s_t (the posterior)
+             deterministic (batch_sz, deterministic_size): deterministic state h_t
+
+         Returns:
+             action sample: a one-hot action with straight-through gradients if discrete,
+             otherwise a sample from a tanh-transformed Normal
+         """
+         latent_state = torch.cat((stoch_state, deterministic), -1)
+         x = self.net(latent_state)
+
+         if self.discrete:
+             # straight-through gradients (as introduced in DreamerV2)
+             dist = torch.distributions.OneHotCategorical(logits=x)
+             action = dist.sample() + dist.probs - dist.probs.detach()
+         else:
+             # shift the std input so that softplus yields init_std at initialization
+             raw_init_std = np.log(np.exp(self.init_std) - 1)
+
+             mean, std = torch.chunk(x, 2, -1)
+             mean = self.mean_scale * torch.tanh(mean / self.mean_scale)
+             std = F.softplus(std + raw_init_std) + self.min_std
+
+             dist = torch.distributions.Normal(mean, std)
+             dist = torch.distributions.TransformedDistribution(dist, torch.distributions.TanhTransform())
+             action = torch.distributions.Independent(dist, 1).rsample()
+
+         return action
+
+
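The `dist.sample() + dist.probs - dist.probs.detach()` expression above is the straight-through estimator: the forward value is exactly the one-hot sample, while gradients flow back through the differentiable probabilities. A small self-contained illustration (independent of this repository):

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
dist = torch.distributions.OneHotCategorical(logits=logits)

# forward value is exactly the one-hot sample ...
action = dist.sample() + dist.probs - dist.probs.detach()

# ... yet a downstream objective still produces gradients w.r.t. the logits,
# because the backward pass goes through dist.probs rather than sample().
pseudo_value = (action * torch.tensor([[1.0, 2.0, 3.0]])).sum()
pseudo_value.backward()
print(action)       # e.g. tensor([[1., 0., 0.]])
print(logits.grad)  # non-zero gradient supplied by the straight-through estimator
```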
+ class Critic(nn.Module):
+     """Critic network."""
+     def __init__(self, latent_size, hidden_size, activation=nn.ELU):
+         super().__init__()
+         self.latent_size = latent_size
+
+         self.net = nn.Sequential(
+             nn.Linear(latent_size, hidden_size),
+             activation(),
+             nn.Linear(hidden_size, hidden_size),
+             activation(),
+             nn.Linear(hidden_size, 1)
+         )
+
+     def forward(self, stoch_state, deterministic):
+         """Concatenate the stochastic state and the deterministic state into the latent state and
+         use it to predict the state value.
+
+         Args:
+             stoch_state (batch_sz, seq_len, stochastic_size): stochastic state s_t (the posterior)
+             deterministic (batch_sz, seq_len, deterministic_size): deterministic state h_t
+
+         Returns:
+             state-value distribution
+         """
+         latent_state = torch.cat((stoch_state, deterministic), -1)
+
+         batch_shape = latent_state.shape[:-1]
+         if not batch_shape:
+             batch_shape = (1, )
+
+         latent_state = latent_state.reshape(-1, self.latent_size)
+
+         x = self.net(latent_state)
+
+         x = x.reshape(*batch_shape, 1)
+
+         # unit-variance Normal so the value loss becomes a log-likelihood
+         dist = torch.distributions.Normal(x, 1)
+         dist = torch.distributions.Independent(dist, 1)
+
+         return dist
+
utils/utils.py ADDED
@@ -0,0 +1,41 @@
+ import torch
+
+ def log_metrics(metrics, step, tb_writer, wandb_writer):
+     # Log metrics to TensorBoard
+     if tb_writer:
+         for key, value in metrics.items():
+             tb_writer.add_scalar(key, value, step)
+
+     # Log metrics to wandb
+     # if wandb_writer:
+     #     wandb_writer.log(metrics, step=step)
+
+
+ def td_lambda(rewards, predicted_discount, values, lambda_, device):
+     """
+     Compute the TD(λ) returns for value estimation.
+
+     Args:
+     - rewards (Tensor): rewards with shape [batch_size, horizon_len, 1].
+     - predicted_discount (Tensor): per-step discounts (e.g. the predicted continuation
+       probability scaled by gamma) with shape [batch_size, horizon_len, 1].
+     - values (Tensor): value estimates with shape [batch_size, horizon_len, 1].
+     - lambda_ (float): the λ parameter of TD(λ), controlling the bias-variance tradeoff.
+     - device (torch.device or str): device on which the returns are computed.
+
+     Returns:
+     - returns (Tensor): the computed λ-returns with shape [batch_size, horizon_len - 1, 1].
+     """
+     batch_size, _, _ = rewards.shape
+     last_lambda = torch.zeros((batch_size, 1)).to(device)
+     cur_rewards = rewards[:, :-1]
+     next_values = values[:, 1:]
+     predicted_discount = predicted_discount[:, :-1]
+
+     # one-step targets; the λ-weighted bootstrap is added in the backward recursion below
+     td_1 = cur_rewards + predicted_discount * next_values * (1 - lambda_)
+     returns = torch.zeros_like(cur_rewards).to(device)
+
+     for i in reversed(range(td_1.size(1))):
+         last_lambda = td_1[:, i] + predicted_discount[:, i] * lambda_ * last_lambda
+         returns[:, i] = last_lambda
+
+     return returns
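For reference, the loop above evaluates G_t = r_t + d_t [(1 − λ) V_{t+1} + λ G_{t+1}] backwards over the horizon, with the running return initialised to zero beyond the last step. A toy call, with placeholder shapes chosen only to show the expected tensor layout:

```python
import torch

# Toy shapes only: a batch of 2 imagined trajectories with horizon 5.
B, H = 2, 5
rewards = torch.rand(B, H, 1)
discounts = torch.full((B, H, 1), 0.99)   # predicted continuation probability * gamma
values = torch.rand(B, H, 1)

returns = td_lambda(rewards, discounts, values, lambda_=0.95, device="cpu")
print(returns.shape)  # torch.Size([2, 4, 1]) — one λ-return per step except the last
```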
utils/wrappers.py ADDED
@@ -0,0 +1,265 @@
+ """
+ Author: Minh Pham-Dinh
+ Created: Feb 4th, 2024
+ Last Modified: Feb 7th, 2024
+
+ Description:
+     File containing wrappers for different environment types.
+ """
+
+ import gymnasium as gym
+ from dm_control import suite
+ from dm_control.suite.wrappers import pixels
+ import numpy as np
+ import cv2
+ import os
+ from dm_control.rl.control import Environment
+
+
+ # wrapper by Hafner et al.
+ class ActionRepeat:
+     """Repeat each action for a fixed number of steps, accumulating the reward."""
+     def __init__(self, env, repeats):
+         self.env = env
+         self.repeats = repeats
+
+     def __getattr__(self, name):
+         return getattr(self.env, name)
+
+     def step(self, action):
+         done = False
+         total_reward = 0
+         current_step = 0
+         while current_step < self.repeats and not done:
+             obs, reward, termination, truncation, info = self.env.step(action)
+             total_reward += reward
+             current_step += 1
+             done = termination or truncation
+         return obs, total_reward, termination, truncation, info
+
+
+ # wrapper by Hafner et al.
+ class NormalizeActions:
+     """
+     A wrapper class that normalizes the action space of an environment.
+
+     Args:
+         env (gym.Env): The environment to be wrapped.
+
+     Attributes:
+         _env (gym.Env): The original environment.
+         _mask (numpy.ndarray): A boolean mask indicating which action dimensions are finite.
+         _low (numpy.ndarray): The lower bounds of the action space.
+         _high (numpy.ndarray): The upper bounds of the action space.
+     """
+
+     def __init__(self, env):
+         self._env = env
+         self._mask = np.logical_and(
+             np.isfinite(env.action_space.low),
+             np.isfinite(env.action_space.high))
+         self._low = np.where(self._mask, env.action_space.low, -1)
+         self._high = np.where(self._mask, env.action_space.high, 1)
+
+     def __getattr__(self, name):
+         """
+         Delegate attribute access to the original environment.
+
+         Args:
+             name (str): The name of the attribute.
+
+         Returns:
+             Any: The value of the attribute in the original environment.
+         """
+         return getattr(self._env, name)
+
+     @property
+     def action_space(self):
+         """
+         Get the normalized action space.
+
+         Returns:
+             gym.spaces.Box: The normalized action space.
+         """
+         low = np.where(self._mask, -np.ones_like(self._low), self._low)
+         high = np.where(self._mask, np.ones_like(self._low), self._high)
+         return gym.spaces.Box(low, high, dtype=np.float32)
+
+     def step(self, action):
+         """
+         Take a step in the environment with a normalized action.
+
+         Args:
+             action (numpy.ndarray): The normalized action, with finite dimensions in [-1, 1].
+
+         Returns:
+             Tuple: the next observation, reward, termination and truncation flags, and info dict.
+         """
+         # rescale from [-1, 1] back to the environment's original bounds
+         original = (action + 1) / 2 * (self._high - self._low) + self._low
+         original = np.where(self._mask, original, action)
+         return self._env.step(original)
+
+
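To make the affine rescaling in `NormalizeActions.step` concrete, here is a small standalone check with a made-up action range (not tied to any particular environment):

```python
import numpy as np

low, high = np.array([-2.0, 0.0]), np.array([2.0, 10.0])
action = np.array([0.0, 1.0])                      # normalized action in [-1, 1]

original = (action + 1) / 2 * (high - low) + low
print(original)                                    # [ 0. 10.] — midpoint of the first range, top of the second
```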
+ class DMCtoGymWrapper(gym.Env):
+     """
+     Wrapper that converts a DeepMind Control Suite environment into a Gymnasium environment, with optional episode recording and truncation.
+
+     Args:
+         domain_name (str): The name of the domain.
+         task_name (str): The name of the task.
+         task_kwargs (dict, optional): Additional kwargs for the task.
+         visualize_reward (bool, optional): Whether to visualize the reward. Defaults to False.
+         resize (list, optional): Size to which observations are rendered. Defaults to [64, 64].
+         record (bool, optional): Whether to record episodes. Defaults to False.
+         record_freq (int, optional): Frequency (in episodes) at which to record. Defaults to 100.
+         record_path (str, optional): Path where recorded videos are saved. Defaults to '../'.
+         max_episode_steps (int, optional): Maximum steps per episode before truncation. Defaults to 1000.
+         camera (int, optional): Camera id used for rendering. Defaults to a per-domain choice.
+     """
+     def __init__(self, domain_name, task_name, task_kwargs=None, visualize_reward=False, resize=[64, 64], record=False, record_freq=100, record_path='../', max_episode_steps=1000, camera=None):
+         super().__init__()
+         self.env = suite.load(domain_name, task_name, task_kwargs=task_kwargs, visualize_reward=visualize_reward)
+         self.episode_count = -1
+         self.record = record
+         self.record_freq = record_freq
+         self.record_path = record_path
+         self.max_episode_steps = max_episode_steps
+         self.current_step = 0
+         self.total_reward = 0
+         self.recorder = None
+
+         # Define the action space based on the DMC environment
+         action_spec = self.env.action_spec()
+         self.action_space = gym.spaces.Box(low=action_spec.minimum, high=action_spec.maximum, dtype=np.float32)
+
+         # Render pixel observations and expose them through the observation space
+         self.env = pixels.Wrapper(self.env, pixels_only=True)
+         self.resize = resize  # assuming RGB images
+         self.observation_space = gym.spaces.Box(low=-0.5, high=0.5, shape=(3, *resize), dtype=np.float32)
+
+         if camera is None:
+             camera = dict(quadruped=2).get(domain_name, 0)
+         self._camera = camera
+
+     def step(self, action):
+         time_step = self.env.step(action)
+         obs = self._get_obs(self.env)
+
+         reward = time_step.reward if time_step.reward is not None else 0
+         self.total_reward += reward
+         self.current_step += 1
+
+         termination = time_step.last()
+         truncation = (self.current_step == self.max_episode_steps)
+         info = {}
+         if termination or truncation:
+             info = {
+                 'episode': {
+                     'r': [self.total_reward],
+                     'l': self.current_step
+                 }
+             }
+
+         if self.recorder:
+             frame = cv2.cvtColor(self.env.physics.render(camera_id=self._camera), cv2.COLOR_RGB2BGR)
+             self.recorder.write(frame)
+             video_file = os.path.join(self.record_path, f"episode_{self.episode_count}.webm")
+             if termination or truncation:
+                 self._reset_recorder()
+                 info['video_path'] = video_file
+
+         return obs, reward, termination, truncation, info
+
+     def reset(self):
+         self.current_step = 0
+         self.total_reward = 0
+         self.episode_count += 1
+
+         time_step = self.env.reset()
+         obs = self._get_obs(self.env)
+
+         if self.record and self.episode_count % self.record_freq == 0:
+             self._start_recording(self.env.physics.render(camera_id=self._camera))
+
+         return obs, {}
+
+     def _start_recording(self, frame):
+         if not os.path.exists(self.record_path):
+             os.makedirs(self.record_path)
+         video_file = os.path.join(self.record_path, f"episode_{self.episode_count}.webm")
+         height, width, _ = frame.shape
+         self.recorder = cv2.VideoWriter(video_file, cv2.VideoWriter_fourcc(*'vp80'), 30, (width, height))
+         self.recorder.write(frame)
+
+     def _reset_recorder(self):
+         if self.recorder:
+             self.recorder.release()
+             self.recorder = None
+
+     def _get_obs(self, env):
+         # render, scale pixels to [-0.5, 0.5], and move channels first (CHW)
+         obs = self.render()
+         obs = obs / 255 - 0.5
+         rearranged_obs = obs.transpose([2, 0, 1])
+         return rearranged_obs
+
+     def render(self, mode='rgb_array'):
+         # adjust camera_id based on the environment
+         return self.env.physics.render(*self.resize, camera_id=self._camera)
+
+
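A minimal sketch of how these wrappers might be stacked for a control-suite task; the domain and task names, action-repeat value, and other arguments are illustrative placeholders rather than the values used by the training configs:

```python
# Illustrative only: wrapper stack for a DeepMind Control task.
env = DMCtoGymWrapper(
    domain_name="walker",
    task_name="walk",
    resize=[64, 64],
    record=False,
    max_episode_steps=1000,
)
env = ActionRepeat(env, repeats=2)   # Dreamer uses an action repeat of 2 on DMC
env = NormalizeActions(env)

obs, info = env.reset()              # obs: float array of shape (3, 64, 64) in [-0.5, 0.5]
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```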
+ class AtariPreprocess(gym.Wrapper):
+     """
+     A custom Gym wrapper that integrates multiple environment processing steps:
+     - Records episode statistics and videos.
+     - Resizes observations to a specified shape.
+     - Scales and reorders observation channels.
+     - Scales rewards using the tanh function.
+
+     Parameters:
+     - env (gym.Env): The original environment to wrap.
+     - new_obs_size (tuple): The target size for observation resizing (height, width).
+     - record (bool): If True, enable video recording.
+     - record_path (str): The directory path where videos will be saved.
+     - record_freq (int): Frequency (in episodes) at which to record videos.
+     """
+     def __init__(self, env, new_obs_size, record=False, record_path='../videos/', record_freq=100):
+         super().__init__(env)
+         self.env = gym.wrappers.RecordEpisodeStatistics(env)
+
+         if record:
+             self.env = gym.wrappers.RecordVideo(self.env, record_path, episode_trigger=lambda episode_id: episode_id % record_freq == 0)
+         self.env = gym.wrappers.ResizeObservation(self.env, shape=new_obs_size)
+
+         self.new_obs_size = new_obs_size
+         self.observation_space = gym.spaces.Box(
+             low=-0.5, high=0.5,
+             shape=(3, new_obs_size[0], new_obs_size[1]),
+             dtype=np.float32
+         )
+
+     def step(self, action):
+         obs, reward, termination, truncation, info = super().step(action)
+         obs = self.process_observation(obs)
+         reward = np.tanh(reward)  # scale reward
+         return obs, reward, termination, truncation, info
+
+     def reset(self, **kwargs):
+         obs, info = super().reset(**kwargs)
+         obs = self.process_observation(obs)
+         return obs, info
+
+     def process_observation(self, observation):
+         """
+         Process and return the observation from the environment.
+         - Scales pixel values to the range [-0.5, 0.5].
+         - Reorders channels to CHW format (channels, height, width).
+
+         Parameters:
+         - observation (np.ndarray): The original observation from the environment.
+
+         Returns:
+         - np.ndarray: The processed observation.
+         """
+         if isinstance(observation, dict) and 'pixels' in observation:
+             observation = observation['pixels']
+         observation = observation / 255.0 - 0.5
+         observation = np.transpose(observation, (2, 0, 1))
+         return observation
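And a matching sketch for the Atari path, assuming the Arcade Learning Environment is installed; the environment id and arguments below are placeholders for illustration, not the training scripts' settings:

```python
import gymnasium as gym
import numpy as np

# Illustrative only: a pixel-based Atari env fed through AtariPreprocess.
env = gym.make("ALE/Boxing-v5", render_mode="rgb_array")
env = AtariPreprocess(env, new_obs_size=(64, 64))

obs, info = env.reset()
print(obs.shape, obs.min() >= -0.5, obs.max() <= 0.5)   # (3, 64, 64) True True
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```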