# Cooperative Cuisine and Reinforcement Learning
    
    
Cooperative Cuisine can be used to train a reinforcement learning agent. In this implementation, [stable_baselines](https://github.com/hill-a/stable-baselines) RL algorithms are used.
    
    
<p align="center">
  <img src="./data/tomato_soup_fixed_small_env_third.gif" width="12%" style="margin-right:1%;" />
  <img src="./data/tomato_soup_small_random_env_third.gif" width="12%" style="margin-right:1%;" />
  <img src="./data/salad_fixed_small_env_third.gif" width="12%" style="margin-right:1%;" />
  <img src="./data/onion_soup_centre_pots_fixed_env_overcooked-ai_third.gif" width="17%" style="margin-right:1%;" />
  <img src="./data/onion_soup_centre_pots_fixed_env_overcooked-ai_with_cutting_third.gif" width="17%" style="margin-right:1%;" />
  <img src="./data/onion_soup_large_env_overcooked-ai_third.gif" width="12%" style="margin-right:1%;" />
  <img src="./data/onion_soup_large_random_env_overcooked-ai_third.gif" width="12%" style="margin-right:1%;" />
</p>
    
    ## Key Python Files in Reinforcement Learning Folder
    
    
    1. **[gym_env.py](./gym_env.py)**  
    
   Implements the typical reinforcement learning functions `step` and `reset`. Additionally, it calls the `state_to_observation` converter so that the reinforcement learning agent learns on the predefined representation (a minimal sketch of this structure follows this list).
    
    
    2. **[train_single_agent.py](./train_single_agent.py)**  
    
       Trains a single agent using predefined configurations (managed with Hydra). Also enables multirun or a hyperparameter sweeper, if defined in the `rl_config.yaml`.
    
    
    3. **[run_single_agent.py](./run_single_agent.py)**  
    
   Loads a trained agent and lets it play in the environment.
    
    
    4. **[play_gym.py](./play_gym.py)**  
    
       Allows a user to play a character in the gym manually. This can be helpful to inspect the representation or test different hooks and rewards.
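
To make the structure concrete, here is a minimal, hedged sketch of such a gym wrapper. The class name and every call on the underlying environment are illustrative assumptions, not the exact identifiers used in `gym_env.py`:

```python
# Hedged sketch of the gym wrapper structure; CooperativeCuisineGym and all
# calls on the underlying environment (create_env, get_state, apply_action,
# game_ended) are illustrative placeholders, not the exact names in gym_env.py.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CooperativeCuisineGym(gym.Env):
    def __init__(self, create_env, state_to_observation, num_actions, obs_size):
        super().__init__()
        self.create_env = create_env                        # factory for the underlying env
        self.state_to_observation = state_to_observation    # converter from obs_converter/
        self.action_space = spaces.Discrete(num_actions)
        # PPO expects a flat observation vector (see the obs_converter note below).
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_size,), dtype=np.float32)
        self.env = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.env = self.create_env()                        # fresh, possibly shuffled, environment
        obs = self.state_to_observation(self.env.get_state())
        return obs, {}

    def step(self, action):
        reward = self.env.apply_action(action)              # rewards come from the env's reward hooks
        obs = self.state_to_observation(self.env.get_state())
        terminated = self.env.game_ended
        return obs, reward, terminated, False, {}
```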
    
    ### Obs Converter Subfolder
Within the `obs_converter` subfolder, several converters are defined that turn the environment state into vector representations, which are then used in `gym_env.py`. When developing new converters, ensure that the output is properly flattened, as PPO only processes flattened arrays correctly. Note that `CnnPolicy` is supported for images only and not for multi-dimensional vectors.
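
As a hedged illustration of the flattening requirement, a converter along the following lines produces a single 1-D float vector; the state fields it reads are placeholders rather than the actual Cooperative Cuisine state schema:

```python
import numpy as np

def onehot_grid_converter(state, width, height, num_item_types):
    """Sketch of an observation converter: encode each grid cell as a one-hot
    item vector and return one flattened float32 array, the shape that PPO's
    MlpPolicy expects. The state fields accessed here are placeholders."""
    grid = np.zeros((height, width, num_item_types), dtype=np.float32)
    for counter in state["counters"]:                        # placeholder field name
        x, y = counter["position"]                           # placeholder field name
        grid[int(y), int(x), counter["item_type_id"]] = 1.0  # placeholder field name
    return grid.flatten()                                    # only flat vectors feed into PPO
```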
    
    ---
    
    ### Configurations and Hydra Integration
    
    
    All configurations are managed with [Hydra](https://hydra.cc/). In the `reinforcement_learning/config` folder, the [`rl_config.yaml`](./config/rl_config.yaml) file contains the main configuration details. The following subfolders hold configurations that can be overridden via command line arguments:
    
    
    - **Model Configs:** Located in the `model` folder, these hold possible models with their respective hyperparameters.
- **Sweeper:** A hyperparameter sweeper is configured in `rl_config.yaml` and is activated when `train_single_agent.py` is called with the multirun flag; otherwise, the normal model parameters are used (see the sketch after this list).
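
For orientation, here is a minimal sketch of how a Hydra-managed entry point of this kind is usually structured; the config keys shown in the example overrides are assumptions, not the exact contents of `rl_config.yaml`:

```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="config", config_name="rl_config")
def main(cfg: DictConfig) -> None:
    # cfg already contains any command-line overrides, e.g.
    #   python train_single_agent.py model=ppo model.learning_rate=0.0003
    # or a sweep over values with the multirun flag:
    #   python train_single_agent.py --multirun model.learning_rate=0.0003,0.0006
    print(OmegaConf.to_yaml(cfg))
    # ... build the environment and model from cfg and start training here ...


if __name__ == "__main__":
    main()
```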
    
    The layout files for the project are stored in the `cooperative_cuisine/config` folder rather than the `reinforcement_learning/config` folder.
    
    **Weights & Biases** integration is used to track and manage the training process.
    
    ---
    
    ## Overcooked-AI and Cooperative Cuisine
    
    ### Using Overcooked-AI Levels and Configs in Cooperative Cuisine
    
    
    All layouts from [**Overcooked-AI**](https://github.com/HumanCompatibleAI/overcooked_ai) can be used within Cooperative Cuisine. Dedicated configs are defined and can be loaded via Hydra. To use Overcooked-AI layouts:
    
    1. Set the [`overcooked-ai_environment_config.yaml`](./config/environment/overcooked-ai_environment_config.yaml) as the environment config.
    
    2. Define any layout from Overcooked-AI under `layout_name`.
    
    3. Set `item_config` to [`item_info_overcooked-ai.yaml`](./config/item_info/item_info_overcooked-ai.yaml).
    
    
    These configurations ensure that Overcooked-AI layouts and rewards are applied.
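
As a hedged sketch, the same selection can also be expressed programmatically with Hydra's compose API; the exact group names and whether `layout_name` is a top-level key are assumptions based on the steps above:

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose rl_config with the Overcooked-AI specific choices. The override keys
# below are assumptions derived from the config folder names; the layout name
# is just an example from Overcooked-AI.
with initialize(version_base=None, config_path="config"):
    cfg = compose(
        config_name="rl_config",
        overrides=[
            "environment=overcooked-ai_environment_config",
            "item_info=item_info_overcooked-ai",
            "layout_name=cramped_room",
        ],
    )
print(OmegaConf.to_yaml(cfg))
```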
    
    ---
    
    ### Defining the Connection between Overcooked-AI and Cooperative Cuisine
    
    
    Cooperative Cuisine is highly modular, thanks to Hydra as the config manager. Parameters from Overcooked-AI are used directly in the config file, with a layout mapping defined to convert Overcooked-AI layouts into the Cooperative Cuisine format.
    
    
    ---
    
    ### Results on Overcooked-AI Layouts
    
Because Overcooked-AI layouts require no cutting board and do not use random environments, Cooperative Cuisine was able to replicate the Overcooked-AI results and additionally achieved good performance on several layouts even with random counter placement.
    
    ---
    
    ## Experiences with Reinforcement Learning on the Cooperative Cuisine Environment
    
    ### Introducing Intermediate Rewards
    
Introducing intermediate rewards is crucial, as a meal can require up to 20 moves, making it unlikely for an agent to succeed by chance. Intermediate rewards should be small compared to the final reward, but they help guide the agent toward good actions. Careful balancing is needed for negative rewards on actions such as using the trashcan: too high a penalty can discourage learning entirely, while too small a penalty can lead to undesirable behaviors like throwing items away unnecessarily.
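
A minimal sketch of such a shaping scheme; the event names and magnitudes below are illustrative assumptions, not the project's actual reward configuration:

```python
# Illustrative reward shaping: small intermediate rewards, one dominant final
# reward, and a mild trashcan penalty. All event names and magnitudes are
# assumptions for demonstration, not the values used in the project's configs.
SHAPED_REWARDS = {
    "ingredient_cut": 0.1,   # intermediate: progress toward the recipe
    "item_in_pot": 0.2,      # intermediate: cooking has started
    "meal_plated": 0.5,      # intermediate: close to serving
    "meal_served": 10.0,     # final reward, dominates all shaping terms
    "trashcan_used": -0.5,   # small enough not to discourage exploration
}

def step_reward(events):
    """Sum the shaped rewards for all events fired during one step."""
    return sum(SHAPED_REWARDS.get(event, 0.0) for event in events)
```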
    
    ---
    
    ### Shuffle Counter
    
    Using a pre-defined environment simplifies the problem, allowing the agent to learn quickly. However, this approach limits insight into how well the agent utilizes the representation. Introducing random counter shuffling increases difficulty, forcing the agent to depend more on the representation, thus making the learning process more challenging and meaningful.
    
    ---
    
    ### Increasing Environment Size
    
    Increasing environment size significantly impacts the agent's learning process. More steps are required for actions, particularly for tasks like plating meals. Combining large environments with cutting tasks creates additional complexity. If cutting is not required, agents handle large (and even dynamic) environments well, as interaction actions are less critical.
    
    ---
    
    ### Using the Cutting Board
    
    The cutting board presents a major challenge for the agent, especially when multiple cut items are needed. The agent can become fixated on cutting tasks and struggle with other actions like cooking afterward. Careful reward shaping is essential for addressing this issue.
    
    ---
    
    ### PPO (Proximal Policy Optimization) Insights
    
    
PPO can be unstable, showing good progress and then plateauing. A recommended game time limit is between 150 and 300 seconds, depending on the complexity of the task. For faster training, a lower time limit can be effective.
    
    
    #### Recommended PPO Hyperparameters:
    
- **Entropy coefficient (ent_coef):** between 0 and 0.01 to aid exploration.
    
    - **Batch size:** 256
    - **Number of environments (n_envs):** 32
    - **Learning rate:** 0.0006
    - **Gamma:** Set high for long-term rewards.
    
    The number of timesteps varies significantly based on the task's complexity (e.g., whether cutting is required, environment size).
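
Putting the recommendations together, here is a hedged sketch using the Stable-Baselines3 PPO API (the environment factory, total timesteps, and gamma value are placeholders, and the repository's pinned stable_baselines version may differ):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env


def make_env():
    """Placeholder factory: construct and return the Cooperative Cuisine gym env here."""
    raise NotImplementedError


# 32 parallel environments, as recommended above.
vec_env = make_vec_env(make_env, n_envs=32)

model = PPO(
    "MlpPolicy",              # flat vector observations; CnnPolicy is for image inputs only
    vec_env,
    learning_rate=0.0006,
    batch_size=256,
    ent_coef=0.01,            # between 0 and 0.01 to aid exploration
    gamma=0.99,               # high gamma for long-horizon rewards (illustrative value)
    verbose=1,
)
model.learn(total_timesteps=5_000_000)   # adjust to the task's complexity
model.save("ppo_cooperative_cuisine")
```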
    
    
## Results

### Results on the Overcooked-AI Layouts
    
    
    <p align="center">
      <img src="./data/onion_soup_centre_pots_fixed_env_overcooked-ai_row.gif" width="100%"  />
    
    Preparing onion soup in the overcooked-ai centre-pots environment with a fixed counter layout.
    
    <p align="center">
      <img src="./data/onion_soup_centre_pots_fixed_env_overcooked-ai_with_cutting_row.gif" width="100%"  />
    
Preparing onion soup in the overcooked-ai centre-pots environment with a fixed counter layout and an added cutting board.
    
    <p align="center">
      <img src="./data/onion_soup_large_env_overcooked-ai_row.gif" width="100%"  />
    
    Preparing onion soup in the overcooked-ai large environment with a fixed counter layout.
    
    <p align="center">
      <img src="./data/onion_soup_large_random_env_overcooked-ai_row.gif" width="100%"  />
    
    Preparing onion soup in the overcooked-ai large environment with a random counter layout.
    
      <img src="./data/tomato_soup_fixed_small_env_row1.gif" width="49.5%"  />
      <img src="./data/tomato_soup_fixed_small_env_row2.gif" width="49.5%" />
    
       Preparing a tomato soup in the cooperative cuisine environment with a fixed counter layout.
    
    <p align="center">
      <img src="./data/tomato_soup_small_random_env_row.gif" width="120%"  />
    
    Preparing a tomato soup in the cooperative cuisine environment with a random counter layout.
    
    <p align="center">
      <img src="./data/salad_fixed_small_env_row.gif" width="120%"  />
    
Preparing a salad in the cooperative cuisine environment with a fixed counter layout.
    
---
    by Christoph Kowalski