- Cooperative Cuisine and Reinforcement Learning
- Key Python Files in Reinforcement Learning Folder
- Obs Converter Subfolder
- Configurations and Hydra Integration
- Overcooked-AI and Cooperative Cuisine
- Using Overcooked-AI Levels and Configs in Cooperative Cuisine
- Defining the Connection between Overcooked-AI and Cooperative Cuisine
- Results on Overcooked-AI Layouts
- Experiences with Reinforcement Learning on the Cooperative Cuisine Environment
- Introducing Intermediate Rewards
- Shuffle Counter
- Increasing Environment Size
- Using the Cutting Board
- PPO (Proximal Policy Optimization) Insights
- Recommended PPO Hyperparameters:
- Results
- Results on the Overcooked-AI Layouts
Cooperative Cuisine and Reinforcement Learning
Cooperative Cuisine can be used to train a reinforcement learning agent. In this implementation, stable_baselines is used to load the RL algorithm.
Key Python Files in Reinforcement Learning Folder
- gym_env.py: Implements the typical reinforcement learning functions step and reset. Additionally, it calls the state_to_observation converter so that the reinforcement learning process learns on the predefined representation (see the sketch after this list).
- train_single_agent.py: Trains a single agent using predefined configurations (managed with Hydra). Also enables multirun or a hyperparameter sweeper, if defined in rl_config.yaml.
- run_single_agent.py: Enables loading a trained agent and allows the agent to play in the environment.
- play_gym.py: Allows a user to play a character in the gym manually. This can be helpful to inspect the representation or test different hooks and rewards.
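As a minimal sketch of how these pieces fit together with stable_baselines: the wrapper class name EnvGymWrapper and its import path below are assumptions for illustration and may not match the actual class defined in gym_env.py.

```python
# Minimal sketch, assuming gym_env.py exposes a Gymnasium-compatible wrapper.
# "EnvGymWrapper" and its constructor are placeholders, not the project's exact API.
from stable_baselines3 import PPO

from gym_env import EnvGymWrapper  # hypothetical import path and class name

env = EnvGymWrapper()  # internally applies the state_to_observation converter
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_cooperative_cuisine")
```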
Obs Converter Subfolder
Within the obs_converter subfolder, several converters are defined for converting the environment into vector representations, which are then used in gym_env.py. When developing new converters, ensure they are properly flattened, as only flattened arrays are processed correctly by PPO. Note that the CnnPolicy is supported for images only and not for multi-dimensional vectors.
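As an illustration, a flattened converter could look roughly like the following; the class name and the grid-based state fields are assumptions, since the actual converter interface lives in the obs_converter subfolder.

```python
# Hypothetical converter sketch: the class name and the state fields used here
# are assumptions, not the actual interface of the obs_converter subfolder.
import numpy as np


class FlattenedGridConverter:
    """Encodes the kitchen grid as a flat float32 vector for MLP-based PPO."""

    def __init__(self, width: int, height: int, num_item_types: int):
        self.width = width
        self.height = height
        self.num_item_types = num_item_types

    def state_to_observation(self, state: dict) -> np.ndarray:
        # One-hot encode the item type on every counter cell ...
        grid = np.zeros((self.height, self.width, self.num_item_types), dtype=np.float32)
        for (x, y), item_type in state["counters"].items():
            grid[y, x, item_type] = 1.0
        # ... and flatten, because PPO's MlpPolicy expects a 1-D observation.
        return grid.flatten()
```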
Configurations and Hydra Integration
All configurations are managed with Hydra. In the reinforcement_learning/config folder, the rl_config.yaml file contains the main configuration details. The following subfolders hold configurations that can be overridden via command line arguments:
- Model Configs: Located in the model folder, these hold possible models with their respective hyperparameters.
- Sweeper: A hyperparameter sweeper is implemented in rl_config.yaml and is activated when train_single_agent.py is called with the multirun argument. Otherwise, the normal model parameters are used (see the sketch after this list).
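As a rough sketch of how such a Hydra-managed entry point and its overrides look (the decorator arguments and script layout are assumptions, not the exact contents of train_single_agent.py):

```python
# Illustrative sketch of a Hydra entry point; config_path/config_name follow the
# folder layout described above, but the real train_single_agent.py may differ.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="config", config_name="rl_config")
def main(cfg: DictConfig) -> None:
    # cfg.model holds the hyperparameters selected from the model/ folder.
    # Any value can be overridden on the command line, e.g.
    #   python train_single_agent.py model.learning_rate=0.0003
    # and the sweeper is triggered with the multirun flag, e.g.
    #   python train_single_agent.py --multirun model.learning_rate=0.0003,0.0006
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```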
The layout files for the project are stored in the cooperative_cuisine/config folder rather than the reinforcement_learning/config folder.
Weights & Biases integration is used to track and manage the training process.
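A minimal sketch of such tracking, assuming the standard stable_baselines3 callback shipped with the wandb package; the project name and the EnvGymWrapper import are placeholders, not the values used in this repository.

```python
# Sketch of Weights & Biases tracking around a stable_baselines3 training run.
import wandb
from stable_baselines3 import PPO
from wandb.integration.sb3 import WandbCallback

from gym_env import EnvGymWrapper  # hypothetical wrapper from gym_env.py

run = wandb.init(project="cooperative-cuisine-rl", sync_tensorboard=True)
model = PPO("MlpPolicy", EnvGymWrapper(), tensorboard_log=f"runs/{run.id}")
model.learn(total_timesteps=1_000_000, callback=WandbCallback())
run.finish()
```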
Overcooked-AI and Cooperative Cuisine
Using Overcooked-AI Levels and Configs in Cooperative Cuisine
All layouts from Overcooked-AI can be used within Cooperative Cuisine. Dedicated configs are defined and can be loaded via Hydra. To use Overcooked-AI layouts:
- Set overcooked-ai_environment_config.yaml as the environment config.
- Define any layout from Overcooked-AI under layout_name.
- Set item_config to item_info_overcooked-ai.yaml.
These configurations ensure that Overcooked-AI layouts and rewards are applied.
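The same selection can also be made programmatically through Hydra's compose API. The sketch below is only illustrative: the config group names and override keys are assumptions based on the steps above, and the layout name cramped_room is just one example from Overcooked-AI.

```python
# Hypothetical: composing the Overcooked-AI configuration with Hydra's compose API
# instead of command-line overrides. The group/key names are assumptions.
from hydra import compose, initialize

with initialize(version_base=None, config_path="config"):
    cfg = compose(
        config_name="rl_config",
        overrides=[
            "environment=overcooked-ai_environment_config",  # assumed group name
            "environment.layout_name=cramped_room",          # any Overcooked-AI layout
            "item_config=item_info_overcooked-ai",           # assumed group name
        ],
    )
```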
Defining the Connection between Overcooked-AI and Cooperative Cuisine
Cooperative Cuisine is highly modular, thanks to Hydra as the config manager. Parameters from Overcooked-AI are used directly in the config file, with a layout mapping defined to convert Overcooked-AI layouts into the Cooperative Cuisine format.
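As a purely illustrative sketch of such a mapping (the Overcooked-AI symbols are the common ones from its layout files, while the Cooperative Cuisine symbols here are assumptions; the actual mapping is defined in the config files):

```python
# Purely illustrative: a character-level mapping between an Overcooked-AI layout
# string and the Cooperative Cuisine layout format. Symbol choices on the
# Cooperative Cuisine side are assumptions, not the project's actual mapping.
OVERCOOKED_TO_COOP_CUISINE = {
    "X": "#",  # plain counter
    "P": "U",  # cooking pot
    "O": "N",  # onion dispenser
    "D": "P",  # plate/dish dispenser
    "S": "W",  # serving window
    " ": " ",  # free floor space
}


def convert_layout(overcooked_layout: str) -> str:
    """Translate an Overcooked-AI layout string line by line."""
    return "\n".join(
        "".join(OVERCOOKED_TO_COOP_CUISINE.get(ch, ch) for ch in line)
        for line in overcooked_layout.splitlines()
    )
```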
Results on Overcooked-AI Layouts
Because Overcooked-AI lacks a cutting board and does not include random environments, its tasks are comparatively simple. Cooperative Cuisine was able to replicate the Overcooked-AI results and achieved good performance on several layouts, even with random counter placement.
Experiences with Reinforcement Learning on the Cooperative Cuisine Environment
Introducing Intermediate Rewards
Introducing intermediate rewards is crucial as a meal can require up to 20 moves, making it unlikely for an agent to succeed by chance. Intermediate rewards should be small compared to final rewards, but they help guide the agent toward good actions. Careful balancing is needed to manage negative rewards for actions like using the trashcan, as too high a penalty could discourage learning entirely, while too lenient a policy could lead to undesirable behaviors like throwing away items unnecessarily.
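A hedged sketch of what such a reward structure could look like; the hook names and magnitudes below are illustrative assumptions, not the values used in the project's configs.

```python
# Illustrative reward shaping: small intermediate rewards guide the agent,
# the final serving reward dominates, and the trashcan penalty stays mild so
# it does not suppress exploration. Hook names and values are assumptions.
INTERMEDIATE_REWARDS = {
    "picked_up_ingredient": 0.1,
    "finished_cutting": 0.5,
    "put_meal_on_plate": 1.0,
}
FINAL_REWARD = {"meal_served": 10.0}
PENALTIES = {"item_trashed": -0.5}  # too negative and the agent avoids interacting at all
```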
Shuffle Counter
Using a pre-defined environment simplifies the problem, allowing the agent to learn quickly. However, this approach limits insight into how well the agent utilizes the representation. Introducing random counter shuffling increases difficulty, forcing the agent to depend more on the representation, thus making the learning process more challenging and meaningful.
Increasing Environment Size
Increasing environment size significantly impacts the agent's learning process. More steps are required for actions, particularly for tasks like plating meals. Combining large environments with cutting tasks creates additional complexity. If cutting is not required, agents handle large (and even dynamic) environments well, as interaction actions are less critical.
Using the Cutting Board
The cutting board presents a major challenge for the agent, especially when multiple cut items are needed. The agent can become fixated on cutting tasks and struggle with other actions like cooking afterward. Careful reward shaping is essential for addressing this issue.
PPO (Proximal Policy Optimization) Insights
PPO can be unstable, showing good progress and then plateauing. A recommended game time limit is between 150 and 300 seconds, depending on the complexity of the task. For faster training, a lower time limit can be effective.
Recommended PPO Hyperparameters:
- Ent_coef: between 0 and 0.01 to aid exploration.
- Batch size: 256
- Number of environments (n_envs): 32
- Learning rate: 0.0006
- Gamma: set high (close to 1) to favor long-term rewards.
The number of timesteps varies significantly based on the task's complexity (e.g., whether cutting is required, environment size).
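Put together, a training call with these settings might look roughly as follows. This is a sketch assuming the stable_baselines3 PPO implementation and the hypothetical EnvGymWrapper from earlier; gamma is set to 0.99 as one example of a "high" value.

```python
# Sketch of a PPO setup with the recommended hyperparameters; the env wrapper
# import is a placeholder for the actual class in gym_env.py.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from gym_env import EnvGymWrapper  # hypothetical wrapper from gym_env.py

vec_env = make_vec_env(EnvGymWrapper, n_envs=32)  # 32 parallel environments
model = PPO(
    "MlpPolicy",
    vec_env,
    learning_rate=0.0006,
    batch_size=256,
    ent_coef=0.01,   # between 0 and 0.01 to aid exploration
    gamma=0.99,      # high discount factor for long-term rewards
)
model.learn(total_timesteps=5_000_000)  # adjust to the task's complexity
```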
Results
Results on the Overcooked-AI Layouts
- Preparing onion soup in the Overcooked-AI centre-pots environment with a fixed counter layout
- Preparing onion soup in the Overcooked-AI centre-pots environment with a fixed counter layout and added cutting board
- Preparing onion soup in the Overcooked-AI large environment with a fixed counter layout
- Preparing onion soup in the Overcooked-AI large environment with a random counter layout
- Preparing a tomato soup in the Cooperative Cuisine environment with a fixed counter layout
- Preparing a tomato soup in the Cooperative Cuisine environment with a random counter layout
- Preparing a salad in the Cooperative Cuisine environment with a random counter layout