- Cooperative Cuisine and Reinforcement Learning
- Key Python Files in Reinforcement Learning Folder
- Obs Converter Subfolder
- Configurations and Hydra Integration
- Overcooked-AI and Cooperative Cuisine
- Using Overcooked-AI Levels and Configs in Cooperative Cuisine
- Defining the Connection between Overcooked-AI and Cooperative Cuisine
- Results on Overcooked-AI Layouts
- Experiences with Reinforcement Learning on the Cooperative Cuisine Environment
- Introducing Intermediate Rewards
- Shuffle Counter
- Increasing Environment Size
- Using the Cutting Board
- PPO (Proximal Policy Optimization) Insights
- Recommended PPO Hyperparameters:
- Results
- Results on the Overcooked-AI Layouts
Cooperative Cuisine and Reinforcement Learning
Cooperative Cuisine can be used to train a reinforcement learning agent. In this implementation, stable_baselines is used to load the RL algorithm.
Key Python Files in Reinforcement Learning Folder
- gym_env.py: Implements the typical reinforcement learning functions step and reset. Additionally, it calls the state_to_observation converter so that the reinforcement learning process learns on the predefined representation (see the sketch after this list).
- train_single_agent.py: Trains a single agent using predefined configurations (managed with Hydra). Also enables multirun or a hyperparameter sweeper, if defined in rl_config.yaml.
- run_single_agent.py: Enables loading a trained agent and allows the agent to play in the environment.
- play_gym.py: Allows a user to play a character in the gym manually. This can be helpful to inspect the representation or test different hooks and rewards.
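As a minimal sketch of how these pieces fit together with stable_baselines: the wrapper class name EnvGymWrapper and its import path below are assumptions for illustration and may not match the actual class defined in gym_env.py.

```python
# Minimal sketch, assuming gym_env.py exposes a Gymnasium-compatible wrapper.
# "EnvGymWrapper" and its constructor are placeholders, not the project's exact API.
from stable_baselines3 import PPO

from gym_env import EnvGymWrapper  # hypothetical import path and class name

env = EnvGymWrapper()  # internally applies the state_to_observation converter
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_cooperative_cuisine")
```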
Obs Converter Subfolder
Within the obs_converter subfolder, several converters are defined for converting the environment into vector representations, which are then used in gym_env.py. When developing new converters, ensure they are properly flattened, as only flattened arrays are processed correctly by PPO. Note that the CnnPolicy is supported for images only and not for multi-dimensional vectors.
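As an illustration, a flattened converter could look roughly like the following; the class name and the grid-based state fields are assumptions, since the actual converter interface lives in the obs_converter subfolder.

```python
# Hypothetical converter sketch: the class name and the state fields used here
# are assumptions, not the actual interface of the obs_converter subfolder.
import numpy as np


class FlattenedGridConverter:
    """Encodes the kitchen grid as a flat float32 vector for MLP-based PPO."""

    def __init__(self, width: int, height: int, num_item_types: int):
        self.width = width
        self.height = height
        self.num_item_types = num_item_types

    def state_to_observation(self, state: dict) -> np.ndarray:
        # One-hot encode the item type on every counter cell ...
        grid = np.zeros((self.height, self.width, self.num_item_types), dtype=np.float32)
        for (x, y), item_type in state["counters"].items():
            grid[y, x, item_type] = 1.0
        # ... and flatten, because PPO's MlpPolicy expects a 1-D observation.
        return grid.flatten()
```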
Configurations and Hydra Integration
All configurations are managed with Hydra. In the reinforcement_learning/config folder, the rl_config.yaml file contains the main configuration details. The following subfolders hold configurations that can be overridden via command line arguments:
- Model Configs: Located in the model folder, these hold possible models with their respective hyperparameters.
- Sweeper: A hyperparameter sweeper is implemented in rl_config.yaml and is activated when train_single_agent.py is called with the multirun argument. Otherwise, the normal model parameters are used (see the sketch after this list).
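As a rough sketch of how such a Hydra-managed entry point and its overrides look (the decorator arguments and script layout are assumptions, not the exact contents of train_single_agent.py):

```python
# Illustrative sketch of a Hydra entry point; config_path/config_name follow the
# folder layout described above, but the real train_single_agent.py may differ.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="config", config_name="rl_config")
def main(cfg: DictConfig) -> None:
    # cfg.model holds the hyperparameters selected from the model/ folder.
    # Any value can be overridden on the command line, e.g.
    #   python train_single_agent.py model.learning_rate=0.0003
    # and the sweeper is triggered with the multirun flag, e.g.
    #   python train_single_agent.py --multirun model.learning_rate=0.0003,0.0006
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```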
The layout files for the project are stored in the cooperative_cuisine/config folder rather than the reinforcement_learning/config folder.
Weights & Biases integration is used to track and manage the training process.
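A minimal sketch of such tracking, assuming the standard stable_baselines3 callback shipped with the wandb package; the project name and the EnvGymWrapper import are placeholders, not the values used in this repository.

```python
# Sketch of Weights & Biases tracking around a stable_baselines3 training run.
import wandb
from stable_baselines3 import PPO
from wandb.integration.sb3 import WandbCallback

from gym_env import EnvGymWrapper  # hypothetical wrapper from gym_env.py

run = wandb.init(project="cooperative-cuisine-rl", sync_tensorboard=True)
model = PPO("MlpPolicy", EnvGymWrapper(), tensorboard_log=f"runs/{run.id}")
model.learn(total_timesteps=1_000_000, callback=WandbCallback())
run.finish()
```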
Overcooked-AI and Cooperative Cuisine
Using Overcooked-AI Levels and Configs in Cooperative Cuisine
All layouts from Overcooked-AI can be used within Cooperative Cuisine. Dedicated configs are defined and can be loaded via Hydra. To use Overcooked-AI layouts:
- Set overcooked-ai_environment_config.yaml as the environment config.
- Define any layout from Overcooked-AI under layout_name.
- Set item_config to item_info_overcooked-ai.yaml.
These configurations ensure that Overcooked-AI layouts and rewards are applied.
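The same selection can also be made programmatically through Hydra's compose API. The sketch below is only illustrative: the config group names and override keys are assumptions based on the steps above, and the layout name cramped_room is just one example from Overcooked-AI.

```python
# Hypothetical: composing the Overcooked-AI configuration with Hydra's compose API
# instead of command-line overrides. The group/key names are assumptions.
from hydra import compose, initialize

with initialize(version_base=None, config_path="config"):
    cfg = compose(
        config_name="rl_config",
        overrides=[
            "environment=overcooked-ai_environment_config",  # assumed group name
            "environment.layout_name=cramped_room",          # any Overcooked-AI layout
            "item_config=item_info_overcooked-ai",           # assumed group name
        ],
    )
```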
Defining the Connection between Overcooked-AI and Cooperative Cuisine
Cooperative Cuisine is highly modular, thanks to Hydra as the config manager. Parameters from Overcooked-AI are used directly in the config file, with a layout mapping defined to convert Overcooked-AI layouts into the Cooperative Cuisine format.
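As a purely illustrative sketch of such a mapping (the Overcooked-AI symbols are the common ones from its layout files, while the Cooperative Cuisine symbols here are assumptions; the actual mapping is defined in the config files):

```python
# Purely illustrative: a character-level mapping between an Overcooked-AI layout
# string and the Cooperative Cuisine layout format. Symbol choices on the
# Cooperative Cuisine side are assumptions, not the project's actual mapping.
OVERCOOKED_TO_COOP_CUISINE = {
    "X": "#",  # plain counter
    "P": "U",  # cooking pot
    "O": "N",  # onion dispenser
    "D": "P",  # plate/dish dispenser
    "S": "W",  # serving window
    " ": " ",  # free floor space
}


def convert_layout(overcooked_layout: str) -> str:
    """Translate an Overcooked-AI layout string line by line."""
    return "\n".join(
        "".join(OVERCOOKED_TO_COOP_CUISINE.get(ch, ch) for ch in line)
        for line in overcooked_layout.splitlines()
    )
```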
Results on Overcooked-AI Layouts
Because Overcooked-AI lacks a cutting board and does not include random environments, its tasks are comparatively simple. Cooperative Cuisine was able to replicate the Overcooked-AI results and achieved good performance on several layouts, even with random counter placement.
Experiences with Reinforcement Learning on the Cooperative Cuisine Environment
Introducing Intermediate Rewards
Introducing intermediate rewards is crucial as a meal can require up to 20 moves, making it unlikely for an agent to succeed by chance. Intermediate rewards should be small compared to final rewards, but they help guide the agent toward good actions. Careful balancing is needed to manage negative rewards for actions like using the trashcan, as too high a penalty could discourage learning entirely, while too lenient a policy could lead to undesirable behaviors like throwing away items unnecessarily.
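A hedged sketch of what such a reward structure could look like; the hook names and magnitudes below are illustrative assumptions, not the values used in the project's configs.

```python
# Illustrative reward shaping: small intermediate rewards guide the agent,
# the final serving reward dominates, and the trashcan penalty stays mild so
# it does not suppress exploration. Hook names and values are assumptions.
INTERMEDIATE_REWARDS = {
    "picked_up_ingredient": 0.1,
    "finished_cutting": 0.5,
    "put_meal_on_plate": 1.0,
}
FINAL_REWARD = {"meal_served": 10.0}
PENALTIES = {"item_trashed": -0.5}  # too negative and the agent avoids interacting at all
```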
Shuffle Counter
Using a pre-defined environment simplifies the problem, allowing the agent to learn quickly. However, this approach limits insight into how well the agent utilizes the representation. Introducing random counter shuffling increases difficulty, forcing the agent to depend more on the representation, thus making the learning process more challenging and meaningful.
Increasing Environment Size
Increasing environment size significantly impacts the agent's learning process. More steps are required for actions, particularly for tasks like plating meals. Combining large environments with cutting tasks creates additional complexity. If cutting is not required, agents handle large (and even dynamic) environments well, as interaction actions are less critical.
Using the Cutting Board
The cutting board presents a major challenge for the agent, especially when multiple cut items are needed. The agent can become fixated on cutting tasks and struggle with other actions like cooking afterward. Careful reward shaping is essential for addressing this issue.
PPO (Proximal Policy Optimization) Insights
PPO can be unstable, showing good progress and then plateauing. A recommended game time limit is between 150 and 300 seconds, depending on the complexity of the task. For faster training, a lower time limit can be effective.
Recommended PPO Hyperparameters:
- Ent_coef: between 0 and 0.01 to aid exploration.
- Batch size: 256
- Number of environments (n_envs): 32
- Learning rate: 0.0006
- Gamma: set high (close to 1) to favor long-term rewards.
The number of timesteps varies significantly based on the task's complexity (e.g., whether cutting is required, environment size).
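Put together, a training call with these settings might look roughly as follows. This is a sketch assuming the stable_baselines3 PPO implementation and the hypothetical EnvGymWrapper from earlier; gamma is set to 0.99 as one example of a "high" value.

```python
# Sketch of a PPO setup with the recommended hyperparameters; the env wrapper
# import is a placeholder for the actual class in gym_env.py.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from gym_env import EnvGymWrapper  # hypothetical wrapper from gym_env.py

vec_env = make_vec_env(EnvGymWrapper, n_envs=32)  # 32 parallel environments
model = PPO(
    "MlpPolicy",
    vec_env,
    learning_rate=0.0006,
    batch_size=256,
    ent_coef=0.01,   # between 0 and 0.01 to aid exploration
    gamma=0.99,      # high discount factor for long-term rewards
)
model.learn(total_timesteps=5_000_000)  # adjust to the task's complexity
```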
Results
Results on the Overcooked-AI Layouts
- Preparing onion soup in the Overcooked-AI centre-pots environment with a fixed counter layout
- Preparing onion soup in the Overcooked-AI centre-pots environment with a fixed counter layout and added cutting board
- Preparing onion soup in the Overcooked-AI large environment with a fixed counter layout
- Preparing onion soup in the Overcooked-AI large environment with a random counter layout
- Preparing a tomato soup in the Cooperative Cuisine environment with a fixed counter layout
- Preparing a tomato soup in the Cooperative Cuisine environment with a random counter layout
- Preparing a salad in the Cooperative Cuisine environment with a random counter layout