Commit 7b29ad5e authored by Christoph Kowalski

Update READEME and cleanup folder

parent 52f4612e
No related branches found
No related tags found
2 merge requests: !110 V1.2.0 changes, !109 SB3 RL with Hydra
Pipeline #63922 passed
Showing with 130 additions and 750 deletions
# Cooperative Cuisine and Reinforcement Learning
## Key Python Files in Reinforcement Learning Folder
1. **gym_env.py**
Implements the typical reinforcement learning functions: `step` and `reset`. Additionally, it calls the `state_to_observation` converter for the reinforcement learning process to learn on the predefined representation.
2. **train_single_agent.py**
Trains a single agent using predefined configurations (managed with Hydra). Also enables multirun or a hyperparameter sweeper, if defined in the `rl_config.yaml`.
3. **run_single_agent.py**
Enables loading a trained agent and allows the agent to play in the environment.
4. **play_gym.py**
Allows a user to play a character in the gym manually. This can be helpful to inspect the representation or test different hooks and rewards.
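A minimal interaction loop with the resulting gym environment might look as follows. This is only a sketch: it assumes the wrapper is importable as `EnvGymWrapper`, can be constructed with its default (Hydra-provided) configuration, and follows the Gymnasium-style `reset`/`step` signatures.

```python
# Sketch of a random-action rollout in the Cooperative Cuisine gym environment.
# Assumes EnvGymWrapper builds its default config itself and exposes the
# Gymnasium-style API (reset -> (obs, info), step -> 5-tuple).
from cooperative_cuisine.reinforcement_learning import EnvGymWrapper

env = EnvGymWrapper()
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random action, just to exercise step()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```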
### Obs Converter Subfolder
Within the `obs_converter` subfolder, several converters are defined for converting the environment into vector representations, which are then used in `gym_env.py`. When developing new converters, ensure they are properly flattened as only flattened arrays are processed correctly by PPO. Note that **CNNPolicy** is supported for images only and not for multi-dimensional vectors.
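As a rough illustration of this requirement, a converter could assemble its features and hand PPO a single one-dimensional `float32` vector. The names below (`FlatGridConverter`, the `state` keys) are hypothetical and only illustrate the flattening step, not the actual converter interface:

```python
import numpy as np

# Hypothetical converter sketch: however the per-cell and per-player features are
# built, the observation returned to the RL loop should be a flat 1-D float32 array.
class FlatGridConverter:
    def convert(self, state: dict) -> np.ndarray:
        grid = np.asarray(state["grid_features"], dtype=np.float32)       # e.g. shape (w, h, c)
        players = np.asarray(state["player_features"], dtype=np.float32)  # e.g. shape (n, f)
        return np.concatenate([grid.ravel(), players.ravel()])            # flattened for PPO
```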
---
### Configurations and Hydra Integration
All configurations are managed with [Hydra](https://hydra.cc/). In the `reinforcement_learning/config` folder, the `rl_config.yaml` file contains the main configuration details. The following subfolders hold configurations that can be overridden via command line arguments:
- **Model Configs:** Located in the `model` folder, these hold possible models with their respective hyperparameters.
- **Sweeper:** A hyperparameter sweeper is implemented in `rl_config.yaml` and is activated when `train_single_agent.py` is called with the multirun argument. If not, the normal model parameters are used.
The layout files for the project are stored in the `cooperative_cuisine/config` folder rather than the `reinforcement_learning/config` folder.
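For example, a config group can be overridden on the command line (e.g. `python train_single_agent.py model=<model_name>`, or with `--multirun` to activate the sweeper). The same composition can be reproduced in Python via Hydra's compose API; the snippet below is only a sketch, and the config path, group name, and override value are assumptions based on the folder layout described above:

```python
# Sketch: compose rl_config.yaml with an override, mirroring a command-line call such as
#   python train_single_agent.py model=<model_name>   (add --multirun for the sweeper).
# The config path and the "model=ppo" override are placeholders, not verified names.
from hydra import compose, initialize

with initialize(version_base=None, config_path="config"):
    cfg = compose(config_name="rl_config", overrides=["model=ppo"])
    print(cfg.model)
```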
**Weights & Biases** integration is used to track and manage the training process.
---
## Overcooked-AI and Cooperative Cuisine
### Using Overcooked-AI Levels and Configs in Cooperative Cuisine
All layouts from **Overcooked-AI** can be used within Cooperative Cuisine. Dedicated configs are defined and can be loaded via Hydra. To use Overcooked-AI layouts:
1. Set the `overcooked-ai_environment_config.yaml` as the environment config.
2. Define any layout from Overcooked-AI under `layout_name`.
3. Set `item_config` to `item_info_overcooked-ai.yaml`.
These configurations ensure that Overcooked-AI layouts and rewards are applied.
---
### Defining the Connection between Overcooked-AI and Cooperative Cuisine
Cooperative Cuisine is highly modular, thanks to Hydra as the config manager. Parameters from Overcooked-AI are used directly in the config file, with a layout mapping defined to convert Overcooked-AI layouts into the Cooperative Cuisine format. These layout files must be present in `cooperative_cuisine/reinforcement_learning/layouts/overcooked_ai_layouts`.
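As a rough illustration of such a mapping, a conversion could translate the Overcooked-AI grid symbols into the layout characters used here. The symbol pairs below are assumptions for demonstration only, not the exact table shipped in the repository:

```python
# Illustrative Overcooked-AI -> Cooperative Cuisine layout mapping (assumed symbols).
# The real mapping and layouts live in reinforcement_learning/layouts/overcooked_ai_layouts.
OVERCOOKED_TO_CC = {
    "X": "#",  # counter        -> Counter
    "O": "N",  # onion station  -> Onion
    "T": "T",  # tomato station -> Tomato
    "P": "U",  # pot            -> Pot (with stove)
    "D": "P",  # dish dispenser -> PlateDispenser
    "S": "$",  # serving spot   -> ServingWindow
    " ": "_",  # floor          -> Free
    "1": "A",  # player start   -> Agent
    "2": "A",
}

def convert_layout(overcooked_grid: str) -> str:
    """Translate an Overcooked-AI ASCII layout into Cooperative Cuisine layout characters."""
    return "\n".join(
        "".join(OVERCOOKED_TO_CC.get(ch, "#") for ch in row)
        for row in overcooked_grid.splitlines()
    )
```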
---
### Results on Overcooked-AI Layouts
Because Overcooked-AI includes neither cutting boards nor randomized environments, its results could be replicated in Cooperative Cuisine, and good performance was also achieved on several Overcooked-AI layouts even with random counter placement.
---
## Experiences with Reinforcement Learning on the Cooperative Cuisine Environment
### Introducing Intermediate Rewards
Introducing intermediate rewards is crucial: with six possible actions per step, a meal can require up to 20 moves in the correct order, making it very unlikely for an agent to succeed by chance. Intermediate rewards should be small compared to the final rewards, but they help guide the agent toward good actions. Negative rewards, such as for using the trashcan, need careful balancing: too high a penalty can discourage the agent from interacting (and thus from learning) at all, while no penalty can lead to undesirable behavior such as throwing items away unnecessarily.
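In this repository the rewards are attached to environment hooks via `ScoreViaHooks` in the config (see the `hook_callbacks` sections of this commit). As a rough illustration of the balance described above, the relative magnitudes from the example config look like this:

```python
# Relative reward magnitudes taken from the example hook_callbacks config in this commit:
# the final serving reward dominates, intermediate rewards stay small, and the trashcan
# penalty is kept at or near zero so the agent is not discouraged from interacting.
REWARD_SHAPING = {
    "completed_order": 0.95,                 # final reward: an ordered meal was served
    "cooking_finished": 0.10,                # intermediate: a meal finished cooking
    "plated_meal": 0.05,                     # intermediate: a meal was placed on a plate
    "cutting_board_100": 0.01,               # intermediate: an item was fully cut
    "drop_off_on_cooking_equipment": 0.005,  # intermediate: ingredient added to pot/pan
    "trashcan_usage": 0.0,                   # keep small; a large penalty suppresses interaction
}
```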
---
### Shuffle Counter
Using a pre-defined environment simplifies the problem, allowing the agent to learn quickly. However, this approach limits insight into how well the agent utilizes the representation. Introducing random counter shuffling increases difficulty, forcing the agent to depend more on the representation, thus making the learning process more challenging and meaningful.
---
### Increasing Environment Size
Increasing environment size significantly impacts the agent's learning process. More steps are required for actions, particularly for tasks like plating meals. Combining large environments with cutting tasks creates additional complexity. If cutting is not required, agents handle large (and even dynamic) environments well, as interaction actions are less critical.
---
### Using the Cutting Board
The cutting board presents a major challenge for the agent, especially when multiple cut items are needed. The agent can become fixated on cutting tasks and struggle with other actions like cooking afterward. Careful reward shaping is essential for addressing this issue.
---
### PPO (Proximal Policy Optimization) Insights
PPO can be unstable, showing good progress and then plateauing. The `ent_coef` value can be set between `0` and `0.01`; a higher value within this range aids exploration. A recommended game time limit (`time_limit_seconds`) is between 150 and 300 seconds, depending on the complexity of the task; a lower time limit can speed up training when the task needs fewer steps.
#### Recommended PPO Hyperparameters:
- **Batch size:** 256
- **Number of environments (n_envs):** 32
- **Learning rate:** 0.0006
- **Gamma:** Set high for long-term rewards.
The number of timesteps varies significantly based on the task's complexity (e.g., whether cutting is required, environment size).
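A minimal Stable-Baselines3 training call with the hyperparameters above might look as follows. This is a sketch only: it assumes `EnvGymWrapper` can be used directly as an environment factory, while in practice `train_single_agent.py` builds the equivalent setup from the Hydra config.

```python
# Sketch of PPO training with the hyperparameters recommended above.
# Assumes EnvGymWrapper works as a plain Gymnasium env factory; the repository's
# train_single_agent.py drives the equivalent setup via Hydra.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from cooperative_cuisine.reinforcement_learning import EnvGymWrapper

vec_env = make_vec_env(EnvGymWrapper, n_envs=32)  # 32 parallel environments
model = PPO(
    "MlpPolicy",
    vec_env,
    learning_rate=0.0006,
    batch_size=256,
    ent_coef=0.01,  # small entropy bonus (0 to 0.01) to aid exploration
    gamma=0.99,     # high gamma, since rewards are long-term
    verbose=1,
)
model.learn(total_timesteps=1_000_000)  # timesteps vary strongly with task complexity
```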
@@ -7,7 +7,7 @@ plates:
# range of seconds until the dirty plate arrives.
game:
time_limit_seconds: 300
time_limit_seconds: 100
undo_dispenser_pickup: true
validate_recipes: false
@@ -49,10 +49,9 @@ layout_chars:
dquote: Counter # " wall/truck
p: Counter # second plate return ??
orders:
order_generator:
_target_: "cooperative_cuisine.orders.RandomOrderGeneration"
_target_: "cooperative_cuisine.orders.DeterministicOrderGeneration"
_partial_: true
meals:
all: false
@@ -61,35 +60,24 @@ orders:
list:
- TomatoSoup
- OnionSoup
- Salad
#- Salad
# - FriedFish
# the class that receives the kwargs. Should be a child class of OrderGeneration in orders.py
order_gen_kwargs:
order_duration_random_func:
# how long should the orders be alive
# 'random' library call with getattr, kwargs are passed to the function
func: uniform
kwargs:
a: 40
b: 60
max_orders: 6
# maximum number of active orders at the same time
num_start_meals: 2
# number of orders generated at the start of the environment
sample_on_dur_random_func:
# 'random' library call with getattr, kwargs are passed to the function
func: uniform
kwargs:
a: 10
b: 20
sample_on_serving: false
# Sample the delay for the next order only after a meal was served.
# structure: [meal_name, start, duration] (start and duration as seconds or timedeltas https://github.com/wroberts/pytimeparse)
timed_orders:
- [ TomatoSoup, 0:00, 0:10 ]
- [ OnionSoup, 0:00, 0:10 ]
- [ TomatoSoup, 0:10, 0:10 ]
- [ TomatoSoup, 0:15, 0:06 ]
never_no_order: False
never_no_order_update_all_remaining: False
serving_not_ordered_meals: true
# can meals that are not ordered be served / dropped on the serving window
player_config:
radius: 0.4
speed_units_per_seconds: 1
interaction_range: 1.6
interaction_range: 0.6
restricted_view: False
view_angle: 95
@@ -109,7 +97,6 @@ hook_callbacks:
_partial_: true
callback_class_kwargs:
static_score: 0.95
serve_not_ordered_meals:
hooks: [ serve_not_ordered_meal ]
callback_class:
@@ -123,35 +110,49 @@ hook_callbacks:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: -0.2
static_score: 0.00
item_cut:
hooks: [ cutting_board_100 ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.1
static_score: 0.01
stepped:
hooks: [ post_step ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: -0.01
static_score: 0
combine:
hooks: [ drop_off_on_cooking_equipment ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.01
static_score: 0.005
start_interact:
hooks: [ player_start_interaction ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.01
static_score: 0.0
plate_meal:
hooks: [ plated_meal ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.05
cooking_finished:
hooks: [ cooking_finished ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.1
# json_states:
# hooks: [ json_state ]
plates:
clean_plates: 2
dirty_plates: 0
plate_delay: [ 2, 4 ]
return_dirty: False
# range of seconds until the dirty plate arrives.
game:
time_limit_seconds: 100
undo_dispenser_pickup: true
validate_recipes: false
layout_name: configs/layouts/rl/rl_small.layout
layout_chars:
_: Free
hash: Counter # #
A: Agent
pipe: Extinguisher
P: PlateDispenser
C: CuttingBoard
X: Trashcan
$: ServingWindow
S: Sink
+: SinkAddon
at: Plate # @ just a clean plate on a counter
U: Pot # with Stove
Q: Pan # with Stove
O: Peel # with Oven
F: Basket # with DeepFryer
T: Tomato
N: Onion # oNioN
L: Lettuce
K: Potato # Kartoffel
I: Fish # fIIIsh
D: Dough
E: Cheese # chEEEse
G: Sausage # sausaGe
B: Bun
M: Meat
question: Counter # ? mushroom
: Counter
^: Counter
right: Counter
left: Counter
wave: Free # ~ Water
minus: Free # - Ice
dquote: Counter # " wall/truck
p: Counter # second plate return ??
orders:
order_generator:
_target_: "cooperative_cuisine.orders.DeterministicOrderGeneration"
_partial_: true
meals:
all: false
# if all: false -> only orders for these meals are generated
# TODO: what if this list is empty?
list:
- TomatoSoup
- OnionSoup
#- Salad
# - FriedFish
# the class that receives the kwargs. Should be a child class of OrderGeneration in orders.py
order_gen_kwargs:
# structure: [meal_name, start, duration] (start and duration as seconds or timedeltas https://github.com/wroberts/pytimeparse)
timed_orders:
- [ TomatoSoup, 0:00, 0:10 ]
- [ OnionSoup, 0:00, 0:10 ]
- [ TomatoSoup, 0:10, 0:10 ]
- [ TomatoSoup, 0:15, 0:06 ]
never_no_order: False
never_no_order_update_all_remaining: False
serving_not_ordered_meals: true
player_config:
radius: 0.4
speed_units_per_seconds: 1
interaction_range: 0.6
restricted_view: False
view_angle: 95
effect_manager: { }
# FireManager:
# class: !!python/name:cooperative_cuisine.effects.FireEffectManager ''
# kwargs:
# spreading_duration: [ 5, 10 ]
# fire_burns_ingredients_and_meals: true
hook_callbacks:
# # --------------- Scoring ---------------
orders:
hooks: [ completed_order ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.95
serve_not_ordered_meals:
hooks: [ serve_not_ordered_meal ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.95
trashcan_usages:
hooks: [ trashcan_usage ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.00
item_cut:
hooks: [ cutting_board_100 ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.01
stepped:
hooks: [ post_step ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0
combine:
hooks: [ drop_off_on_cooking_equipment ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.005
start_interact:
hooks: [ player_start_interaction ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.0
plate_meal:
hooks: [ plated_meal ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.05
cooking_finished:
hooks: [ cooking_finished ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.1
# json_states:
# hooks: [ json_state ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/json_states.jsonl
# actions:
# hooks: [ pre_perform_action ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/LOG_RECORD_NAME.jsonl
# random_env_events:
# hooks: [ order_duration_sample, plate_out_of_kitchen_time ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/LOG_RECORD_NAME.jsonl
# add_hook_ref: true
# env_configs:
# hooks: [ env_initialized, item_info_config ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/LOG_RECORD_NAME.jsonl
# add_hook_ref: true
plates:
clean_plates: 2
dirty_plates: 0
plate_delay: [ 2, 4 ]
return_dirty: False
# range of seconds until the dirty plate arrives.
game:
time_limit_seconds: 300
undo_dispenser_pickup: true
validate_recipes: false
layout_name: configs/layouts/rl/rl_small.layout
layout_chars:
_: Free
hash: Counter # #
A: Agent
pipe: Extinguisher
P: PlateDispenser
C: CuttingBoard
X: Trashcan
$: ServingWindow
S: Sink
+: SinkAddon
at: Plate # @ just a clean plate on a counter
U: Pot # with Stove
Q: Pan # with Stove
O: Peel # with Oven
F: Basket # with DeepFryer
T: Tomato
N: Onion # oNioN
L: Lettuce
K: Potato # Kartoffel
I: Fish # fIIIsh
D: Dough
E: Cheese # chEEEse
G: Sausage # sausaGe
B: Bun
M: Meat
question: Counter # ? mushroom
: Counter
^: Counter
right: Counter
left: Counter
wave: Free # ~ Water
minus: Free # - Ice
dquote: Counter # " wall/truck
p: Counter # second plate return ??
orders:
order_generator:
_target_: "cooperative_cuisine.orders.RandomOrderGeneration"
_partial_: true
meals:
all: true
# if all: false -> only orders for these meals are generated
# TODO: what if this list is empty?
list:
- TomatoSoup
- OnionSoup
- Salad
# the class that receives the kwargs. Should be a child class of OrderGeneration in orders.py
order_gen_kwargs:
order_duration_random_func:
# how long should the orders be alive
# 'random' library call with getattr, kwargs are passed to the function
func: uniform
kwargs:
a: 40
b: 60
max_orders: 6
# maximum number of active orders at the same time
num_start_meals: 2
# number of orders generated at the start of the environment
sample_on_dur_random_func:
# 'random' library call with getattr, kwargs are passed to the function
func: uniform
kwargs:
a: 10
b: 20
sample_on_serving: false
# Sample the delay for the next order only after a meal was served.
serving_not_ordered_meals: true
# can meals that are not ordered be served / dropped on the serving window
player_config:
radius: 0.4
speed_units_per_seconds: 1
interaction_range: 1.6
restricted_view: False
view_angle: 95
effect_manager: { }
# FireManager:
# class: !!python/name:cooperative_cuisine.effects.FireEffectManager ''
# kwargs:
# spreading_duration: [ 5, 10 ]
# fire_burns_ingredients_and_meals: true
hook_callbacks:
# # --------------- Scoring ---------------
orders:
hooks: [ completed_order ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.1
serve_not_ordered_meals:
hooks: [ serve_not_ordered_meal ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0.1
trashcan_usages:
hooks: [ trashcan_usage ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: -0.2
item_cut:
hooks: [ cutting_board_100 ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0
stepped:
hooks: [ post_step ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0
combine:
hooks: [ drop_off_on_cooking_equipment ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0
start_interact:
hooks: [ player_start_interaction ]
callback_class:
_target_: "cooperative_cuisine.scores.ScoreViaHooks"
_partial_: true
callback_class_kwargs:
static_score: 0
# json_states:
# hooks: [ json_state ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/json_states.jsonl
# actions:
# hooks: [ pre_perform_action ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/LOG_RECORD_NAME.jsonl
# random_env_events:
# hooks: [ order_duration_sample, plate_out_of_kitchen_time ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/LOG_RECORD_NAME.jsonl
# add_hook_ref: true
# env_configs:
# hooks: [ env_initialized, item_info_config ]
# record_class: !!python/name:cooperative_cuisine.recording.LogRecorder ''
# record_class_kwargs:
# record_path: USER_LOG_DIR/ENV_NAME/LOG_RECORD_NAME.jsonl
# add_hook_ref: true
# def setup_vectorization(self) -> VectorStateGenerationData:
# grid_base_array = np.zeros(
# (
# int(self.env.kitchen_width),
# int(self.env.kitchen_height),
# 114 + 12 + 4, # TODO calc based on item info
# ),
# dtype=np.float32,
# )
# counter_list = [
# "Counter",
# "CuttingBoard",
# "ServingWindow",
# "Trashcan",
# "Sink",
# "SinkAddon",
# "Stove",
# "DeepFryer",
# "Oven",
# ]
# grid_idxs = [
# (x, y)
# for x in range(int(self.env.kitchen_width))
# for y in range(int(self.env.kitchen_height))
# ]
# # counters do not move
# for counter in self.env.counters:
# grid_idx = np.floor(counter.pos).astype(int)
# counter_name = (
# counter.name
# if isinstance(counter, CookingCounter)
# else (
# repr(counter)
# if isinstance(counter, Dispenser)
# else counter.__class__.__name__
# )
# )
# assert counter_name in counter_list or counter_name.endswith(
# "Dispenser"
# ), f"Unknown Counter {counter}"
# oh_idx = len(counter_list)
# if counter_name in counter_list:
# oh_idx = counter_list.index(counter_name)
#
# one_hot = [0] * (len(counter_list) + 2)
# one_hot[oh_idx] = 1
# grid_base_array[
# grid_idx[0], grid_idx[1], 4 : 4 + (len(counter_list) + 2)
# ] = np.array(one_hot, dtype=np.float32)
#
# grid_idxs.remove((int(grid_idx[0]), int(grid_idx[1])))
#
# for free_idx in grid_idxs:
# one_hot = [0] * (len(counter_list) + 2)
# one_hot[len(counter_list) + 1] = 1
# grid_base_array[
# free_idx[0], free_idx[1], 4 : 4 + (len(counter_list) + 2)
# ] = np.array(one_hot, dtype=np.float32)
#
# player_info_base_array = np.zeros(
# (
# 4,
# 4 + 114,
# ),
# dtype=np.float32,
# )
# order_base_array = np.zeros((10 * (8 + 1)), dtype=np.float32)
#
# return VectorStateGenerationData(
# grid_base_array=grid_base_array,
# oh_len=12,
# )
#
#
# def get_simple_vectorized_item(self, item: Item) -> npt.NDArray[float]:
# name = item.name
# array = np.zeros(21, dtype=np.float32)
# if item.name.startswith("Burnt"):
# name = name[len("Burnt") :]
# array[0] = 1.0
# if name.startswith("Chopped"):
# array[1] = 1.0
# name = name[len("Chopped") :]
# if name in [
# "PizzaBase",
# "GratedCheese",
# "RawChips",
# "RawPatty",
# ]:
# array[1] = 1.0
# name = {
# "PizzaBase": "Dough",
# "GratedCheese": "Cheese",
# "RawChips": "Potato",
# "RawPatty": "Meat",
# }[name]
# if name == "CookedPatty":
# array[2] = 1.0
# name = "Meat"
#
# if name in self.vector_state_generation.meals:
# idx = 3 + self.vector_state_generation.meals.index(name)
# elif name in self.vector_state_generation.ingredients:
# idx = (
# 3
# + len(self.vector_state_generation.meals)
# + self.vector_state_generation.ingredients.index(name)
# )
# else:
# raise ValueError(f"Unknown item {name} - {item}")
# array[idx] = 1.0
# return array
#
#
# def get_vectorized_item(self, item: Item) -> npt.NDArray[float]:
# item_array = np.zeros(114, dtype=np.float32)
#
# if isinstance(item, CookingEquipment) or item.item_info.type == ItemType.Tool:
# assert (
# item.name in self.vector_state_generation.equipments
# ), f"unknown equipment {item}"
# idx = self.vector_state_generation.equipments.index(item.name)
# item_array[idx] = 1.0
# if isinstance(item, CookingEquipment):
# for s_idx, sub_item in enumerate(item.content_list):
# if s_idx > 3:
# print("Too much content in the content list, info dropped")
# break
# start_idx = len(self.vector_state_generation.equipments) + 21 + 2
# item_array[
# start_idx + (s_idx * (21)) : start_idx + ((s_idx + 1) * (21))
# ] = self.get_simple_vectorized_item(sub_item)
#
# else:
# item_array[
# len(self.vector_state_generation.equipments) : len(
# self.vector_state_generation.equipments
# )
# + 21
# ] = self.get_simple_vectorized_item(item)
#
# item_array[
# len(self.vector_state_generation.equipments) + 21 + 1
# ] = item.progress_percentage
#
# if item.active_effects:
# item_array[
# len(self.vector_state_generation.equipments) + 21 + 2
# ] = 1.0 # TODO percentage of fire...
#
# return item_array
#
#
# def get_vectorized_state_full(
# self, player_id: str
# ) -> Tuple[
# npt.NDArray[npt.NDArray[float]],
# npt.NDArray[npt.NDArray[float]],
# float,
# npt.NDArray[float],
# ]:
# grid_array = self.vector_state_generation.grid_base_array.copy()
# for counter in self.env.counters:
# grid_idx = np.floor(counter.pos).astype(int) # store in counter?
# if counter.occupied_by:
# if isinstance(counter.occupied_by, deque):
# ...
# else:
# item = counter.occupied_by
# grid_array[
# grid_idx[0],
# grid_idx[1],
# 4 + self.vector_state_generation.oh_len :,
# ] = self.get_vectorized_item(item)
# if counter.active_effects:
# grid_array[
# grid_idx[0],
# grid_idx[1],
# 4 + self.vector_state_generation.oh_len - 1,
# ] = 1.0 # TODO percentage of fire...
#
# assert len(self.env.players) <= 4, "Too many players for vector representation"
# player_vec = np.zeros(
# (
# 4,
# 4 + 114,
# ),
# dtype=np.float32,
# )
# player_pos = 1
# for player in self.env.players.values():
# if player.name == player_id:
# idx = 0
# player_vec[0, :4] = np.array(
# [
# player.pos[0],
# player.pos[1],
# player.facing_point[0],
# player.facing_point[1],
# ],
# dtype=np.float32,
# )
# else:
# idx = player_pos
#
# if not idx:
# player_pos += 1
# grid_idx = np.floor(player.pos).astype(int) # store in counter?
# player_vec[idx, :4] = np.array(
# [
# player.pos[0] - grid_idx[0],
# player.pos[1] - grid_idx[1],
# player.facing_point[0] / np.linalg.norm(player.facing_point),
# player.facing_point[1] / np.linalg.norm(player.facing_point),
# ],
# dtype=np.float32,
# )
# grid_array[grid_idx[0], grid_idx[1], idx] = 1.0
#
# if player.holding:
# player_vec[idx, 4:] = self.get_vectorized_item(player.holding)
#
# order_array = np.zeros((10 * (8 + 1)), dtype=np.float32)
#
# for i, order in enumerate(self.env.order_manager.open_orders):
# if i > 9:
# print("some orders are not represented in the vectorized state")
# break
# assert (
# order.meal.name in self.vector_state_generation.meals
# ), "unknown meal in order"
# idx = self.vector_state_generation.meals.index(order.meal.name)
# order_array[(i * 9) + idx] = 1.0
# order_array[(i * 9) + 8] = (
# self.env_time - order.start_time
# ).total_seconds() / order.max_duration.total_seconds()
#
# return (
# grid_array,
# player_vec,
# (self.env.env_time - self.env.start_time).total_seconds()
# / (self.env.env_time_end - self.env.start_time).total_seconds(),
# order_array,
# )
# Cooperative Cuisine and Reinforcement Learning
The reinforcement learning folder has four key Python files.
1. gym_env.py: This implements the typical reinforcement learning functions, step and reset. Additionally, it calls the state_to_observation converter so that the agent learns on the pre-defined representation.
2. train_single_agent.py: This trains a single agent on the pre-defined configs (managed with Hydra). It also enables multirun or a hyperparameter sweeper, if defined in rl_config.yaml.
3. run_single_agent.py: This enables loading a trained agent and lets it play in the environment.
4. play_gym.py: This enables playing a character in the gym yourself. This can be helpful for inspecting the representation or trying out different hooks and their rewards.
There is also a subfolder called obs_converter, where several converters for converting the environment into vector representations are defined and used in the gym_env. When developing new converters, make sure the output is properly flattened, as only flattened arrays are processed correctly by PPO. Additionally, the CNNPolicy is only supported for images and not for multi-dimensional vectors.
# Overcooked-AI and Cooperative Cuisine
## Use the overcooked-AI levels and configs in cooperative cuisine
All the layouts from overcooked-AI can be used within cooperative cuisine. Dedicated configs are defined and can be loaded via hydra.
The overcooked-ai_environment_config.yaml must be chosen as environment config. Under layout_name any layout from overcooked-AI can be defined.
Additionally, the item_config must be item_info_overcooked-ai.yaml.
With those chosen configs the layouts and rewards from overcooked-AI are used.
## How is the connection between Overcooked-AI and cooperative cuisine defined?
Cooperative Cuisine is highly modular due to the usage of hydra as config manager.
Therefore, the parameters used for overcooked-AI are simply used in the dedicated config file.
The layout format is different, which is why a mapping is defined which converts the overcooked-AI layout into the cooperative cuisine layout.
The layout file has to be present in cooperative_cuisine/reinforcement_learning/layouts/overcooked_ai_layouts.
## Results on the overcooked-AI layouts
As the overcooked-AI project includes neither a cutting board nor random environments, we were able to replicate the results of overcooked-AI in our environment. We were also able to achieve good performance on several overcooked-AI layouts with random counter placement.
# Experiences with Reinforcement Learning on the cooperative cuisine environment
## Introducing intermediate rewards
The introduction of intermediate rewards is the most important step: with 6 possible actions per iteration and a meal possibly needing up to 20 moves in the correct order, the probability that the agent finds this sequence by chance is very small. Therefore, small intermediate rewards should be given to encourage good actions; compared to the final rewards, however, these should remain small. The usage of the trashcan is especially difficult to manage, as a high negative reward on trashcan usage might lead to the agent not interacting at all and therefore not learning anything, while not punishing trashcan usage may lead to the agent constantly cutting items and throwing them away.
## Shuffle Counter
Having a pre-defined, fixed environment makes the problem considerably easier, and the agent therefore learns quickly. However, this gives little information about the representation being used, as the agent may barely rely on the representation and instead simply learn a fixed order of moves. Introducing random shuffling of counters makes the problem considerably more difficult but also more interesting.
## Increasing environment size
This has a surprisingly large effect, as many more steps are necessary to learn an action. In particular, increasing the environment size makes plating meals more difficult, and in combination with the cutting board it becomes very complex. If the cutting board is not needed, the agent can handle large (and even changing) environments quite well, as it barely needs to consider the interaction action.
## Using the cutting board
Using the cutting board (requiring chopped tomatoes/lettuce etc.) is very difficult for the agent. Especially when several cut items are required, the agent can become very caught up in the cutting task and has great difficulty performing other actions, like cooking, afterwards. Therefore, careful reward shaping is necessary.
## PPO
PPO can be quite unstable: training may progress well and then suddenly plateau.
The ent_coef value can be set between 0 and 0.01; an increased value can help with exploration. Additionally, the game setting time_limit_seconds can be set to a value between 150 and 300, depending on the complexity and length of the task to be learned.
A lower value can speed up training if the task does not need many steps.
Good experiences were made with batch_size: 256, n_envs: 32, and a learning_rate of 0.0006. Additionally, gamma should be high, as the rewards are quite long-term.
The number of timesteps varies significantly, depending on the complexity of the task (is cutting necessary, how large is the env, etc.)
import cv2
from pearl.action_representation_modules.one_hot_action_representation_module import (
OneHotActionTensorRepresentationModule,
)
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.sequential_decision_making.deep_q_learning import (
DeepQLearning,
)
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import (
FIFOOffPolicyReplayBuffer,
)
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment
from cooperative_cuisine.reinforcement_learning import EnvGymWrapper
custom = True

# Wrap either the Cooperative Cuisine gym environment or a standard gym benchmark.
if custom:
    env = GymEnvironment(EnvGymWrapper())
else:
    env = GymEnvironment("LunarLander-v2", render_mode="rgb_array")

num_actions = env.action_space.n
agent = PearlAgent(
    policy_learner=DeepQLearning(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        hidden_dims=[64, 64],
        training_rounds=20,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions
        ),
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(10_000),
)

# Training loop: run 40 episodes and let the agent learn after every step.
for i in range(40):
    print(i)
    observation, action_space = env.reset()
    agent.reset(observation, action_space)
    done = False
    while not done:
        action = agent.act(exploit=False)
        action_result = env.step(action)
        agent.observe(action_result)
        agent.learn()
        done = action_result.done

if custom:
    env = GymEnvironment(EnvGymWrapper())
else:
    env = GymEnvironment("LunarLander-v2", render_mode="human")

# Second loop: run further episodes and render the custom environment with OpenCV.
for i in range(40):
    print(i)
    observation, action_space = env.reset()
    agent.reset(observation, action_space)
    done = False
    while not done:
        action = agent.act(exploit=False)
        action_result = env.step(action)
        agent.observe(action_result)
        agent.learn()
        done = action_result.done
        if custom:
            img = env.env.render()
            cv2.imshow("image", img[:, :, ::-1])
            cv2.waitKey(1)
cooperative_cuisine/reinforcement_learning/visualization/model_small_rl_env.png

57.3 KiB
