CUBE is a lightweight, scalable multi-agent block-pushing environment for studying cooperation in embodied settings. Agents share a grid world and push weighted blocks toward a goal region while dealing with congestion, collisions, force requirements, and changing geometry.
The environment follows the PettingZoo parallel API and is parameterized by a single integer n,
which sets grid side length, number of agents, and the distribution of block weights.
This creates a transparent curriculum from small to large teams.
Details about embodied constraints, symbolic actions, and the dual layer interface appear in later sections.
Difficulty grows with n in a predictable way, while tasks at the same n remain comparable.
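As a minimal sketch of the interaction loop, the snippet below drives the environment through the standard PettingZoo parallel API. The CubeParallelEnv name and its import path are hypothetical placeholders; the actual constructor may differ.

# Minimal sketch of driving CUBE through the PettingZoo parallel API.
# `CubeParallelEnv` and its import path are hypothetical placeholders.
from cube import CubeParallelEnv

env = CubeParallelEnv(n=4)          # n sets grid size, agent count, and block weights
observations, infos = env.reset(seed=0)

while env.agents:
    # One primitive action per live agent, keyed by agent name (parallel API convention).
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()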
We want to study cooperative intelligence in LLM agents at scale.
Traditional reinforcement learning benchmarks emphasize low level action spaces and scalar rewards. For LLM agents, emitting long sequences of primitive moves and waiting for a numerical reward is slow and unnatural.
Symbolic planning domains provide clean abstractions and logical structure but often assume deterministic transitions and ignore embodied dynamics.
CUBE sits between these two paradigms. It wraps primitive block pushing into a symbolic vocabulary while grounding every symbolic action in an embodied grid world. This makes it a single testbed where LLM planners, RL policies, and hybrid systems can be studied side by side.
Embodied settings introduce structured dependencies. Moving one block can make another reachable, block off corridors, or create new chains that require multiple agents. A strategy that looks good at the symbolic level can fail once the physical layout and timing are taken into account.
This combination of spatial and temporal constraints creates rich cooperation challenges that purely symbolic or abstract domains do not capture. CUBE is therefore a natural platform for studying multi agent cooperation, communication, task allocation, and emergent group behavior for LLM based and learned agents alike.
Agent chains, block chains, and block-pushing in CUBE.
Human reasoning blends symbolic planning with embodied feedback. We imagine abstract plans, act, compare outcomes with expectations, and adjust. CUBE lets LLM agents follow the same loop: use symbolic structure to reason about goals and relations, then update plans based on the concrete consequences of their actions as the environment evolves.
The environment exposes two aligned views of the same world: a primitive layer that handles low level grid dynamics, and a symbolic layer that provides compact, expressive actions and concepts for planning.
Action.
At the primitive layer, each agent chooses from {STAY, UP, DOWN, LEFT, RIGHT}.
These actions determine motion, contact, and force on blocks.
At the symbolic layer, agents issue macro actions such as
move, move_to_block, rendezvous, push_block,
yield_block, idle, and wait_agents.
Each symbolic action unfolds into a sequence of primitive moves at runtime.
Symbolic actions available in CUBE.
How symbolic actions unroll into primitives at runtime.
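As an illustration of this unrolling, the sketch below expands a move_to_block-style macro into primitive moves with a greedy Manhattan walk. It is a simplified stand-in, not CUBE's actual expansion logic, and the coordinate convention is assumed.

# Illustrative unrolling of a macro action into primitives (not CUBE's actual logic).
STAY, UP, DOWN, LEFT, RIGHT = range(5)

def unroll_move_to_block(agent_pos, staging_cell):
    """Greedy Manhattan walk from the agent to a staging cell next to a block face."""
    ax, ay = agent_pos
    tx, ty = staging_cell
    primitives = []
    while (ax, ay) != (tx, ty):
        if ax < tx:
            primitives.append(RIGHT); ax += 1
        elif ax > tx:
            primitives.append(LEFT); ax -= 1
        elif ay < ty:
            primitives.append(DOWN); ay += 1   # assumes y grows downward
        else:
            primitives.append(UP); ay -= 1
    return primitives

# unroll_move_to_block((0, 0), (2, 1)) -> [RIGHT, RIGHT, DOWN]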
Observation. The primitive view is a multi channel grid with agent locations, block weights, block ids, and the goal zone. The symbolic view is a structured dictionary that stores grid size, agent positions, block properties such as size, location, and distance to the goal, together with a history of symbolic actions and their status.
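The structure below sketches what such a symbolic observation might look like; the field names and nesting are illustrative, not the exact keys CUBE exposes.

# Hypothetical shape of a symbolic observation (field names are illustrative).
symbolic_obs = {
    "grid_size": 20,
    "agents": {"agent_0": (3, 17), "agent_1": (5, 17)},
    "blocks": {
        "block_0": {"weight": 2, "position": (8, 9), "dist_to_goal": 6},
        "block_1": {"weight": 1, "position": (4, 12), "dist_to_goal": 10},
    },
    "action_history": [
        {"agent": "agent_0", "action": "move_to_block",
         "args": {"block": "block_0"}, "status": "done"},
    ],
}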
Feedback. CUBE provides a scalar reward composed of a per-step cost and a shared delivery bonus, plus a library of symbolic concepts for shaping richer feedback. Examples include distance between entities, whether an agent is aligned with a block face, how many agents are aligned, block progress toward the goal, quorum status, and whether a block is currently blocked.
Symbolic concepts for feedback.
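As a rough sketch of how these concepts could be combined into shaped feedback, the function below mixes a few of them into a scalar signal; the concept names, weights, and thresholds are illustrative, not CUBE's built-in reward.

# Illustrative reward shaping from symbolic concepts; weights are arbitrary.
def shaped_reward(prev, curr, step_cost=0.01, delivery_bonus=1.0):
    r = -step_cost                                              # per-step cost
    r += 0.1 * (prev["dist_to_goal"] - curr["dist_to_goal"])    # block progress toward goal
    if curr["aligned_agents"] >= curr["quorum"]:                # enough agents on a usable face
        r += 0.05
    if curr["blocked"]:                                         # block currently has no usable face
        r -= 0.05
    if curr["delivered"]:                                       # shared delivery reward
        r += delivery_bonus
    return r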
Symbolic plan.
Plan. Symbolic actions are compact but highly expressive. Their arguments select blocks, faces, directions, and time horizons, which creates a rich planning space for LLM agents. Plans become short programs written in this vocabulary, while the primitive layer ensures that every step is grounded in an uncertain, continuously changing environment.
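A plan in this vocabulary might look like the short program below; the argument names (block id, face, direction, horizon) are illustrative placeholders.

# A symbolic plan as a short program; argument names are illustrative.
plan = [
    ("move_to_block", {"block": "block_2", "face": "left"}),
    ("wait_agents",   {"agents": ["agent_1"], "horizon": 5}),
    ("push_block",    {"block": "block_2", "direction": "RIGHT"}),
    ("yield_block",   {"block": "block_2"}),
    ("idle",          {"horizon": 2}),
]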
Embodied movement introduces uncertainty during execution. A plan that is feasible at one step can become infeasible a few steps later as blocks move, faces disappear, agents occlude each other, or key cells are taken by another agent. These effects do not appear in purely symbolic domains and are a central reason why cooperation in CUBE remains challenging.
Embodied updates between step k and step k + 1 can remove useful faces and turn a previously feasible plan into an infeasible one.
Occlusion by agents. Moving agents can temporarily block the remaining staging cells, preventing others from reaching useful positions.
Corner block. A block pushed into a corner has no usable faces, so it is not actionable even if agents are available nearby.
Cell access race. Several agents move toward the same key cell. Only one succeeds, leaving others misaligned and wasting cooperative effort.
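One common way to cope with these failure modes is an execute-monitor-replan loop. The sketch below assumes placeholder is_feasible, execute, and replan callbacks supplied by the agent; none of these are part of CUBE's API.

# Sketch of an execute-monitor-replan loop; is_feasible, execute, and replan
# are placeholders for agent-specific logic, not part of CUBE's API.
def run_plan(state, plan, is_feasible, execute, replan):
    step = 0
    while step < len(plan):
        action, args = plan[step]
        if not is_feasible(state, action, args):    # face gone, cell taken, quorum lost, ...
            plan = replan(state, plan[step:])       # rebuild the remaining plan
            step = 0
            continue
        state = execute(state, action, args)        # unrolls into primitive moves
        step += 1
    return state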
A single integer n controls the entire task family.
Grid side length: k = max(20, n).
Agents: n agents placed along the wall opposite the goal region.
Block weights: from ⌊n/2⌋ + 1 down to one, with lighter blocks appearing more often.
Larger n increases both congestion and the quorum needed to move heavier blocks.
Layouts at a fixed n differ but have similar cooperative complexity, which provides
a clear curriculum from small groups to large teams.
Grid size, agent count, and block distribution scale with the curriculum parameter n.
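A minimal sketch of this scaling is given below, with the "lighter blocks appear more often" rule modeled as an inverse-weight distribution; the exact sampling used by CUBE may differ.

# Sketch of how the curriculum parameter n maps to task settings.
def task_settings(n):
    k = max(20, n)                       # grid side length
    num_agents = n                       # placed along the wall opposite the goal region
    max_weight = n // 2 + 1              # heaviest block weight
    # Lighter blocks appear more often; inverse-weight probabilities are one illustrative choice.
    weights = list(range(1, max_weight + 1))
    probs = [1.0 / w for w in weights]
    total = sum(probs)
    weight_dist = {w: p / total for w, p in zip(weights, probs)}
    return k, num_agents, weight_dist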
The heuristic baseline follows a greedy strategy.
At each stage it selects the block closest to the goal zone and assigns agents to move it,
issuing symbolic instructions such as move_to_block, rendezvous, and
push_block until the block is delivered.
The baseline produces consistent cooperative behavior without any adaptive strategy, and serves as a reference point for more advanced agents.
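The loop below sketches this greedy strategy. The environment queries (undelivered_blocks, idle_agents) and the issue and step_until_resolved helpers are hypothetical names used only for illustration.

# Sketch of the greedy heuristic baseline; query and helper names are hypothetical.
def greedy_baseline(env, issue, step_until_resolved):
    while env.undelivered_blocks():
        # Target the undelivered block closest to the goal zone.
        target = min(env.undelivered_blocks(), key=lambda b: b.dist_to_goal)
        crew = env.idle_agents()[: target.weight]        # quorum sized to block weight
        for agent in crew:
            issue(agent, "move_to_block", block=target.id)
            issue(agent, "rendezvous", block=target.id)
            issue(agent, "push_block", block=target.id, direction="toward_goal")
        step_until_resolved(target)                      # run until delivered or instructions fail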
Number of completed blocks versus agent count n for language agents and the heuristic baseline.
As language based baselines, the paper evaluates LLM agents in a zero shot setting. Each agent repeatedly receives a symbolic observation and generates short plans written in the CUBE action vocabulary, with a prompt that encodes a simple rule to always target the block closest to the goal zone.
These naive LLM agents can generate executable plans but perform inconsistently, particularly when they must rely on other agents. Smaller models show high variance and longer runtimes, suggesting frequent replanning and difficulty with coordination as n grows.
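The sketch below shows one way such a zero shot agent step can be wired up; the prompt wording and the call_llm helper are placeholders, not the exact prompt used in the paper.

# Sketch of a zero shot LLM agent step; the prompt and call_llm are placeholders.
import json

RULE = "Always target the block closest to the goal zone."

def llm_agent_step(agent_id, symbolic_obs, call_llm):
    prompt = (
        f"You control {agent_id} in CUBE.\n"
        f"Rule: {RULE}\n"
        f"Observation: {json.dumps(symbolic_obs)}\n"
        "Reply with a short plan as a JSON list of [action, arguments] pairs "
        "using the CUBE action vocabulary (move, move_to_block, rendezvous, "
        "push_block, yield_block, idle, wait_agents)."
    )
    return json.loads(call_llm(prompt))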
Average steps per episode as n increases (capped at 200 steps).
Average runtime of the baselines at different difficulty levels, highlighting that LLM inference dominates runtime.
Overall, the heuristic baseline consistently completes all blocks at the tested scales, whereas naive LLM agents reveal a cooperation gap. They can express nontrivial symbolic behavior yet fall short of robust cooperative performance as team size grows. This motivates richer designs for embodied LLM agents that combine symbolic world models, communication, and learning.
Environment step time versus agent count.
Memory usage versus agent count.
CPU utilization versus agent count.
Symbolic action overhead versus agent count.
In contrast to these modest environment costs, LLM inference time is large. Generating even short plans takes hundreds of milliseconds to seconds, which makes environment overhead negligible in studies of embodied language agents.
As agents move blocks they create new chains, tighten corridors, or close off some paths, which can make later deliveries easier or harder. Agents must reason about how current choices change future task difficulty and may need to revise decompositions on the fly.
Successful teams need to understand how blocks interact, which sides can be used for approach, and how block chains create soft dependencies between tasks. Simple distance based rules are often not enough when heavy blocks and narrow passages interact.
Many pushes require several agents to reach specific faces of a block at similar times. Rendezvous and waiting behavior must be coordinated with path planning and congestion, and agents need to recover when timing assumptions fail.
Agents rarely know exactly what teammates will do. They must form expectations about others, infer goals from movement, and decide when to wait, yield, or reroute. Misjudgments can temporarily lock tasks, such as when a single agent stands in the only useful staging cell near a block.
Agents may execute plans with different horizons and therefore fall out of sync. Centralized planners, decentralized policies, and communication protocols all have to cope with this asynchrony and design strategies that are robust to partial or delayed execution.
CUBE opens space for methods that blend symbolic world models, model predictive control, multi agent reinforcement learning, and LLM based planning. It also supports comparison between centralized planners and decentralized designs that rely on communication, emerging conventions, or shared tools for cooperation.
@inproceedings{yangcube,
title = {CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents},
author = {Yang, Hanqing and Nourzad, Narjes and Chen, Shiyu and Joe-Wong, Carlee},
booktitle = {Workshop on Scaling Environments for Agents},
year = {2025}
}
Please also feel free to check out our other work, DR. WELL, where decentralized LLM agents cooperate in CUBE.
@inproceedings{nourzad2025dr,
title = {DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration},
author = {Nourzad, Narjes and Yang, Hanqing and Chen, Shiyu and Joe-Wong, Carlee},
booktitle = {NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
year = {2025}
}