CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents

Embodied grid world where symbolic plans are grounded in physical interaction, designed to study cooperative intelligence at scale.

Hanqing Yang*1, Narjes Nourzad*†2, Shiyu Chen1, Carlee Joe-Wong1

1Carnegie Mellon University
2University of Southern California
NeurIPS 2025 Workshop: Scaling Environments for Agents (SEA)

*Equal Contribution    †Work done during an internship at Carnegie Mellon University
CUBE Teaser

CUBE is a grid world where teams push weighted blocks into a goal zone while respecting embodied constraints. A single scaling parameter n jointly sets team size, block weights, and grid size, creating a transparent curriculum from small- to large-scale cooperation. Each panel shows a snapshot of a cooperative block-pushing scenario at increasing scales (n from 2 to 256) under a simple always-move-right policy.

What is CUBE

CUBE is a lightweight and scalable multi-agent block-pushing environment for studying cooperation in embodied settings. Agents share a grid world and push weighted blocks toward a goal region while dealing with congestion, collisions, force requirements, and changing geometry.

The environment follows the PettingZoo parallel API and is parameterized by a single integer n, which sets the grid side length, the number of agents, and the distribution of block weights. This creates a transparent curriculum from small to large teams. Details about embodied constraints, symbolic actions, and the dual-layer interface appear in later sections.
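
As a minimal usage sketch of this interface: the module name cube_env and the constructor make_env(n=...) below are hypothetical placeholders for the released package, while the loop itself follows the standard PettingZoo parallel API.

# Minimal usage sketch. `cube_env` and `make_env(n=...)` are hypothetical names
# standing in for the released package; the loop follows the PettingZoo parallel API.
from cube_env import make_env  # hypothetical import

env = make_env(n=8)                      # n sets grid size, team size, and block weights
observations, infos = env.reset(seed=0)

while env.agents:
    # Replace random primitive actions with an LLM planner or RL policy.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)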

Key Features

  • Scalable and lightweight. Supports hundreds of agents on a single CPU core.
  • Embodied cooperation. Movement, force, block size, agent chains, block chains, congestion, and collisions all determine whether plans succeed.
  • Dual-layer interface. A symbolic and a primitive interface over the same world, so LLM agents, RL agents, and hybrid systems can all use the same environment.
  • Single parameter curriculum. Difficulty grows with n in a predictable way while tasks at the same n remain comparable.

Why CUBE

We want to study cooperative intelligence in LLM agents at scale.

Traditional reinforcement learning benchmarks emphasize low level action spaces and scalar rewards. For LLM agents, emitting long sequences of primitive moves and waiting for a numerical reward is slow and unnatural.

Symbolic planning domains provide clean abstractions and logical structure but often assume deterministic transitions and ignore embodied dynamics.

CUBE sits between these two paradigms. It wraps primitive block pushing in a symbolic vocabulary while grounding every symbolic action in an embodied grid world. This makes it a single testbed where LLM planners, RL policies, and hybrid systems can be studied side by side.

Why Embodied Tasks

Embodied settings introduce structured dependencies. Moving one block can make another reachable, block off corridors, or create new chains that require multiple agents. A strategy that looks good at the symbolic level can fail once the physical layout and timing are taken into account.

This combination of spatial and temporal constraints creates rich cooperation challenges that purely symbolic or abstract domains do not capture. CUBE is therefore a natural platform for studying multi-agent cooperation, communication, task allocation, and emergent group behavior for LLM-based and learned agents alike.

Embodied Constraints in CUBE

  • Discrete grid occupancy. Agents occupy grid cells with at most one entity per cell. Each block has integer weight equal to the force needed to move it and occupies a square whose side length equals that weight.
  • Forces and weights. Each agent exerts one unit of force in its movement direction.
  • Block chains. When blocks sit in front of one another in a push direction, they form a chain. A chain moves as a composite object when the total force applied at the leading face meets or exceeds the sum of the weights of all blocks in the chain and all destination cells are free.
  • Agent chains. When agents line up behind a block and push in the same direction, the effective force equals the number of aligned agents. If this force meets the chain's total weight, the block chain and agents advance together; otherwise nothing moves (a feasibility check is sketched below the figure).
  • Collisions and races. Moves that target occupied cells fail. When several agents aim at the same free cell, a single agent succeeds and others remain in place, creating cell access races that affect how plans unfold.
Embodied constraints in CUBE
Successful chain
Failed chain

Agent chains, block chains, and block-pushing in CUBE.
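
To make the chain rules concrete, the following is a minimal sketch of the push feasibility check implied by the constraints above; the Block dataclass and function names are illustrative assumptions, not CUBE's internal implementation.

# Sketch of the chain-push feasibility rule; names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Block:
    weight: int                       # integer weight == force needed == side length
    cells: set[tuple[int, int]]       # grid cells the block occupies

def push_succeeds(chain: list[Block], aligned_agents: int,
                  occupied: set[tuple[int, int]], direction: tuple[int, int]) -> bool:
    """A block chain advances only if the aligned agents supply enough force
    and every destination cell outside the chain itself is free."""
    total_weight = sum(block.weight for block in chain)
    if aligned_agents < total_weight:
        return False                  # not enough force at the leading face
    chain_cells = set().union(*(block.cells for block in chain))
    dx, dy = direction
    destination = {(x + dx, y + dy) for (x, y) in chain_cells}
    # Cells the chain itself vacates are fine; any other occupied cell blocks the move.
    return not ((destination - chain_cells) & occupied)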

Dual-Layer Environment for LLM Agents

Human reasoning blends symbolic planning with embodied feedback. We imagine abstract plans, act, compare outcomes with expectations, and adjust. CUBE lets LLM agents follow the same loop: use symbolic structure to reason about goals and relations, then update plans based on the concrete consequences of their actions as the environment evolves.

The environment exposes two aligned views of the same world: a primitive layer that handles low level grid dynamics, and a symbolic layer that provides compact, expressive actions and concepts for planning.

Action. At the primitive layer, each agent chooses from {STAY, UP, DOWN, LEFT, RIGHT}. These actions determine motion, contact, and force on blocks. At the symbolic layer, agents issue macro actions such as move, move_to_block, rendezvous, push_block, yield_block, idle, and wait_agents. Each symbolic action unfolds into a sequence of primitive moves at runtime.
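
As a toy illustration of how a macro might unroll: unroll_move_to is a hypothetical helper and the coordinate convention is an assumption; the real environment also replans around congestion and collisions.

# Toy sketch of macro unrolling into primitive moves; `unroll_move_to` is hypothetical.
def unroll_move_to(agent_pos: tuple[int, int], target: tuple[int, int]) -> list[str]:
    """Greedy Manhattan unroll of a move / move_to_block macro.
    Assumes RIGHT/LEFT change the x index and DOWN/UP change the y index."""
    (ax, ay), (tx, ty) = agent_pos, target
    moves = []
    moves += ["RIGHT"] * max(0, tx - ax) + ["LEFT"] * max(0, ax - tx)
    moves += ["DOWN"] * max(0, ty - ay) + ["UP"] * max(0, ay - ty)
    return moves

print(unroll_move_to((1, 4), (5, 2)))  # ['RIGHT', 'RIGHT', 'RIGHT', 'RIGHT', 'UP', 'UP']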

Symbolic action table

Symbolic actions available in CUBE.

Symbolic action unroll

How symbolic actions unroll into primitives at runtime.

Observation. The primitive view is a multi channel grid with agent locations, block weights, block ids, and the goal zone. The symbolic view is a structured dictionary that stores grid size, agent positions, block properties such as size, location, and distance to the goal, together with a history of symbolic actions and their status.
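
A rough sketch of what the symbolic observation might look like; the key names are illustrative assumptions rather than the exact schema returned by the environment.

# Illustrative shape of the symbolic observation; keys are assumptions.
symbolic_obs = {
    "grid_size": 20,
    "agents": {"agent_0": (3, 17), "agent_1": (8, 17)},
    "blocks": {
        2: {"size": 3, "weight": 3, "location": (6, 9), "dist_to_goal": 7},
        5: {"size": 1, "weight": 1, "location": (12, 4), "dist_to_goal": 2},
    },
    "goal_zone": {"rows": (0, 1)},
    "history": [
        {"agent": "agent_0", "action": "move_to_block", "block": 2, "status": "done"},
        {"agent": "agent_1", "action": "push_block", "block": 2, "status": "in_progress"},
    ],
}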

Feedback. CUBE provides a scalar reward with step cost and shared delivery reward, plus a library of symbolic concepts for shaping richer feedback. Examples include distance between entities, whether an agent is aligned with a block face, how many agents are aligned, block progress toward the goal, quorum status, and whether a block is currently blocked.
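
As a sketch of how these concepts could be combined into richer shaped feedback, the function below takes concept functions as stand-ins for CUBE's concept library; the weights are arbitrary.

# Sketch of shaping on top of the scalar reward; concept functions and weights are assumptions.
def shaped_feedback(obs, agent_id: str, block_id: int,
                    distance, is_aligned, block_progress) -> float:
    reward = -0.01                                        # step cost
    reward += 0.1 * block_progress(obs, block_id)         # block moved toward the goal
    if is_aligned(obs, agent_id, block_id):               # agent on a usable face
        reward += 0.05
    reward -= 0.001 * distance(obs, agent_id, block_id)   # stay near the target block
    return reward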

Symbolic concept examples

Symbolic concepts for feedback.

Symbolic plan

Symbolic plan.

Plan. Symbolic actions are compact but highly expressive. Their arguments select blocks, faces, directions, and time horizons, which creates a rich planning space for LLM agents. Plans become short programs written in this vocabulary, while the primitive layer ensures that every step is grounded in an uncertain, continuously changing environment.
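
For example, a plan might be the following short program; the dict-based encoding and argument names are illustrative assumptions.

# A plan as a short program in the symbolic vocabulary (encoding is illustrative).
plan = [
    {"action": "move_to_block", "block": 2, "face": "LEFT"},
    {"action": "wait_agents",   "block": 2, "quorum": 3, "horizon": 10},
    {"action": "push_block",    "block": 2, "direction": "RIGHT"},
    {"action": "yield_block",   "block": 2},
    {"action": "idle",          "horizon": 5},
]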

Embodied Cooperation Failure Modes

Embodied movement introduces uncertainty during execution. A plan that is feasible at one step can become infeasible a few steps later as blocks move, faces disappear, agents occlude each other, or key cells are taken by another agent. These effects do not appear in purely symbolic domains and are a central reason why cooperation in CUBE remains challenging.

Plan at step k
Plan at step k plus one

Embodied updates between step k and step k + 1 can remove useful faces and turn a previously feasible plan into an infeasible one.

Occlusion failure

Occlusion by agents. Moving agents can temporarily block the remaining staging cells, preventing others from reaching useful positions.

Corner block failure

Corner block. A block pushed into a corner has no usable faces, so it is not actionable even if agents are available nearby.

Cell access race

Cell access race. Several agents move toward the same key cell. Only one succeeds, leaving others misaligned and wasting cooperative effort.

Controllable Difficulty Curriculum

A single integer n controls the entire task family.

  • Grid side length set by k = max(20, n).
  • n agents placed along the wall opposite the goal region.
  • Blocks with weights from ⌊n/2⌋ + 1 down to one, with lighter blocks appearing more often.

Larger n increases both congestion and the quorum needed to move heavier blocks. Layouts at a fixed n differ but have similar cooperative complexity, which provides a clear curriculum from small groups to large teams.
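
A sketch of this curriculum in code; the exact block-weight frequencies beyond "lighter blocks appearing more often" are an assumption.

# Curriculum parameters derived from n; the weight-frequency rule is illustrative.
def curriculum(n: int) -> dict:
    grid_side = max(20, n)                    # k = max(20, n)
    max_weight = n // 2 + 1                   # weights run from floor(n/2) + 1 down to 1
    # Illustrative frequency rule: weight w appears roughly in proportion to 1 / w.
    weight_counts = {w: max(1, max_weight // w) for w in range(max_weight, 0, -1)}
    return {"grid_side": grid_side, "num_agents": n, "weight_counts": weight_counts}

print(curriculum(8))
# {'grid_side': 20, 'num_agents': 8, 'weight_counts': {5: 1, 4: 1, 3: 1, 2: 2, 1: 5}}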

Difficulty curriculum controlled by n

Grid size, agent count, and block distribution scale with the curriculum parameter n.

Baselines and Task Performance

Heuristic baseline

The heuristic baseline follows a greedy strategy. At each stage it selects the block closest to the goal zone and assigns agents to move it, issuing symbolic instructions such as move_to_block, rendezvous, and push_block until the block is delivered. The baseline produces consistent cooperative behavior without adaptive coordination strategies and serves as a reference point for more advanced agents.
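
A minimal sketch of one stage of this greedy policy, reusing the illustrative symbolic-observation layout from above; the accessors and action encoding are assumptions.

# One stage of the greedy heuristic; observation accessors are illustrative assumptions.
def heuristic_stage(symbolic_obs) -> dict[str, dict]:
    """Send every agent toward the undelivered block closest to the goal zone."""
    remaining = {bid: b for bid, b in symbolic_obs["blocks"].items() if b["dist_to_goal"] > 0}
    if not remaining:
        return {agent: {"action": "idle"} for agent in symbolic_obs["agents"]}
    target = min(remaining, key=lambda bid: remaining[bid]["dist_to_goal"])
    # Once enough agents are aligned on a pushing face, later stages issue
    # rendezvous and push_block until the block is delivered.
    return {agent: {"action": "move_to_block", "block": target}
            for agent in symbolic_obs["agents"]}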

Completed blocks vs number of agents

Number of completed blocks versus agent count n for language agents and the heuristic baseline.

Naive language agents

As language-based baselines, the paper evaluates LLM agents in a zero-shot setting. Each agent repeatedly receives a symbolic observation and generates short plans written in the CUBE action vocabulary, with a prompt that encodes a simple rule: always target the block closest to the goal zone.
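
A sketch of this zero-shot loop; the prompt wording and the generic llm callable are placeholders, not the paper's exact prompt.

# Zero-shot planning call per agent; PROMPT and `llm` are illustrative placeholders.
PROMPT = (
    "You control one agent in CUBE. Always target the block closest to the goal zone. "
    "Reply with a short plan using: move, move_to_block, rendezvous, push_block, "
    "yield_block, idle, wait_agents.\n\nObservation:\n{obs}"
)

def naive_llm_agent(symbolic_obs, llm) -> str:
    """One planning call per agent per replanning step; `llm` is any text-completion callable."""
    return llm(PROMPT.format(obs=symbolic_obs))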

These naive LLM agents can generate executable plans but perform inconsistently, particularly when they must rely on other agents. Smaller models show high variance and longer runtimes, suggesting frequent replanning and difficulty with coordination as n grows.

Average steps vs n by model

Average steps per episode as n increases (capped at 200 steps).

Runtime vs n by model

Average runtime at different difficulty levels for each baseline, highlighting that LLM inference dominates runtime.

Overall, the heuristic baseline consistently completes all blocks at the tested scales, whereas naive LLM agents reveal a cooperation gap. They can express nontrivial symbolic behavior yet fall short of robust cooperative performance as team size grows. This motivates richer designs for embodied LLM agents that combine symbolic world models, communication, and learning.

Scalability and Computational Overhead

Mean time per step vs agents

Environment step time versus agent count.

Process memory usage vs agents

Memory usage versus agent count.

CPU utilization vs agents

CPU utilization versus agent count.

Runtime vs agents by action type

Symbolic action overhead versus agent count.

Action runtime heatmap

Symbolic action overhead.

In contrast to the environment's overhead, LLM inference time is large. Generating even short plans takes hundreds of milliseconds to seconds, which makes environment overhead negligible in studies of embodied language agents.

Challenges and Opportunities

Dynamic scene and task

As agents move blocks they create new chains, tighten corridors, or close off some paths, which can make later deliveries easier or harder. Agents must reason about how current choices change future task difficulty and may need to revise decompositions on the fly.

Spatial reasoning

Successful teams need to understand how blocks interact, which sides can be used for approach, and how block chains create soft dependencies between tasks. Simple distance based rules are often not enough when heavy blocks and narrow passages interact.

Synchronization

Many pushes require several agents to reach specific faces of a block at similar times. Rendezvous and waiting behavior must be coordinated with path planning and congestion, and agents need to recover when timing assumptions fail.

Collective intelligence under uncertainty

Agents rarely know exactly what teammates will do. They must form expectations about others, infer goals from movement, and decide when to wait, yield, or reroute. Misjudgments can temporarily lock tasks, such as when a single agent stands in the only useful staging cell near a block.

Asynchrony

Agents may execute plans with different horizons and therefore fall out of sync. Centralized planners, decentralized policies, and communication protocols all have to cope with this asynchrony and design strategies that are robust to partial or delayed execution.

Research directions

CUBE opens space for methods that blend symbolic world models, model predictive control, multi-agent reinforcement learning, and LLM-based planning. It also supports comparisons between centralized planners and decentralized designs that rely on communication, emergent conventions, or shared tools for cooperation.

Poster

BibTeX

@inproceedings{yangcube,
  title     = {CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents},
  author    = {Yang, Hanqing and Nourzad, Narjes and Chen, Shiyu and Joe-Wong, Carlee},
  booktitle = {Workshop on Scaling Environments for Agents},
  year      = {2025}
}

Please also feel free to check out our related work, DR. WELL, where decentralized LLM agents cooperate in CUBE.

@inproceedings{nourzad2025dr,
  title     = {DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration},
  author    = {Nourzad, Narjes and Yang, Hanqing and Chen, Shiyu and Joe-Wong, Carlee},
  booktitle = {NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
  year      = {2025}
}