CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents

A PettingZoo-compatible grid world where symbolic plans are grounded in embodied interaction, with a single parameter n that controls team size, the block weight spectrum, and task difficulty.

Hanqing Yang*1, Narjes Nourzad*†2, Shiyu Chen1, Carlee Joe-Wong1

1Carnegie Mellon University
2University of Southern California
NeurIPS 2025 Workshop: Scaling Environments for Agents (SEA)

*Equal contribution    †Work done during an internship at Carnegie Mellon University
CUBE Teaser

CUBE is a grid world where teams push weighted blocks into a goal zone while respecting embodied constraints. A single scaling parameter n jointly sets team size, block weights, and grid size, creating a transparent curriculum from small- to large-scale cooperation. Each panel illustrates a snapshot of a cooperative block-pushing scenario at increasing scales (n from 2 to 256) under a simple always-move-right policy.

What is CUBE

CUBE is a lightweight, portable, and scalable multi-agent environment that unifies symbolic reasoning with embodied interaction. Agents operate in a shared grid world and must push weighted blocks into a goal zone while avoiding collisions, congestion, and unproductive block chains. These spatial dependencies make cooperation necessary and observable.

At its base level, CUBE is built on the PettingZoo parallel API, so it can plug directly into existing multi-agent reinforcement learning pipelines. The default episode configuration is generated automatically from a single integer n: the grid side length is k = max(20, n), there are n agents, and blocks have integer weights from ⌊n/2⌋ + 1 down to one, with lighter blocks appearing in greater numbers. Roughly half of the grid is covered by blocks. Agents start along the wall opposite the goal region, and blocks are placed away from walls and corners so that every block is movable by pushing.

The parameter n therefore sets both environment scale and cooperative difficulty. As n increases, heavier blocks require larger agent quorums and more complex block chains, and congestion grows since each agent contributes only unit force. Tasks at the same n have comparable complexity, which supports fair comparison of cooperative strategies across methods.
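As a minimal usage sketch, the snippet below drives one episode through the PettingZoo parallel API with random primitive actions. The module path and constructor name are assumptions for illustration (the actual package may expose them differently); only the parallel API calls (reset, step, agents, action_space) follow the standard interface, assuming a recent PettingZoo version where reset returns observations and infos.

from cube_env import CubeEnv   # hypothetical import path; the real package name may differ

n = 8                          # single scaling parameter
env = CubeEnv(n=n)             # grid side k = max(20, n), n agents, block weights 1..(n//2 + 1)

observations, infos = env.reset(seed=0)
while env.agents:
    # issue a random primitive action (STAY/UP/DOWN/LEFT/RIGHT) for every active agent
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)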

  • Embodied multi-agent cooperation. Agents share a grid, occupy discrete cells, and exert one unit of force when they move. Blocks have weights equal to their side length and can be pushed from any face. When agents align on a block face, their forces combine to move the block or a chain of blocks.
  • Dual-layer design. CUBE couples a symbolic layer and a vector-based layer for observation, action, and feedback. Symbolic plans are executed as sequences of primitive movements, while the environment returns both low-level state and structured symbolic feedback that summarizes progress and outcomes.
  • Controllable curriculum. A single parameter n sets grid size, agent count, and block distribution. Larger n increases the effort required for cooperative success while keeping difficulty predictable and reproducible.
  • Lightweight environment. CUBE is implemented in Python with Numba acceleration and maintains low runtime and memory usage even with hundreds of agents, so environment cost stays small compared with LLM inference.
CUBE overview and scaling diagram

Environment at a glance

CUBE makes four main contributions.

  • Scalable environment. A lightweight, portable environment that supports variation in both number of agents and task difficulty.
  • Reproducible curriculum. A difficulty curriculum governed by a single parameter n that enables systematic study of cooperation across comparable difficulty levels while still allowing custom scenarios.
  • Dual interface. A paired symbolic and vector-based interface for observation, action, and feedback that supports reinforcement learning agents, LLM agents, and hybrids.
  • Baseline and scalability study. Empirical evaluation that profiles runtime, memory, and CPU usage, and compares heuristic and LLM-based planners.

Why CUBE

Traditional reinforcement learning benchmarks emphasize low-level action spaces and scalar rewards. These signals are useful for gradient-based training but give little support for symbolic reasoning, interpretability, or debugging. For LLM agents, emitting long sequences of primitive moves and waiting for numerical rewards is unnatural and inefficient.

Symbolic planning domains have clear preconditions and effects but usually assume deterministic transitions and ignore embodied dynamics such as collisions, congestion, or force accumulation. They are therefore not enough on their own to study embodied LLM agents that must reason in uncertain, interactive worlds.

CUBE bridges this gap by wrapping primitive block pushing actions into a symbolic vocabulary and linking that vocabulary to an embodied grid world. This lets LLM agents plan using interpretable, compositional actions while still having to cope with uncertainty, races for cells, and changing geometry.

Why embodied tasks

Human reasoning blends symbolic and embodied perspectives. We imagine plans, act, compare outcomes with expectations, and revise our internal models. CUBE is built to support a similar loop for LLM agents. Plans are expressed in symbolic actions, executed as primitive moves, and evaluated with both scalar and structured feedback.

Embodied constraints keep tasks challenging as team size increases. Moving any block or agent reshapes the scene. Agents must cope with tasks that can disappear, delivery paths that become blocked by other blocks, and sequences where one agent can temporarily make a block unreachable.

This makes CUBE a natural testbed for research on cooperative intelligence at scale, including communication, task allocation, and group level planning.

Embodied Constraints in CUBE

  • Discrete occupancy. Agents and blocks live on grid cells, one entity per cell.
  • Forces and weights. Each agent exerts one unit of force in its movement direction. Each block has integer weight equal to the force needed to move it, and occupies a square whose side length equals that weight.
  • Block chains. When blocks sit in front of one another along a push direction, they form a chain. The chain moves as a composite object when the total applied force at the leading face meets or exceeds the sum of the weights of all blocks in the chain and all destination cells are free.
  • Agent chains. When agents line up behind a block and push in the same direction, the effective force equals the number of aligned agents. If this meets the chain weight, the block chain and agents advance together; otherwise nothing moves (see the sketch after this list).
  • Collisions and races. Agent moves that target occupied cells fail. When several agents aim at the same free cell, the agent with the smallest index claims it. This creates cell-access races where one agent succeeds and the others must rethink their plans.
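The following is a schematic sketch, not CUBE's actual implementation, of the push-resolution rule described in the list above: a chain advances only when the aligned agents supply at least as much force as the chain weighs and every destination cell is free, and cell races are broken by agent index.

def chain_moves(aligned_agents, chain_block_weights, destination_cells_free):
    """Each aligned agent contributes one unit of force in the push direction."""
    applied_force = len(aligned_agents)       # one unit of force per aligned agent
    chain_weight = sum(chain_block_weights)   # chain weight = sum of block weights
    return applied_force >= chain_weight and all(destination_cells_free)

def resolve_cell_race(contending_agent_indices):
    """Several agents target the same free cell: the smallest index claims it."""
    return min(contending_agent_indices)

# Two agents pushing blocks of weight 1 and 2 fail (force 2 < weight 3);
# a third aligned agent makes the chain move, provided the cells ahead are free.
assert not chain_moves([0, 1], [1, 2], [True, True])
assert chain_moves([0, 1, 2], [1, 2], [True, True])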
Illustration of embodied constraints

Agent chains, block chains, and forces in CUBE.

Observation, Action, and Feedback Interfaces

Primitive and symbolic layers in CUBE

Symbolic actions unfold into sequences of primitive movements that operate on the shared grid.

Observation

CUBE exposes two complementary observations.

  • Symbolic observation. Per-agent dictionaries that include the global grid size, the positions of all agents, and compact summaries of each block such as id, weight, position, and distance to the goal column. The history of symbolic actions, their primitive expansions, and their statuses is also recorded, which supports reasoning about what has been tried so far.
  • Multi-channel observation. A five-channel grid with agent locations, block weights, goal column, agent index per cell, and block id per cell, intended for reinforcement learning encoders and visualizations (illustrated in the sketch after this list).
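As a concrete illustration, the sketch below shows plausible shapes for the two observation types; the exact field names are assumptions based on the description above, not CUBE's actual keys.

import numpy as np

k, n = 20, 4
# Symbolic observation for one agent (field names are illustrative).
symbolic_obs = {
    "grid_size": k,
    "agent_positions": {0: (0, 3), 1: (0, 7), 2: (0, 11), 3: (0, 15)},
    "blocks": [
        {"id": 3, "weight": 2, "position": (10, 4), "distance_to_goal": 15},
    ],
    "action_history": [
        {"action": "push_block", "args": {"block_id": 3},
         "primitives": ["RIGHT", "RIGHT"], "status": "in_progress"},
    ],
}

# Multi-channel observation: agent locations, block weights, goal column,
# agent index per cell, block id per cell.
multi_channel_obs = np.zeros((5, k, k), dtype=np.int32)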

Action

The action interface has two layers.

  • Primitive actions. Each agent chooses from {STAY, UP, DOWN, LEFT, RIGHT}. Movements succeed only when the target cell is free, and pushes succeed only when aligned agents provide enough force and the front cell is empty.
  • Symbolic actions. A small library of parameterized actions such as move, move_to_block, rendezvous, push_block, yield_block, idle, and wait_agents. Each action unrolls into primitive moves until its completion condition is met. Agents submit short plans, sequences of such actions with arguments, which then execute in order (see the sketch after this list).
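For concreteness, the sketch below writes one agent's plan in this vocabulary; the argument names are illustrative assumptions rather than CUBE's exact plan format.

# A short symbolic plan for one agent, using the action vocabulary above.
plan = [
    ("move_to_block", {"block_id": 3, "face": "left"}),
    ("rendezvous",    {"block_id": 3, "quorum": 2}),
    ("push_block",    {"block_id": 3, "direction": "RIGHT"}),
]
# Each symbolic action unrolls into primitive moves (STAY/UP/DOWN/LEFT/RIGHT)
# until its completion condition is met, then the next action in the plan starts.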

Feedback and symbolic concepts

CUBE provides both scalar rewards and a library of symbolic concepts that help define custom metrics. The default reward has a small step cost and a shared delivery reward proportional to delivered block weight. Symbolic concepts include utilities such as functions that measure distances, count aligned agents, summarize block progress, measure quorum deficit, and detect blocking, which can be combined to design adaptive feedback or evaluation signals.
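A minimal sketch of the default reward and one symbolic-concept utility, following the description above; the constants and function names are illustrative, not CUBE's actual values.

STEP_COST = 0.01        # illustrative value; the paper only says "small step cost"
DELIVERY_SCALE = 1.0    # illustrative scale for the shared delivery reward

def shared_step_reward(delivered_block_weights):
    """Shared reward per step: minus a small step cost, plus a delivery
    reward proportional to the total weight of blocks delivered this step."""
    return -STEP_COST + DELIVERY_SCALE * sum(delivered_block_weights)

def quorum_deficit(block_weight, num_aligned_agents):
    """How many more aligned agents are needed before the block can move."""
    return max(0, block_weight - num_aligned_agents)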

Chains and Embodied Failure Modes

In CUBE, blocks can form chains that require joint force from several agents. A pushing line forms when agents align along a block face, each contributing unit force in the push direction. Motion succeeds only when the total applied force meets or exceeds the chain weight and the frontmost destination cell is free.

Valid tasks can disappear dynamically as blocks move. For example, a face of a block that was previously reachable can vanish when another block moves adjacent to it. Agents that were committed to that face must then recover and replan. Cell races and boundary geometry introduce further failure cases, such as blocks that become stuck against walls, since blocks can only be pushed, not pulled.

Successful pushing chain with agents aligned

Successful chain. Aligned agents form a stable pushing line and move a composite block chain.

Geometric failure case

Geometric failure. Misalignment or an occupied destination cell prevents progress.

Dynamics failure case 1

Failure I. Agents cannot stage on the target face due to congestion, so no joint push can form.

Dynamics failure case 2

Failure II. Agents block one another or oscillate, preventing stable cooperation despite a feasible strategy.

Controllable Difficulty Curriculum

A single integer n controls the entire task family.

  • Grid side length set by k = max(20, n).
  • n agents placed along the wall opposite the goal region.
  • Blocks with weights from ⌊n/2⌋ + 1 down to one, with lighter blocks appearing more often.

Larger n increases both congestion and the quorum needed to move heavier blocks. Layouts at a fixed n differ but have similar cooperative complexity, which provides a clear curriculum from small groups to large teams.
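The sketch below tabulates the curriculum for a few values of n using the formulas stated above (grid side length, agent count, and heaviest block weight).

for n in (2, 8, 64, 256):
    k = max(20, n)            # grid side length
    max_weight = n // 2 + 1   # heaviest block weight
    print(f"n={n:>3}  grid {k}x{k}  agents {n}  block weights {max_weight} down to 1")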

Difficulty curriculum controlled by n

Grid size, agent count, and block distribution scale with the curriculum parameter n.

Baselines and Task Performance

Heuristic baseline

The heuristic baseline follows a greedy strategy. At each stage it selects the block closest to the goal zone and assigns agents to move it, issuing symbolic instructions such as move_to_block, rendezvous, and push_block until the block is delivered. The baseline produces valid cooperative behavior without explicitly optimizing path length or resolving congestion, and serves as a reference point for more advanced agents.
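A schematic sketch of this greedy strategy follows; the function and field names are illustrative, not the baseline's actual code.

def heuristic_plans(blocks, agents):
    """Greedy stage: target the undelivered block closest to the goal zone and
    send just enough agents to push it; everyone else idles."""
    pending = [b for b in blocks if not b["delivered"]]
    if not pending:
        return {a: [("idle", {})] for a in agents}
    target = min(pending, key=lambda b: b["distance_to_goal"])
    crew = agents[: target["weight"]]                  # quorum size = block weight
    plans = {a: [("idle", {})] for a in agents}
    for a in crew:
        plans[a] = [
            ("move_to_block", {"block_id": target["id"]}),
            ("rendezvous",    {"block_id": target["id"]}),
            ("push_block",    {"block_id": target["id"]}),
        ]
    return plans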

Completed blocks vs number of agents

Number of completed blocks versus agent count n for language agents and the heuristic baseline.

Naive language agents

As language-based baselines, the paper evaluates two LLM agents in a zero-shot setting. Each agent repeatedly receives a symbolic observation and generates short plans written in the CUBE action vocabulary, guided by a prompt that encodes a simple rule: always target the block closest to the goal zone.
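A schematic sketch of this zero-shot loop is given below; generate() stands in for any LLM call, parse_plan() for a plan parser, and the prompt wording is a paraphrase rather than the paper's actual prompt.

PROMPT_TEMPLATE = (
    "You control agent {agent_id} in a cooperative block-pushing grid world.\n"
    "Rule: always target the block closest to the goal zone.\n"
    "Observation: {observation}\n"
    "Reply with a short plan using only: move, move_to_block, rendezvous, "
    "push_block, yield_block, idle, wait_agents."
)

def plan_for_agent(agent_id, symbolic_observation, generate, parse_plan):
    """One replanning round for one agent: prompt the model, then parse its
    reply into a sequence of (action, args) pairs in the CUBE vocabulary."""
    prompt = PROMPT_TEMPLATE.format(agent_id=agent_id, observation=symbolic_observation)
    return parse_plan(generate(prompt))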

These naive LLM agents can generate executable plans but perform inconsistently, particularly when they must rely on other agents. Smaller models show high variance and longer runtimes, suggesting frequent replanning and difficulty with coordination as n grows.

Average steps vs n by model

Average steps per episode as n increases (capped at 200 steps).

Runtime vs n by model

Average runtime versus n by model, highlighting that LLM inference dominates runtime.

Overall, the heuristic baseline consistently completes all blocks at the tested scales, whereas naive LLM agents reveal a cooperation gap. They can express nontrivial symbolic behavior yet fall short of robust cooperative performance as team size grows. This motivates richer designs for embodied LLM agents that combine symbolic world models, communication, and learning.

Scalability and Computational Overhead

Mean time per step vs agents

Mean environment step time grows smoothly with n, from small fractions of a millisecond for small teams to tens of milliseconds at 256 agents on one CPU core.

Process memory usage vs agents

Memory usage increases approximately linearly, from tens of megabytes at small scales to under a gigabyte at n = 256.

CPU utilization vs agents

CPU utilization rises steadily and stays within a manageable range even with 256 agents, so the environment remains lightweight compared with LLM inference.

Runtime vs agents by action type

Mean runtime per symbolic action versus agent count. Most actions stay in a narrow band of runtimes even at larger n, while actions that inspect many environment objects become more expensive but remain efficient.

Action runtime heatmap

Heatmap of symbolic action runtimes across scales.

In contrast, LLM inference time is relatively large. Generating even short plans takes hundreds of milliseconds to seconds, which makes environment overhead negligible for studies of embodied language agents.

Challenges and Opportunities

Dynamic scene and task

As agents move blocks they create new chains, tighten corridors, or close off some paths, which can make later deliveries easier or harder. Agents must reason about how current choices change future task difficulty and may need to revise decompositions on the fly.

Spatial reasoning

Successful teams need to understand how blocks interact, which sides can be used for approach, and how block chains create soft dependencies between tasks. Simple distance based rules are often not enough when heavy blocks and narrow passages interact.

Synchronization

Many pushes require several agents to reach specific faces of a block at similar times. Rendezvous and waiting behavior must be coordinated with path planning and congestion, and agents need to recover when timing assumptions fail.

Collective intelligence under uncertainty

Agents rarely know exactly what teammates will do. They must form expectations about others, infer goals from movement, and decide when to wait, yield, or reroute. Misjudgments can lock tasks temporarily, such as when a single agent stands in the only useful staging cell near a block.

Asynchrony

Agents may execute plans with different horizons and therefore fall out of sync. Centralized planners, decentralized policies, and communication protocols all have to cope with this asynchrony, which calls for strategies that are robust to partial or delayed execution.

Research directions

CUBE opens space for methods that blend symbolic world models, model predictive control, multi-agent reinforcement learning, and LLM-based planning. It also supports comparison between centralized planners and decentralized designs that rely on communication, emergent conventions, or shared tools for cooperation.

Poster

BibTeX

@inproceedings{yangcube,
  title     = {CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents},
  author    = {Yang, Hanqing and Nourzad, Narjes and Chen, Shiyu and Joe-Wong, Carlee},
  booktitle = {Workshop on Scaling Environments for Agents},
  year      = {2025}
}
Please also feel free to check out our other work, DR. WELL, where decentralized LLM agents cooperate in CUBE.
@inproceedings{nourzad2025dr,
  title     = {DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration},
  author    = {Nourzad, Narjes and Yang, Hanqing and Chen, Shiyu and Joe-Wong, Carlee},
  booktitle = {NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
  year      = {2025}
}