EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents

Agents in Action

MA-Crafter

Multi-agent open-world survival with partial observability and a technology tree. Cooperative tasks are defined over resource collection and crafting, requiring multiple agents to work together.

Action Concept	Args	Effect
`noop`	steps	Do nothing for `steps` steps.
`move`	dir, steps	Move left/right/up/down.
`collect`	resource	Collect; all qualifying agents receive it.
`craft`	item	Craft at a tool station; requires simultaneous `craft` with a nearby quorum. Resources consumed from leader's bag; all participating agents receive the crafted item.
`sleep`	-	Recover energy.
`place`	object	Place an object in the world.
`share`	id, item, qty	Transfer resources to another agent.
`navigate`	target, id	Pathfind to target (64-step timeout).
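
The `craft` row combines a participation, a spatial, and a dependency check in one action. A minimal Python sketch of how such a rule could be resolved for one step is given here. It is not the benchmark's implementation: the `Agent` class, the `try_craft` function, the `recipe`, `quorum`, and `radius` parameters, and the Chebyshev distance are all assumptions made for illustration, the leader is simply passed in because this page does not say how it is chosen, and the tool-station check is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: int
    pos: tuple[int, int]
    bag: dict[str, int] = field(default_factory=dict)


def try_craft(item, recipe, quorum, radius, leader, crafters):
    """Resolve one step of simultaneous `craft(item)` actions (hypothetical rule).

    crafters: agents that issued `craft(item)` this step, leader included.
    Returns True if the craft succeeded.
    """
    # Participation + spatial: count crafters close to the leader.
    nearby = [
        a for a in crafters
        if max(abs(a.pos[0] - leader.pos[0]), abs(a.pos[1] - leader.pos[1])) <= radius
    ]
    if len(nearby) < quorum:
        return False
    # Dependency: the leader's bag must cover the whole recipe.
    if any(leader.bag.get(res, 0) < qty for res, qty in recipe.items()):
        return False
    # Resources are consumed from the leader's bag only.
    for res, qty in recipe.items():
        leader.bag[res] -= qty
    # Every participating agent receives the crafted item.
    for a in nearby:
        a.bag[item] = a.bag.get(item, 0) + 1
    return True
```

In this sketch an agent that issues `craft` from too far away neither counts toward the quorum nor receives the item.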

MA-Crafter. Open-world cooperative collection and crafting with explicit participation thresholds and tool dependencies.

CUBE. Grid-world cooperative block pushing with embodied constraints from collisions and quorum-based pushing.

CUBE

Block-pushing grid world where blocks have weights that determine both their physical size and the number of agents required to push from a side simultaneously. Spatio-temporal coordination failures emerge when agents do not line up on the same side at the same step.

Action Concept	Args	Effect
`move`	dir, steps	Move in direction `dir` for `steps` steps.
`wait`	steps	Idle for `steps` steps (STAY).
`push`	block, steps	Push; succeeds only if a sufficient number of agents align on the same face and their combined force meets or exceeds the (possibly chained) block weight, with a free destination cell.
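
The `push` rule can be illustrated with a small Python check. This is a sketch under stated assumptions, not the CUBE source: the `Block` class, the `push_succeeds` function, the grid orientation, the w-by-w footprint for a block of weight w, and one unit of force per agent are all hypothetical, and block chaining is not modeled (the weight of any chained blocks would be added to the required force).

```python
from dataclasses import dataclass

# Unit steps on a grid with y growing downward (an assumption of this sketch).
DIRS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}


@dataclass(frozen=True)
class Block:
    x: int       # top-left cell
    y: int
    weight: int  # assumed: a block of weight w covers a w-by-w square

    def cells(self):
        return {(self.x + i, self.y + j)
                for i in range(self.weight) for j in range(self.weight)}


def shifted(cells, dx, dy):
    """Translate a set of cells by (dx, dy)."""
    return {(x + dx, y + dy) for x, y in cells}


def push_succeeds(block, direction, pushers, occupied):
    """pushers: positions of agents issuing `push` on this block this step.
    occupied: cells taken by walls, agents, or other blocks."""
    dx, dy = DIRS[direction]
    body = block.cells()
    face = shifted(body, -dx, -dy) - body        # cells behind the block
    destination = shifted(body, dx, dy) - body   # cells the block would enter
    aligned = sum(1 for pos in pushers if pos in face)
    return aligned >= block.weight and not (destination & set(occupied))
```

Agents pushing from different faces do not add up here, which is one way the spatial and temporal constraints can fail even when enough agents are nearby.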

Overview

EmCoop is a benchmark framework for studying cooperation in LLM-based embodied multi-agent systems. We define cooperation as a process: a cooperative task is only completed when it reaches a termination state, and the path to that state - how agent decisions alter the task state step by step - is what we make explicitly observable. By encoding four fundamental cooperation constraints (spatial, temporal, participation, dependency) directly into task structure, every change in constraint state becomes trackable: we can see which constraints are satisfied, which are blocked, and when cooperation breaks down.

Figure 1 panels: Dual-layer Interface, MAEIL, Cooperation-Constrained Task, Observable Cooperation Process.

EmCoop at a glance. Dual-layer cognitive–primitive interface, scalable tasks in MA-Crafter and CUBE, and cooperation metrics that quantify dynamics and failure modes. (See paper Figure 1.)

Dual-layer interface

Separates cognitive planning and communication from primitive execution, enabling step-aligned attribution between reasoning, messages, actions, and constraint satisfaction.

Cooperation-constrained tasks

Tasks are temporally extended processes whose progress requires satisfying explicit spatial, temporal, dependency, and participation constraints.

Observable cooperation process

Cooperation is defined as a process toward a termination state. Constraint-encoded tasks make every state transition trackable - revealing where and why coordination breaks down, not just whether it succeeded.

Framework: Dual-layer Interface and MAEIL

To study embodied multi-agent cooperation, we need to capture three intertwined dynamics at once: how agents reason and plan individually, how they communicate and coordinate with each other, and how they jointly interact with a shared physical environment. Our approach - the Multi-Agent Environment Interaction Loop (MAEIL) - models all three through two decoupled time scales:

Cognitive clock $\hat t$ - runs freely between environment steps. Governs reasoning ($\mathsf{R}$), interruption ($\mathsf{I}$), messaging, and plan generation. Advances as many times as needed before any primitive action is taken.

Environment clock $t$ - advances only when all agents have committed a plan and entered $\mathsf{W}$. Governs primitive execution ($\mathsf{X}$) and state transitions $s_t \to s_{t+1}$.

This decoupling lets agents think and communicate asynchronously at their own pace, while the shared environment advances at synchronized step boundaries.

Figure 2: Dual-layer interface and MAEIL stage transitions.

Dual-layer Cognitive-primitive interface

LLMs excel at cognitive-level decisions - reasoning over goals and composing symbolic plans - but are ill-suited to specifying low-level primitive actions directly. EmCoop bridges this gap with two layers:

Primitive layer is modeled by a finite-horizon Dec-POMDP $(I, S, \{\Omega_i\}, \{A_i\}, T)$, where $I=\{1,\dots,n\}$ is the agent set, $S$ the state space, $\Omega_i$ agent observations, $A_i$ primitive actions, and $T$ the horizon.

Cognitive layer captures each agent's state and plan. At cognitive time $\hat t$, agent $i$ holds $x_{i,\hat{t}}=(\mathcal{Q}_{i,\hat{t}},\pi_{i,\hat{t}})$, where $\mathcal{Q}_{i,\hat{t}}\in\{\mathsf{R},\mathsf{X},\mathsf{W},\mathsf{I}\}$ is its MAEIL stage and $\pi_{i,\hat t}$ is its committed plan:

$$\pi_{i,\hat t} = \bigl(\mathcal{T}_i,\;[\hat a_{i,1}(\theta_{i,1}),\dots]\bigr)$$

where $\hat a_{i,k}\in\hat{\mathcal{A}}$ is a high-level action concept and $\theta_{i,k}$ its parameters.

Agents communicate asynchronously within MAEIL via message buffers. Let $\mathcal{M}_{[\hat t,\hat t']}$ denote the multiset of communication events over cognitive interval $[\hat t,\hat t']$, where each event is a message $m=(p,R)$ with payload $p$ and recipient set $R\subseteq I$. Each agent maintains a buffer of delivered but unread messages, updated upon arrival and cleared when read. The interaction loop over this interval induces joint observable trajectories captured as cognitive traces:

$$X_{[\hat t,\hat t']} = \{x_{i,\tau}\}_{i\in I,\;\tau\in[\hat t,\hat t']}, \quad M_{[\hat t,\hat t']} = \mathcal{M}_{[\hat t,\hat t']}$$

which respectively capture agents' internal reasoning and planning dynamics and the full communication trajectory.
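
A minimal Python sketch of the buffer semantics described above: messages are appended to each recipient's buffer on arrival, the buffer is cleared when read, and every send is logged so that the communication trace $M_{[\hat t,\hat t']}$ can be recovered. The `Message` and `MessageBus` names are hypothetical, and the `sender` field is an addition for illustration (the text defines a message only as payload and recipient set).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: int                  # assumption: not part of m = (p, R) in the text
    payload: str                 # p
    recipients: frozenset[int]   # R, a subset of the agent set I


@dataclass
class MessageBus:
    agent_ids: list[int]
    buffers: dict[int, list[Message]] = field(default_factory=dict)
    trace: list[tuple[int, Message]] = field(default_factory=list)

    def __post_init__(self):
        # One buffer of delivered-but-unread messages per agent.
        self.buffers = {i: [] for i in self.agent_ids}

    def send(self, t_hat, msg):
        """Log the event at cognitive time t_hat and deliver it to each recipient."""
        self.trace.append((t_hat, msg))
        for r in msg.recipients:
            self.buffers[r].append(msg)

    def read(self, agent_id):
        """Return unread messages and clear the buffer."""
        unread, self.buffers[agent_id] = self.buffers[agent_id], []
        return unread
```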

Grounding $\Gamma$ bridges the two layers: each $\hat a_{i,k}$ is expanded into primitive actions given the current state $s_t$:

$$\Gamma\bigl(\hat a_{i,k}(\theta_{i,k}),\, s_t\bigr) \mapsto \{a_{i,\tau}\}_{\tau=t:t'}$$
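
As a rough illustration of grounding, the Python sketch below expands a plan's action concepts into primitive actions. Everything named here is an assumption: the `ActionConcept` and `Plan` classes, the `ground` function, and the `MOVE_*` primitive names are hypothetical (only `STAY` appears on this page), and $\mathcal{T}_i$ is stored as an opaque label because this section does not define it further.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionConcept:
    name: str     # high-level action concept, e.g. "move"
    params: dict  # its parameters theta


@dataclass
class Plan:
    task: str      # stands in for T_i (treated as an opaque label here)
    actions: list  # ordered list of ActionConcept


def ground(concept, state):
    """Expand one action concept into a list of primitive actions.

    `state` is unused by the two concepts below; a state-dependent concept
    such as `navigate` would need it for pathfinding.
    """
    if concept.name == "move":
        return [f"MOVE_{concept.params['dir'].upper()}"] * concept.params["steps"]
    if concept.name in ("noop", "wait"):
        return ["STAY"] * concept.params["steps"]
    raise ValueError(f"no grounding defined for {concept.name!r}")


plan = Plan("collect wood", [
    ActionConcept("move", {"dir": "left", "steps": 2}),
    ActionConcept("wait", {"steps": 1}),
])
primitives = [a for c in plan.actions for a in ground(c, state=None)]
```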

Multi-Agent Environment Interaction Loop (MAEIL)

MAEIL makes the cooperation process observable by defining explicit interaction stages that capture both agent-to-agent dynamics (reasoning, messaging, interruption) and agent-to-environment dynamics (joint execution, embodied interaction). An agent is ready when it has a committed plan and is not in active conversation with others. The environment advances only once all agents are simultaneously ready.

Step 1 - Single agent

Alternates between Reason (R) - freely reason and plan - and Execute (X) - run primitive actions.

R ⇄ X

Step 2 - Multi-agent, no communication

Wait (W) added - an agent enters W (ready) when it has a committed plan and is not in active conversation with others. The environment only advances once all agents are ready.

R, X → W → X (all ready)

Step 3 - Multi-agent with communication

Interrupt (I) added. Agents in R or I may send; any stage may receive:

sender: R, I ──msg──▶ R, X, W, I :receiver

Message arrival interrupts W or X, triggering replanning before re-entering W:

W, X ──msg──▶ I ──ready──▶ W
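
The stage transitions above can be written as a small state machine. The Python sketch below is one possible reading, not the framework's code; the `Stage` enum and the function names `can_send`, `deliver_message`, `mark_ready`, and `try_advance` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    R = "reason"
    X = "execute"
    W = "wait"
    I = "interrupt"


def can_send(stage):
    """Only agents in R or I may send messages."""
    return stage in (Stage.R, Stage.I)


def deliver_message(stage):
    """Any stage may receive; arrival interrupts W or X, while R and I continue."""
    return Stage.I if stage in (Stage.W, Stage.X) else stage


def mark_ready(stage):
    """An agent with a committed plan and no active conversation enters W."""
    return Stage.W


def try_advance(stages):
    """Advance the environment clock only if every agent is in W.

    Returns the new stage map and whether the environment stepped.
    """
    if all(s is Stage.W for s in stages.values()):
        return {i: Stage.X for i in stages}, True
    return dict(stages), False
```

For example, with agent 1 in W and agent 2 in X, a message to agent 2 moves it to I, the environment cannot step, and only after agent 2 replans and re-enters W does `try_advance` move both agents to X.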

Cooperation-constrained Task Design

Rather than designing environment-specific success criteria, EmCoop identifies four fundamental cooperation constraints that generalize across a wide variety of embodied tasks. Each constraint characterizes a distinct requirement agents must jointly satisfy to make progress - and together they define the cooperative structure of any task. Let $\mathcal{G}_t$ be the active task set at step $t$; each task $g\in\mathcal{G}_t$ has state $x_{g,t}$ and participating agents $\mathcal{I}_{g,t}\subseteq I$ with group capability state $Z_{g,t}=\{z_{i,t}\}_{i\in\mathcal{I}_{g,t}}$. Each constraint type $c$ is a Boolean predicate $C_g^{c}(x_{g,t}, s_t, Z_{g,t})\in\{0,1\}$:

Spatial ($C_g^{\Delta\ell}$): agents must be co-located or within a radius of the task location.
Temporal ($C_g^{\Delta t}$): actions must occur within a time window, often simultaneously.
Participation ($C_g^{n}$): at least $p(g)$ agents with required capabilities must be involved.
Dependency ($C_g^{d}$): prerequisite environmental conditions or sub-tasks must be satisfied before $g$ can progress.

We track satisfied constraints $\mathcal{C}_{g,t}^{+}=\{c \mid C_g^{c}=1\}$ and violated constraints $\mathcal{C}_{g,t}^{-}=\{c \mid C_g^{c}=0\}$ at every step, turning cooperation from a binary outcome into an observable process of constraint state evolution. Both environments impose a joint participation constraint: letting $N(C_c)(g,t)=\sum_{i\in I}\mathbb{I}[C_c^i(g,t)]$,

$$\mathbb{I}\!\left[\bigwedge_{c\in\{\Delta\ell,\,\Delta t,\,n\}} \bigl(N(C_c)(g,t)\ge p(g)\bigr)\right]$$

Constraints are instantiated per environment as:

Constraint	MA-Crafter	CUBE
Task $g$	Collectible resource; $p(g)$ agents `collect` simultaneously within radius $d$	Side $s(g)$ of block $b(g)$; $p(g)=w_{b(g)}$ (block weight)
$C_n^i$ (participation)	$\mathrm{cap}_i(t)\succeq \mathrm{cap}(g,t)$	$\mathrm{cap}_i(t)$ (any agent)
$C_{\Delta\ell}^i$ (spatial)	$\|x_{i,t}-y_g\|\le d$	$\min_{y\in\mathcal{Y}_g}\|x_{i,t}-y\|\le 1$
$C_{\Delta t}^i$ (temporal)	$a_{i,t}=\mathrm{act}(g)$	$a_{i,t}=\texttt{push}(b(g))$
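
To make the joint participation condition concrete, the Python sketch below evaluates the MA-Crafter column of the table for a single task. It follows the displayed formula literally, counting agents per constraint and comparing each count with $p(g)$. The dictionary layout, the integer capability level standing in for the ordering $\succeq$, and the Euclidean norm are assumptions of the sketch.

```python
import math


def per_agent_flags(agent, task, radius):
    """Per-agent predicates C_n^i, C_dl^i, C_dt^i for one task (MA-Crafter style)."""
    return {
        "n": agent["cap"] >= task["cap"],                      # capability suffices
        "dl": math.dist(agent["pos"], task["pos"]) <= radius,  # within radius d
        "dt": agent["action"] == task["act"],                  # acting on g this step
    }


def joint_constraint(agents, task, radius):
    """Return per-constraint counts N(C_c), the satisfied set, and the joint indicator."""
    counts = {"n": 0, "dl": 0, "dt": 0}
    for agent in agents:
        for c, ok in per_agent_flags(agent, task, radius).items():
            counts[c] += int(ok)
    satisfied = {c for c, k in counts.items() if k >= task["p"]}
    return counts, satisfied, len(satisfied) == len(counts)


task = {"pos": (0, 0), "cap": 1, "act": "collect", "p": 2}
agents = [
    {"pos": (1, 0), "cap": 1, "action": "collect"},
    {"pos": (0, 1), "cap": 2, "action": "move"},   # close and capable, but not collecting
    {"pos": (5, 5), "cap": 0, "action": "noop"},
]
counts, satisfied, ok = joint_constraint(agents, task, radius=1)
```

In this example the spatial and participation counts both reach $p(g)=2$, while only one agent acts on the task, so the temporal constraint is the one that blocks progress.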

Observing the Cooperation Process

Cooperation unfolds over two coupled levels: (i) a cognitive phase between environment transitions, in which agents reason and communicate, and (ii) a primitive phase in which each agent commits a primitive action and the environment advances, producing constraint satisfaction signals. Starting from environment state $s_t$, agents enter a reasoning-and-interaction interval $[\hat t,\hat t']$ during which they may communicate, interrupt or revise plans, resume execution, or remain idle. Once all agents reach a ready condition, a joint primitive action $a_t$ is committed and executed, advancing the environment to $s_{t+1}$, where cooperative constraints are evaluated:

$$s_{t+1}=\Bigl(\underbrace{\mathcal{G}_{t+1},\; Z_{t+1}}_{\{T_g\}_{g\in\mathcal{G}_{t+1}}},\; s_{\mathrm{other},t+1}\Bigr) \;\leftarrow\; \left\{\begin{aligned} \underbrace{a_t}_{\substack{\text{joint primitive}\\\text{actions}}} &\leftarrow \underbrace{F_{c}\!\bigl(X_{[\hat t,\hat t']},\;M_{[\hat t,\hat t']}\bigr)}_{\substack{\text{cooperation}\\\text{cognitive activities}}} \\[0.8em] s_t &= \Bigl(\underbrace{\mathcal{G}_t,\; Z_t}_{\{T_g\}_{g\in\mathcal{G}_t}},\;s_{\mathrm{other},t}\Bigr) \end{aligned}\right.$$

Here $X_{[\hat t,\hat t']}$ and $M_{[\hat t,\hat t']}$ summarize what is observable during the between-step cognitive phase, while the transition $s_t\!\to s_{t+1}$ captures what those cognitive activities achieve in embodied execution. Each task $g\in\mathcal{G}_t$ evolves according to a task-specific transition function $T_g$ defined over the primitive environment dynamics, while $Z_t$ represents agent capability states that may change as a result of execution. This separation enables attribution from agent-level MAEIL dynamics, plan events, and communication to environment-level task progress and constraint satisfaction.

Cognitive activities and embodied outcomes. Two metric groups capture key dimensions of this observable cooperation process:

Communication and decision overhead

Tracks message volume and imbalance across agents, capturing bottlenecks and topology-dependent communication costs. Crucially, because MAEIL records every state transition explicitly, we can observe how communication topology and cooperation patterns affect decision latency - the time each agent spends in each MAEIL state before the environment can advance. For example, if agent $j$ sends a message that interrupts agent $i$ mid-execution, triggering a $\mathsf{X}\!\to\!\mathsf{I}\!\to\!\mathsf{W}$ detour, the resulting synchronization cost is directly attributable to that interaction. This makes topology-induced reasoning bottlenecks and coordination delays observable at the per-agent, per-step level.
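
One way to compute per-agent time in each MAEIL stage from a recorded trace is sketched below in Python. The event format, a time-sorted list of `(time, agent, stage)` stage-entry records, and the function name `stage_durations` are assumptions; the actual log format is not described on this page.

```python
from collections import defaultdict


def stage_durations(events, end_time):
    """Sum the time each agent spends in each stage.

    events: stage-entry records (time, agent, stage), sorted by time.
    end_time: time at which the environment advances and the interval closes.
    """
    last = {}  # agent -> (entry time, stage) of the stage it is currently in
    totals = defaultdict(lambda: defaultdict(float))
    for time, agent, stage in events:
        if agent in last:
            prev_time, prev_stage = last[agent]
            totals[agent][prev_stage] += time - prev_time
        last[agent] = (time, stage)
    # Close the final open stage of every agent at end_time.
    for agent, (prev_time, prev_stage) in last.items():
        totals[agent][prev_stage] += end_time - prev_time
    return {agent: dict(per_stage) for agent, per_stage in totals.items()}


# Agent j's message at time 2.0 interrupts agent i mid-execution (X -> I -> W).
events = [
    (0.0, "i", "X"),
    (0.0, "j", "R"),
    (2.0, "i", "I"),
    (3.0, "j", "W"),
    (5.0, "i", "W"),
]
durations = stage_durations(events, end_time=6.0)
```

Here the time agent `i` spends in `I` is the detour caused by the interruption, and the time agent `j` spends in `W` is the resulting synchronization cost.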

Plan and execution quality

Tracks plan coherence, interruptions, replanning, and resumption rates, revealing stability vs. conflict during cooperation. Execution quality is assessed via cooperative constraint satisfaction:

$$C^{\mathrm{coop}}_g = C^{d}_g(x_{g,t}, s_t) \;\wedge\!\!\bigwedge_{c\in\{\Delta\ell,\Delta t,n\}}\!\! C^{c}_g(x_{g,t}, Z_{g,t})$$

with components $C^{\Delta\ell}, C^{\Delta t}, C^{d}, C^{n}$ corresponding to spatial, temporal, dependency, and participation constraints. Evaluating $C^{\mathrm{coop}}_g$ induces $\mathcal{C}^+_{g,t}$ and $\mathcal{C}^-_{g,t}$, directly attributing execution failures to specific constraint types. Capability changes $\Delta Z_t = Z_{t+1} - Z_t$ further capture the realized embodied effects of agents' plans as mediated by environment dynamics.
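
The attribution step can be sketched in a few lines of Python: given the four Boolean constraint values for a task, split them into $\mathcal{C}^+_{g,t}$ and $\mathcal{C}^-_{g,t}$, and compute $\Delta Z_t$ as a per-agent difference. Representing capability states as item-count dictionaries is an assumption of this sketch, as are the function names.

```python
def evaluate_coop(flags):
    """flags maps constraint type ('d', 'dl', 'dt', 'n') to its Boolean value.

    Returns (C_coop, satisfied set C+, violated set C-).
    """
    satisfied = {c for c, ok in flags.items() if ok}
    violated = set(flags) - satisfied
    return not violated, satisfied, violated


def capability_delta(z_before, z_after):
    """Delta Z as non-zero item-count changes per agent (hypothetical representation)."""
    delta = {}
    for agent in z_before.keys() | z_after.keys():
        before = z_before.get(agent, {})
        after = z_after.get(agent, {})
        changes = {
            item: after.get(item, 0) - before.get(item, 0)
            for item in before.keys() | after.keys()
        }
        changes = {item: diff for item, diff in changes.items() if diff != 0}
        if changes:
            delta[agent] = changes
    return delta


ok, c_plus, c_minus = evaluate_coop({"d": True, "dl": True, "dt": False, "n": True})
dz = capability_delta(
    {"a": {"wood": 1}},
    {"a": {"wood": 0, "pickaxe": 1}, "b": {"pickaxe": 1}},
)
```

A failed step with `c_minus == {"dt"}` points to a timing problem rather than a positioning or capability problem, which is the kind of attribution a success rate alone cannot give.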

Trace: constraint satisfaction and capability changes

Trace visualization. Top: cognitive activity and messages. Middle: primitive execution. Bottom: constraint satisfaction and capability changes.

Communication Topologies

Different communication topologies reflect different patterns of agent cooperation - who talks to whom, when, and how often. EmCoop does not prescribe a topology: it captures all agent interactions (messages, interruptions, stage transitions) independently of whatever structure is in use, and works with any topology - fixed, dynamic, or emergent. Studying cooperation under specific topologies is one way to probe how interaction patterns shape outcomes; EmCoop makes those patterns observable regardless.

Individual: no communication - agents act from private observations only.
Debate: agents speak in order, broadcasting to others before planning.
Centralized: a leader broadcasts directives and waits for follower acknowledgments.
Decentralized: free communication with bounded per-step budgets.
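
The four topologies can be viewed as rules for who may message whom. The Python sketch below encodes one plausible reading; the function names, the treatment of follower acknowledgments as leader-only messages, and the simple per-step counter are assumptions, and speaking order for Debate is not modeled.

```python
def allowed_recipients(topology, sender, agents, leader=None):
    """Return the set of agents `sender` may message under `topology`."""
    others = set(agents) - {sender}
    if topology == "individual":
        return set()                      # no communication at all
    if topology == "debate":
        return others                     # broadcast to everyone else
    if topology == "centralized":
        # Leader broadcasts directives; followers only acknowledge to the leader.
        return others if sender == leader else {leader}
    if topology == "decentralized":
        return others                     # free choice, limited by a budget
    raise ValueError(f"unknown topology {topology!r}")


def within_budget(sent_this_step, budget):
    """Bounded per-step budget used in the decentralized setting."""
    return sent_this_step < budget
```

Because the observation layer records messages, interruptions, and stage transitions regardless of this rule, swapping the function is enough to run a topology intervention in such a setup.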

Topology intervention. Compare cooperation outcomes and failure modes under controlled communication structure changes.

Experiments and Findings

RQ1: Topology effects

Centralized settings often increase plan coherence, while decentralized settings raise communication load. Individual baselines reduce interruptions and overhead by removing communication entirely.

RQ2: Where breakdown occurs

As difficulty increases, participation constraints become harder to satisfy and spatio-temporal misalignment grows, enabling attribution of failure to specific constraint types rather than only success rates.

Figure 3: Radar plots and bars for metrics across topologies

Metrics reveal tradeoffs. Communication topology impacts coherence, stability, overhead, and message load, and these effects grow with difficulty and team size. (See paper Figure 3.)

RQ3: Feedback helps

Adding explicit environment feedback to leader messages improves feasibility checks and helps teams collect multiple resources in hard settings by exposing participation and dependency snapshots.

Figure 4: Feedback case study timeline (MA-Crafter)

Environment feedback. Leader attaches constraint and dependency snapshots to messages, improving coordination under hard settings.

Takeaway

Cooperation cannot be faithfully captured by any single metric - it is a process. By explicitly modeling and observing the cooperation process, EmCoop lets you define any metric, run any analysis, and diagnose any failure mode that matters to you.

BibTeX

@article{yang2026emcoop,
  title   = {EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents},
  author  = {Yang, Hanqing and Chen, Shiyu and Nourzad, Narjes and Siew, Marie and Chen, Jingdi and Joe-Wong, Carlee},
  year    = {2026},
  note    = {Preprint}
}