Benchmarking Language Agents on Open-Ended Multi-Agent Coordination in Game Worlds

6 Jun, 2026

Kale-Ab Tessera

Andras Szecsenyi

Cameron Barker

Alexander Rutherford

Davide Paglieri

Aidan Scannell

Henry Gouk

Elliot J. Crowley

Tim Rocktäschel

Amos Storkey



Abstract

As language models are increasingly deployed as autonomous agents, they will need to coordinate with others in long-horizon, open-ended interactive tasks. Yet current evaluations rarely test these demands together, focusing instead on short interactions, single-agent open-ended tasks, or highly structured multi-agent settings. We introduce Alem, a JAX-based, procedurally generated open-ended benchmark for multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon, game-like survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Most LLM agents struggle, averaging only ~6% of maximum reward, but their failures are not uniform. Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps on Hard coordination (17.5% vs. 17.6% Coord.%), while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows both the promise and the limitation of frontier LLM agents: a zero-shot model can approach trained MARL coordination performance, yet individual task competence does not imply coordination competence. Ablations further show that communication is central for sharing intent, while memory and reasoning help when they support multi-step planning. These results identify coordination as a distinct bottleneck for current LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable, providing a controlled setting for developing agents that can communicate, allocate roles, and execute shared plans.


Publication

arXiv preprint arXiv:2606.08340

Last updated on 9 Jun, 2026

