Benchmarking Language Agents on Open-Ended Multi-Agent Coordination in Game Worlds

6 Jun, 2026
Kale-Ab Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey
Abstract
As language models are increasingly deployed as autonomous agents, they will need to coordinate with others in long-horizon, open-ended interactive tasks. Yet current evaluations rarely test these demands together, focusing instead on short interactions, single-agent open-ended tasks, or highly structured multi-agent settings. We introduce Alem, a JAX-based, procedurally generated open-ended benchmark for multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon, game-like survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Most LLM agents struggle, averaging only about 6% of maximum reward, but their failures are not uniform. Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps on Hard coordination (17.5% vs. 17.6% Coord.%), while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows both the promise and the limitation of frontier LLM agents: a zero-shot model can approach trained MARL coordination performance, yet individual task competence does not imply coordination competence. Ablations further show that communication is central for sharing intent, while memory and reasoning help when they support multi-step planning. These results identify coordination as a distinct bottleneck for current LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable, providing a controlled setting for developing agents that can communicate, allocate roles, and execute shared plans.
Type
Publication
arXiv preprint arXiv:2606.08340