DAIR.AI on X

When does combining LLMs help?

Great analysis on combining language models, measured across 67 models from 21 providers.

Any policy that routes, votes, cascades, or runs a mixture of agents and then returns one model's answer has its accuracy bounded above by 1 minus beta, where beta is the fraction of queries that every candidate model gets wrong.
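
A minimal sketch of that bound, assuming you already have a per-query correctness table for your candidate models. The function names and the toy table are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def all_wrong_fraction(correct: Sequence[Sequence[bool]]) -> float:
    """beta: fraction of queries on which every candidate model is wrong.

    correct[q][m] is True when model m answers query q correctly.
    """
    if not correct:
        raise ValueError("need at least one query")
    return sum(1 for row in correct if not any(row)) / len(correct)


def selection_ceiling(correct: Sequence[Sequence[bool]]) -> float:
    """Upper bound (1 - beta) for any policy that returns one model's answer."""
    return 1.0 - all_wrong_fraction(correct)


def best_single_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """Accuracy of the strongest individual model, for comparison."""
    n_models = len(correct[0])
    return max(
        sum(row[m] for row in correct) / len(correct) for m in range(n_models)
    )


# Toy table (made up): 5 queries x 3 models.
correct = [
    [True, False, False],
    [False, True, False],
    [False, False, False],  # every model fails: this query counts toward beta
    [True, True, True],
    [False, False, True],
]

beta = all_wrong_fraction(correct)
ceiling = selection_ceiling(correct)
headroom = ceiling - best_single_accuracy(correct)
```

In this toy table beta is 0.2, so no selection policy can exceed 0.8 accuracy, and the gap between that ceiling and the best single model (0.4 accuracy) is the most a perfect router could gain.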

The common justification for ensembling is diversity, usually measured as low pairwise error correlation. The paper proves that correlation cannot identify beta, so decorrelation does not establish that headroom exists. And across the 67 models, real co-failures are far more concentrated than independence-style assumptions predict.
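
One way to see why pairwise correlation cannot pin down beta is a toy construction with three models. It is a standard pairwise-independence example, not the paper's proof. All three error tables below have a 50% error rate per model and zero correlation between every pair, yet beta differs.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length 0/1 sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(vx * vy)


def summarize(errors):
    """Return (pairwise error correlations, beta) for rows of 0/1 error flags.

    errors[q][m] is 1 when model m is wrong on query q.
    """
    cols = list(zip(*errors))
    corrs = [pearson(cols[i], cols[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
    beta = sum(1 for row in errors if all(row)) / len(errors)
    return corrs, beta


pairs = list(product((0, 1), repeat=2))

# Fully independent errors: each of the 8 patterns appears once.
independent = list(product((0, 1), repeat=3))

# Model 3 errs exactly when models 1 and 2 agree with each other.
co_failing = [(a, b, 1 - (a ^ b)) for a, b in pairs] * 2

# Model 3 errs exactly when models 1 and 2 disagree with each other.
complementary = [(a, b, a ^ b) for a, b in pairs] * 2

results = {
    name: summarize(table)
    for name, table in (
        ("independent", independent),
        ("co_failing", co_failing),
        ("complementary", complementary),
    )
}
```

Here `results` holds all-zero pairwise correlations for each table, while beta is 0.125, 0.25, and 0.0 respectively. Identical diversity statistics are compatible with different ceilings, which is why beta has to be measured directly.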

Before assuming a router or mixture-of-agents (MoA) setup will help, measure beta. Co-failures cluster by answer format rather than by subject.

Paper: https://arxiv.org/abs/2606.27288

Learn to build effective AI agents in our academy: https://academy.dair.ai/