LMArena: How the Web’s Biggest LLM Leaderboard Works

It’s 6:40 p.m. and your release train leaves tomorrow.
Two models are neck-and-neck—but one crushes long prompts while the other shines on code diffs. You open LMArena, run 20 real prompts across Overall → Coding → Longer Query, then try your repo in RepoChat. By 7:15 p.m., the pattern is obvious: route by task, note the trade-offs, and ship with confidence.


Answer in 30 seconds

If you need a fast, human-preference signal to shortlist LLMs, use LMArena: run anonymous battles with your real prompts, interpret small Elo gaps as noise (use the confidence interval), and pick per-task winners instead of a single global #1. Then re-check periodically because leaderboards and models drift.

What LMArena is (in 60 seconds)

LMArena (formerly Chatbot Arena) is a public leaderboard where people compare LLM answers head-to-head. Each anonymous vote updates an Elo-style rating. It’s ideal when you need to choose a model for a task (coding, web dev, long prompts, etc.) and want a human-preference signal rather than static benchmarks alone.

How it works (1-minute version)

  • Anonymous battles: two models answer the same prompt; you vote before names reveal.
  • Arena Elo: votes update ratings; leaderboard also shows MT-Bench and MMLU panels to triangulate quality.
  • Only anonymous battles move Elo: Side-by-Side is for labeled comparison; it doesn’t affect scores (see [1], [2], and the official FAQ [3]).

Snapshot (as observed on Jan 29–Feb 1, 2026)

Rankings drift. Treat tables as snapshots, and re-check the live leaderboards before you ship: [4], [5], [6].
If you want to track changes without re-reading this guide, follow the official changelog: [7].

Text Arena — Last updated Jan 29, 2026

| Rank | Model | Score |
|------|-------|-------|
| 1 | gemini-3-pro | 1487 ± 5 |
| 2 | grok-4.1-thinking | 1475 ± 5 |
| 3 | gemini-3-flash | 1471 ± 5 |
| 4 | claude-opus-4-5-20251101-thinking-32k | 1468 ± 5 |
| 5 | claude-opus-4-5-20251101 | 1466 ± 5 |
| 6 | grok-4.1 | 1466 ± 4 |
| 7 | gemini-3-flash (thinking-minimal) | 1463 ± 6 |
| 8 | gpt-5.1-high | 1459 ± 5 |
| 9 | ernie-5.0-0110 | 1453 ± 7 (preliminary) |
| 10 | claude-sonnet-4-5-20250929-thinking-32k | 1450 ± 4 |

Treat “Model” strings as leaderboard labels. Small gaps near the top are often noise unless the 95% confidence intervals are clearly separated.
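
A quick way to apply that rule in code: a minimal sketch that does interval arithmetic on the printed “rating ± margin” values. Treating non-overlap of two individual 95% intervals as “separated” is a rough heuristic, not a formal significance test:

```python
def separated(r1: float, m1: float, r2: float, m2: float) -> bool:
    """True if the two rating intervals (r ± m) do not overlap."""
    return (r1 - m1) > (r2 + m2) or (r2 - m2) > (r1 + m1)

# Values from the Text Arena table above:
print(separated(1487, 5, 1475, 5))  # True: ranks 1 vs 2 likely differ
print(separated(1468, 5, 1466, 5))  # False: ranks 4 vs 5 read as a tie
```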

WebDev Arena — snapshot (as of Feb 2, 2026)

Top signals:

  • GPT-5 (high) leads with 1472.37 (95% CI +8.18 / -6.69)
  • Claude Opus 4.1 thinking-16k (20250805) follows at 1456.34
  • Several models cluster tightly behind, with overlapping CIs.

Code Arena (WebDev) — Last updated Feb 1, 2026

| Rank | Model | Score |
|------|-------|-------|
| 1 | claude-opus-4-5-20251101-thinking-32k | 1500 ± 9 |
| 2 | gpt-5.2-high | 1472 ± 16 |
| 3 | claude-opus-4-5-20251101 | 1470 ± 9 |
| 4 | gemini-3-pro | 1453 ± 8 |
| 5 | kimi-k2.5-thinking | 1447 ± 16 |

Editorial note: today’s top cluster spans multiple vendors (e.g., Google, OpenAI, Anthropic, xAI, Baidu), which is exactly why “route by task” beats “pick a global #1”.


What changed since 2025 (how to keep this guide current)

LMArena evolves fast: new models enter, arenas change, and methodology updates get logged publicly.

My rule: treat leaderboard tables as snapshots and track changes via:

  1. the live leaderboards,
  2. the Leaderboard Changelog (methodology + additions),
  3. periodic re-checks on your prompt set.

Notable recent examples include new model additions to the Code leaderboard and broader evaluation updates logged in the official changelog: [7]. For product experiments that can affect how people use the Arena, see: [8].


Under the hood: Elo in 90 seconds

Mental model: LMArena is doing pairwise ranking at massive scale. Instead of asking people to score models on an absolute 1–10 scale, it asks a simpler question: “Which answer wins?” That tends to reduce rater variance and produces a more stable ordering.

  • Elo is relative. A model’s rating is updated by wins/losses vs opponents on the prompts it actually faced.
  • Confidence intervals (CI) matter. When two models’ 95% CIs overlap, a small gap (e.g., +5–15 Elo) is often indistinguishable from noise.
  • Selection bias is real. If a model mostly fights certain opponents or prompt types, its Elo reflects that mixture—not a universal “IQ”.
```
Big gap  + non-overlapping CI → likely a real difference
Tiny gap + overlapping CI     → treat as a tie; break with your prompts
```
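
To make the mechanics concrete, here is a minimal sketch of a classic online Elo update. The constants (K = 32, 400-point scale) are standard defaults, not LMArena’s actual parameters; the live leaderboard fits ratings statistically (Bradley–Terry style, with bootstrapped confidence intervals) rather than updating them one vote at a time:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one anonymous battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Near-equal ratings: one win shifts each side by roughly K/2 points.
print(elo_update(1487, 1475, a_won=True))  # ≈ (1502.4, 1459.6)
```

A single vote can move a rating by at most K points, so a stable ordering emerges only after many battles; the CI column tells you how settled it is.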

Three ways to use LMArena (hands-on)

1) Direct Chat — fast single-model sanity check

Use when you want one good draft quickly (tone/style, first pass).

Paste this prompt to try:

```
You are an AI assistant helping a software team. The team needs to decide
whether to use retrieval-augmented generation (RAG) or fine-tuning for
adapting a large language model to their domain (technical documentation +
support tickets).

Write a concise recommendation memo (~250 words) that:
1. explains pros/cons of both (clarity, cost, speed, maintenance),
2. gives a final recommendation with reasoning,
3. ends with "Next steps" bullets.

Make it clear enough that a CTO could decide in 5 minutes.
```

What to look for: structure, trade-off clarity, actionable next steps.

Direct Chat view


2) Battle — anonymous A vs B (you vote after reading)

Best when you’re deciding between two candidates and want blind judgment.

How to run a good battle

  1. Paste a real prompt from your backlog.
  2. Read both answers before names reveal.
  3. Vote for the one you’d ship to a user today.

Quick rubric: correctness → reasoning → instruction-following → conciseness → citations (if any).
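
If several teammates vote, a shared scoring sheet keeps judgments comparable. Here is a hypothetical weighted rubric as a sketch; the criteria mirror the ordering above, but the weights and the 0–5 scale are illustrative assumptions:

```python
# Illustrative weights; tune to your task (they should sum to 1.0).
RUBRIC = {
    "correctness": 0.35,
    "reasoning": 0.25,
    "instruction_following": 0.20,
    "conciseness": 0.10,
    "citations": 0.10,
}

def rubric_score(marks: dict[str, int]) -> float:
    """marks: criterion -> 0..5; returns a weighted 0..5 score."""
    return sum(weight * marks.get(criterion, 0) for criterion, weight in RUBRIC.items())

print(rubric_score({"correctness": 5, "reasoning": 4, "instruction_following": 4,
                    "conciseness": 3, "citations": 2}))  # 4.05
```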

Battle view (both answers recommend RAG; judge clarity & specificity)


3) Side-by-Side — explicit model vs model

Pick two finalists (e.g., gemini-2.5-pro vs claude-opus-4.1-20250805-thinking-16k) for a labeled comparison.

Tips

  • Keep the same prompt across runs to compare apples to apples.
  • Do 5–10 prompts per task (long spec, coding bug, data wrangling).
  • Note failure modes: latency spikes, hallucinations, messy formatting.

Side-by-Side view


How to read the leaderboard without overfitting

  • Start with the right tab. If you ship front-end code, WebDev matters more than Text “Overall”.
  • Triangulate. Elo is human preference; cross-check the MT-Bench/MMLU panels shown next to rows.
  • Beware tiny gaps. A few points at the top are often noise—use battles on your prompts to decide.

Leaderboard overview tiles

Key takeaways (for production teams)

  • Treat Elo as a shortlist signal, not a final decision.
  • Break near-ties with your prompts, latency, and failure-case tests.
  • Prefer routing by task (coding vs long context vs domain QA) over a single “best model”.
  • Re-check monthly (or before major launches): model updates and leaderboard drift are normal.

A 20-minute evaluation recipe (teams)

  1. Collect 6–8 real prompts (2 coding, 2 long-form, 2 domain-specific).
  2. Shortlist 3 models from the relevant tab.
  3. Run 10 battles total (shuffle pairs).
  4. Record winners + notes (reasoning, latency, formatting).
  5. Pick per-task winners; route in prod by task if needed.
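
For step 4, even a tiny script beats ad-hoc notes for tallying per-task winners. A minimal sketch; the record fields here are illustrative, not an LMArena export format:

```python
from collections import Counter, defaultdict

# One record per battle you ran; fill in from your notes.
battles = [
    {"task": "coding", "winner": "model_a"},
    {"task": "coding", "winner": "model_b"},
    {"task": "long_form", "winner": "model_a"},
    # ... the rest of your 10 battles
]

wins: dict[str, Counter] = defaultdict(Counter)
for b in battles:
    wins[b["task"]][b["winner"]] += 1

for task, counts in wins.items():
    best, n = counts.most_common(1)[0]
    print(f"{task}: {best} wins {n}/{sum(counts.values())}")
```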

Lightweight monthly health-check

  • 6 prompts × 3 categories (Coding, Longer Query, Your Domain)
  • 6–10 battles per category vs last month’s winner
  • Log: win rate, latency, formatting, citation quality
  • Update your router only if deltas persist across two checks
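
The last rule above is easy to encode. A sketch, assuming you log the challenger’s win rate against the incumbent once per monthly check; the 60% threshold is an illustrative choice:

```python
def should_update_router(history: list[float], threshold: float = 0.6) -> bool:
    """history: challenger win rate per monthly check, most recent last."""
    return len(history) >= 2 and all(wr >= threshold for wr in history[-2:])

print(should_update_router([0.70]))        # False: wait for a second check
print(should_update_router([0.65, 0.70]))  # True: the delta persisted
```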
Related reading

  • Retrieval basics: /learn/expert/introduction-to-information-retrieval-systems
  • RAG architecture & trade-offs: /learn/expert/cag-vs-rag
  • Production fundamentals (context/tokens/tool use): /learn/midway/llm-practical-fundamentals-guide-ai-apps
  • 2026 enterprise playbook (agents, evaluation, governance): /learn/midway/generative-ai-roadmap-2026-enterprise-playbook

Curiosity bites

  • Side-by-Side ≠ Elo: SxS is labeled for audits; only anonymous battles change ratings (see [2] and the official FAQ [3]).
  • Repo-aware surprises: models that look average overall can still win in RepoChat on your codebase.
  • Latency matters: ask raters to flag when answers are “too slow to be useful.”

Caveats & privacy

  • Human preference ≠ ground truth. Treat Elo as one signal; verify with your tests/benchmarks.
  • Small Elo deltas are noisy. Break ties with your own prompts.
  • Privacy: treat LMArena as public—don’t paste sensitive data.

Failure modes & Goodhart (how to not get fooled)

  • Prompt-style bias: “polished” writing can beat correct-but-dry answers in preference voting.
  • Verbosity bias: longer answers often “feel” better—even when they’re less precise.
  • Rater variance: skill, patience, and domain knowledge vary wildly across voters.
  • Domain mismatch: a model that wins general prompts can lose badly on your niche format, jargon, or constraints.
  • Post-training artifacts: alignment, memorization, and “benchmark behavior” can move rankings without improving real-world reliability.

Don’t pick a single winner—pick a routing plan

Run 10–20 arena battles per task type, compare Overall + Category Elo with MT-Bench/MMLU, and keep a lightweight rubric (correctness, latency, citations/tool-use). If the top two are close, prefer the one that wins your failure cases. Re-check monthly; model drift is real.
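
In code, a routing plan can start as a plain lookup table with a fallback. A hypothetical sketch; the task names and model ids below are placeholders for your own per-task winners:

```python
# Map task types to the models that won your battles; placeholders only.
ROUTES = {
    "coding": "model_x",
    "long_context": "model_y",
    "domain_qa": "model_z",
}
DEFAULT_MODEL = "model_y"

def route(task: str) -> str:
    """Pick a model for a task, falling back to the default."""
    return ROUTES.get(task, DEFAULT_MODEL)

print(route("coding"))       # model_x
print(route("translation"))  # falls back to model_y
```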


Mini-template: 20-minute LMArena eval (copy/paste)

Goal
- Task(s):
- Constraints (latency, format, citations, tools):

Prompt set (6–8)
1.
2.
3.

Shortlist
- Model A:
- Model B:
- Model C:

Battles (10–20)
- Prompt #1: winner, why (notes)
- Prompt #2: winner, why (notes)

Operational notes
- Latency: (p50/p95 impressions)
- Formatting issues:
- Failure modes observed:

Ship decision
- Winner per task:
- Router rules (if any):
- Recheck date:

FAQ

What exactly is LMArena?

A public, anonymous, pairwise arena where humans vote on better answers; an Elo-style system turns those votes into a live leaderboard.

Is Elo just “who sounds nicer”?

It’s a preference signal. Use it with MT-Bench/MMLU and category filters to avoid style-over-substance traps.

Which model is #1 right now?

It changes. Check Overall, then your category; small Elo gaps are noise—validate on your prompts.

Are my prompts private?

Treat them as not private; don’t paste sensitive data.

Why do teams use LMArena?

To de-risk launches, route by task, detect regressions, tune prompts/guardrails, and validate cost–performance trade-offs on real prompts.

Is Elo reliable on its own?

It captures human preference, not ground truth. Triangulate with MT-Bench, MMLU, and category leaderboards, and validate on your prompts.

How do I add my model?

Follow FastChat’s “How to add a new model” guide: host the model yourself or behind a third-party API, or request LMSYS hosting.

Is there recent open data I can analyze?

Yes—LMArena/LMSYS periodically releases open datasets. Starting points:

  • Chatbot Arena Conversations (33k): [9].
  • Arena Human Preference (≈140k): [10].
  • LMSYS-Chat-1M (1M): [11].
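
To explore these locally, the Hugging Face `datasets` library is a reasonable starting point. A sketch for the 33k Chatbot Arena Conversations release; the repo id and column names reflect that dataset’s Hub listing, and some releases are gated, so you may need to accept the terms and authenticate first:

```python
from datasets import load_dataset

# 33k Chatbot Arena battles with human votes.
ds = load_dataset("lmsys/chatbot_arena_conversations", split="train")

row = ds[0]
print(row["model_a"], "vs", row["model_b"], "->", row["winner"])
```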
References

  1. How it works
  2. Battle Mode help
  3. FAQ
  4. LMArena Text Leaderboard
  5. LMArena Code Leaderboard
  6. LMArena WebDev Leaderboard
  7. Leaderboard Changelog
  8. Auto-Modality Experiments
  9. Chatbot Arena Conversations (33k) | Hugging Face
  10. Arena Human Preference (≈140k) | Hugging Face
  11. LMSYS-Chat-1M (1M) | Hugging Face
