AI Benchmark Center
Live leaderboards across reasoning, coding, math, expert knowledge, and agent performance.
Current Leaders by Category
Overall Model Ranking
(avg. normalized score across all benchmarks)
Long-Context QA (LC-QA-256K)
[Score chart; dates shown: Jun 15, Jun 12, Jun 10, Jun 8, Jun 5, 2026]
Multi-Document Summarization (MD-Summ-10Doc)
[Score chart; dates shown: Jun 16, Jun 14, Jun 11, Jun 9, Jun 6, 2026]
Complex Document Understanding (CDU-Financial)
[Score chart; dates shown: Jun 18, Jun 17, Jun 13, Jun 7, Jun 4, 2026]
Legal Document Interpretation (LDI-Contract)
[Score chart; dates shown: Jun 20, Jun 19, Jun 15, Jun 10, Jun 7, 2026]
Scientific Paper Synthesis (SPS-Interdisciplinary)
[Score chart; dates shown: Jun 22, Jun 21, Jun 16, Jun 12, Jun 9, 2026]
Flores-200
[Score chart; dates shown: May 15, Apr 22, May 1, Mar 10, Feb 28, 2026]
M3Exam
[Score chart; dates shown: Jun 1, May 15, Apr 22, May 20, Mar 10, 2026]
XNLI
[Score chart; dates shown: Jun 5, May 25, Apr 22, Mar 10, May 1, 2026]
IndicMT-Bench
[Score chart; dates shown: May 8, Mar 10, May 15, Apr 15, Apr 22, 2026]
AfriTranslate
[Score chart; dates shown: Jun 1, Jun 10, May 2, May 15, Mar 10, 2026]
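The Overall Model Ranking above is described as an average normalized score across all benchmarks. The page does not say which normalization is used, so the following is only a minimal sketch that assumes min-max scaling per benchmark followed by an unweighted mean. The function names, model names, and scores are all hypothetical placeholders, not leaderboard data.

```python
from statistics import mean


def min_max_normalize(scores):
    """Scale one benchmark's raw scores to [0, 1] across models (assumed scheme)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tied on this benchmark: use 0.5 instead of dividing by zero.
        return {model: 0.5 for model in scores}
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}


def overall_ranking(results):
    """results maps benchmark name -> {model name -> raw score}.

    Returns (model, average normalized score) pairs, best first.
    """
    per_model = {}
    for scores in results.values():
        for model, value in min_max_normalize(scores).items():
            per_model.setdefault(model, []).append(value)
    averaged = {model: mean(vals) for model, vals in per_model.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical placeholder data on two different scales.
example = {
    "bench_a": {"model_x": 80.0, "model_y": 60.0, "model_z": 75.0},
    "bench_b": {"model_x": 0.6, "model_y": 0.9, "model_z": 0.5},
}
ranking = overall_ranking(example)
```

Normalizing before averaging keeps a benchmark scored on a 0 to 100 scale from dominating one scored on a 0 to 1 scale. Other schemes, such as z-scores or rank averaging, would give different orderings.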
About These Benchmarks
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 subjects spanning STEM, the humanities, and the social sciences. 14,000+ multiple-choice questions.
HumanEval (Coding)
164 hand-crafted programming challenges. Measures ability to produce correct code from docstrings.
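To illustrate what "correct code from docstrings" means in practice, here is a minimal sketch of a functional-correctness check: a candidate completion counts as correct only if it passes the problem's unit tests. The `add` problem and the `passes_tests` helper are hypothetical, written to resemble the prompt, `check(candidate)`, and entry-point layout of HumanEval problems. A real harness would run candidates in a sandbox with timeouts, which this sketch omits.

```python
def passes_tests(candidate_source, test_source, entry_point):
    """Return True if the candidate defines entry_point and passes every assert.

    WARNING: exec() runs arbitrary code. Only use this on trusted input;
    real evaluation harnesses isolate untrusted model output.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # defines the candidate function
        exec(test_source, namespace)       # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, missing names, and failed asserts all count as failure.
        return False
    return True


# Hypothetical toy problem: signature plus docstring is the prompt.
prompt = '''def add(a, b):
    """Return the sum of a and b."""
'''
good = prompt + "    return a + b\n"  # correct completion
bad = prompt + "    return a - b\n"   # incorrect completion

tests = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

good_result = passes_tests(good, tests, "add")
bad_result = passes_tests(bad, tests, "add")
```

Here `good_result` is `True` and `bad_result` is `False`. A benchmark score would then be the fraction of problems for which a sampled completion passes.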
MATH
12,500 competition math problems drawn from contests such as AMC 10, AMC 12, and AIME. Tests advanced mathematical reasoning.
SWE-bench Verified
Real GitHub issues from popular open-source Python repos, human-validated as solvable. Measures whether a model can produce a patch that resolves the issue and passes the repo's tests.