LMMs-Eval v0.6

Overview

After developing LMMs-Eval for over a year, we've integrated 100+ tasks and 30+ models across images, videos, and audio. Throughout this journey, we've grown increasingly aware that evaluation itself deserves the same rigor we demand from the models we evaluate. A benchmark that cannot reliably indicate a model's true capabilities is not just unhelpful; it actively misleads research directions.

This realization drives v0.6: a re-architecture designed to make evaluation fast enough to iterate, rigorous enough to trust, and challenging enough to matter.

Building on lmms-eval's existing framework, v0.6 transforms evaluation from a one-off script into a production-ready evaluation system. This enables two critical workflows:

  • During training: Evaluation runs as a standalone service, decoupled from the training loop. Submit checkpoints for async evaluation without blocking GPU training.
  • Post-training: Rapid, comprehensive evaluation across all modalities with statistical guarantees on the results.

From the engineering side, v0.6 also ships a substantial API-throughput upgrade. With the latest API control-path updates (adaptive concurrency, refill scheduling, prefix-aware queueing, and retry/backoff decoupling), we observe about 7.5x throughput improvement on a fixed LIMIT=100 benchmark (0.3278 -> 2.4584 req/s), while preserving metric outputs for the same task/model setup.

| Area | Key Features |
| --- | --- |
| Performance | Fully async and decoupled inference; adaptive API concurrency control; prefix-aware queueing; measured ~7x+ throughput gain on API benchmark path |
| Evaluation as a Service | Async job submission without blocking GPU training; separately hosted eval service on dedicated GPUs |
| Statistical Rigor | Confidence intervals, clustered standard errors, baseline-anchored paired comparison |
| Frontier Evaluation | Long video, spatial intelligence, and agentic scenarios |

1. Architecture

1.1 Evaluation Pipeline

v0.6 defines evaluation as three decoupled components:

┌─────────┐      ┌─────────┐      ┌─────────┐
│  Input  │ ───► │  Model  │ ───► │  Judge  │
└─────────┘      └─────────┘      └─────────┘

| Component | Contents | Output |
| --- | --- | --- |
| Input | Multimodal data + question + ground truth | Instance objects |
| Model | LMM inference (local or API) | Generated responses |
| Judge | Metric computation (exact match, LLM judge, etc.) | Scores |

Async Pipeline with Cache

Both stages (Input -> Model and Model -> Judge) run asynchronously, with model outputs cached between them:

graph LR
    Input[Input]
    Model[Model]
    Storage[Storage]
    Judge[Judge]
    
    Input -- async --> Model
    Model -- "cache" --> Storage
    Storage -- async --> Judge

    Key["Cache key: hash(input + config + git_commit)"]
    
    Model -.-> Key

Cache System

Avoid redundant inference when the same evaluation has been run before:

| Cache Key Component | Purpose |
| --- | --- |
| Input hash | Same dataset + question |
| Config hash | Same generation parameters (temp, max_tokens, etc.) |
| Git commit | Same model code version |

Cache hit -> skip inference, reuse stored outputs. Cache miss -> run inference, store results.
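The key construction and hit/miss logic can be sketched as follows. This is a minimal illustration, not the lmms-eval implementation: `make_cache_key`, `OutputCache`, and the field names (`doc_id`, `prompt`) are hypothetical, and the in-memory dict stands in for whatever persistent storage the real cache uses.

```python
import hashlib
import json


def _digest(obj) -> str:
    """Stable SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_cache_key(doc_id: str, prompt: str, gen_kwargs: dict, git_commit: str) -> str:
    """Combine input hash, config hash, and code version into one key."""
    input_hash = _digest({"doc_id": doc_id, "prompt": prompt})
    config_hash = _digest(gen_kwargs)  # sort_keys makes dict ordering irrelevant
    return _digest([input_hash, config_hash, git_commit])


class OutputCache:
    """Hypothetical in-memory cache; a real one would persist to disk."""

    def __init__(self):
        self._store = {}

    def get_or_run(self, key: str, run_inference):
        """Return (output, was_hit). Inference only runs on a miss."""
        if key in self._store:
            return self._store[key], True
        output = run_inference()
        self._store[key] = output
        return output, False
```

Hashing each component separately keeps the key sensitive to any single change (a new temperature, a new commit) while staying independent of dictionary ordering.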

Benefits:

  • No redundant computation: identical runs return cached results instantly
  • Crash recovery: resume from cached outputs without re-inference
  • Resource separation: Model (GPU) and Judge (API) can run on different machines
  • Reproducibility: cache key ensures exact same conditions

Stage 1: Input -> Model

Prefix Cache Optimization

vLLM/SGLang reuse KV cache for shared prefixes. We cluster inputs by media to maximize hits:

  • Prefix clustering: Group questions by shared image/video
  • Length sorting: Similar lengths in same batch
  • Chunked prefill: Split long prompts into chunks to reduce TTFT and avoid OOM
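The first two steps (prefix clustering and length sorting) can be sketched as a reordering pass before batching. This is an illustrative sketch only: the request shape (dicts with `media_id` and `prompt`) and the function name are assumptions, and chunked prefill is left to the inference engine.

```python
from collections import defaultdict


def order_for_prefix_cache(requests, batch_size):
    """Group requests by shared media, sort each group by prompt length,
    then cut the ordered stream into batches.

    `requests` is a list of dicts with hypothetical keys
    'media_id' (str) and 'prompt' (str).
    """
    groups = defaultdict(list)
    for req in requests:
        groups[req["media_id"]].append(req)

    ordered = []
    for media_id in sorted(groups):
        # Questions about the same image/video end up adjacent, so the
        # engine can reuse the KV cache for the shared media prefix.
        ordered.extend(sorted(groups[media_id], key=lambda r: len(r["prompt"])))

    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```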

Stage 2: Model -> Judge

Persist-First Strategy: Save model outputs to disk immediately, then score asynchronously.

1.2 Model Interface & Backends

Decoupled Design

v0.6 separates evaluation logic from model inference. Models implement a standardized LMM abstract base class:

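from abc import ABC, abstractmethod

from lmms_eval.api.instance import Instance


class LMM(ABC):
    @abstractmethod
    def generate_until(self, requests: list[Instance]) -> list[str]:
        """Batched text generation: one generated string per request."""

    @abstractmethod
    def loglikelihood(self, requests: list[Instance]) -> list[float]:
        """Token-level log probabilities for multiple-choice ranking."""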

The Instance object carries:

  • media_list: Raw images, video paths, or audio buffers
  • prompt: Text prompt
  • gen_kwargs: Generation parameters

Supported Backends

| Backend | Example |
| --- | --- |
| vLLM | python -m lmms_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-VL-7B |
| SGLang | python -m lmms_eval --model sglang --model_args pretrained=Qwen/Qwen2.5-VL-7B |
| API Models | OpenAI, Anthropic, Groq, etc. - see API Concurrency & Throughput |

Unified API Model Interfaces

v0.6 unifies API-backed model evaluation under two interfaces:

| Interface | --model name | Backend | Recommended |
| --- | --- | --- | --- |
| Async | async_openai | asyncio + AsyncOpenAI | Yes |
| Sync | openai | ThreadPoolExecutor + OpenAI | Fallback |

We recommend async_openai for all API-backed evaluation — it uses native async I/O and achieves significantly higher throughput.

Both resolve to chat mode by default via Model Registry V2. The simple mode (doc_to_visual + doc_to_text) is deprecated and will be removed in a future release.

Naming change in v0.6: the canonical model names have been shortened from openai_compatible / async_openai_compatible to openai / async_openai. These are the names used in filenames, registry keys, and @register_model decorators. The old names (openai_compatible, openai_compatible_chat, async_openai_compatible, async_openai_compatible_chat) continue to work as aliases via MODEL_ALIASES in __init__.py, so existing scripts are not affected.

Model Registry V2

v0.6 introduces ModelRegistryV2 — a unified model registry that replaces the previous ad-hoc import system. All model names (--model X) resolve through a single path.

How it works

Two dicts in lmms_eval/models/__init__.py declare available models:

  • AVAILABLE_SIMPLE_MODELS: maps model_id -> ClassName for simple (legacy) models in models/simple/
  • AVAILABLE_CHAT_TEMPLATE_MODELS: maps model_id -> ClassName for chat models in models/chat/

At startup, the registry merges both dicts into ModelManifest objects. Each manifest holds a model_id and up to two class paths (simple + chat). Class paths are auto-constructed: lmms_eval.models.{type}.{model_id}.{ClassName}, so the dict key must match the filename.

Resolution: chat is always preferred over simple (unless force_simple=True). This means --model openai transparently resolves to the chat implementation.

Aliasing: backward-compatible names are supported via MODEL_ALIASES in __init__.py and via ModelManifest.aliases. Old names like openai_compatible, openai_compatible_chat, async_openai_compatible, and async_openai_compatible_chat continue to work.
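The merge, resolution, and aliasing rules above can be sketched in a few lines. This is a simplified stand-in, not the actual ModelRegistryV2 code: the `Manifest` dataclass, the function names, the example class names, and the error raised when `force_simple=True` finds no simple implementation are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Manifest:
    """Simplified stand-in for ModelManifest (fields are illustrative)."""
    model_id: str
    simple_class_path: Optional[str] = None
    chat_class_path: Optional[str] = None


def build_manifests(simple_models: dict, chat_models: dict) -> dict:
    """Merge the two model dicts into one manifest per model_id."""
    manifests = {}
    for kind, table in (("simple", simple_models), ("chat", chat_models)):
        for model_id, class_name in table.items():
            manifest = manifests.setdefault(model_id, Manifest(model_id))
            # Path pattern from the text: lmms_eval.models.{type}.{model_id}.{ClassName}
            path = f"lmms_eval.models.{kind}.{model_id}.{class_name}"
            if kind == "simple":
                manifest.simple_class_path = path
            else:
                manifest.chat_class_path = path
    return manifests


def resolve(name: str, manifests: dict, aliases: dict, force_simple: bool = False) -> str:
    """Map a --model name (or alias) to a class path, preferring chat."""
    manifest = manifests[aliases.get(name, name)]
    if force_simple:
        if manifest.simple_class_path is None:
            raise ValueError(f"{manifest.model_id} has no simple implementation")
        return manifest.simple_class_path
    return manifest.chat_class_path or manifest.simple_class_path
```

Because aliases are resolved before the manifest lookup, an old name such as openai_compatible follows exactly the same chat-first path as the new canonical name.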

Simple mode deprecation: the simple model interface (doc_to_visual + doc_to_text) for API models is deprecated. New integrations should always use chat (doc_to_messages + ChatMessages). The simple implementations in models/simple/openai.py will be removed in a future release.

1.3 API Concurrency & Throughput

v0.6 adds adaptive concurrency for API-backed evaluation (async_openai, openai).

Adaptive Concurrency Control

The controller continuously adjusts in-flight request count using three online signals:

  • request failure rate
  • rate-limit hit rate (e.g., 429 / throttling)
  • p95 latency against a target latency budget

Execution also uses refill-style scheduling (no full-window barrier), so completed requests immediately release slots for new work.

For API models with repeated prompt prefixes, v0.6 also supports prefix-aware queueing to improve prefill-cache hit opportunities by dispatching same-prefix requests close together.
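One way to picture prefix-aware queueing is as a stable reordering keyed on a hash of the first N characters of each prompt. The sketch below assumes exactly that; the function name is hypothetical and the real dispatcher may order groups differently.

```python
import hashlib


def prefix_aware_order(prompts, prefix_hash_chars=256):
    """Return indices into `prompts` so that same-prefix requests are adjacent.

    Groups are emitted in order of first appearance, and the original
    order is preserved inside each group.
    """
    def prefix_hash(text):
        return hashlib.sha256(text[:prefix_hash_chars].encode("utf-8")).hexdigest()

    first_seen = {}
    keyed = []
    for idx, prompt in enumerate(prompts):
        h = prefix_hash(prompt)
        first_seen.setdefault(h, idx)      # remember where this prefix first appeared
        keyed.append((first_seen[h], idx))
    return [idx for _, idx in sorted(keyed)]
```

Returning indices rather than reordered prompts makes it straightforward to write results back in the original dataset order after dispatch.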

| Control | Meaning |
| --- | --- |
| adaptive_concurrency | Enable/disable adaptive mode |
| adaptive_min_concurrency | Lower bound for concurrency |
| adaptive_max_concurrency | Upper bound for concurrency |
| adaptive_target_latency_s | Target p95 latency budget |
| adaptive_increase_step | Additive growth step when healthy |
| adaptive_decrease_factor | Multiplicative decrease on pressure |
| adaptive_failure_threshold | Failure-rate threshold for backoff |
| retry_backoff_s | Retry sleep interval (separate from request timeout) |
| prefix_aware_queue | Group dispatch by prefix hash |
| prefix_hash_chars | Prefix length used for hashing |
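To show how these controls could interact, here is a minimal additive-increase / multiplicative-decrease sketch. It is not the shipped controller. Two points are assumptions: this document does not say whether adaptive_increase_step is an absolute count or a fraction, so the sketch treats it as a fraction of the current limit (at least +1); and it applies the same threshold to rate-limit hits as to failures. It also omits the hysteresis and minimum-sample checks that the release notes describe elsewhere.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveController:
    """Illustrative AIMD-style concurrency controller (hypothetical class)."""
    min_concurrency: int = 1
    max_concurrency: int = 64
    target_latency_s: float = 15.0
    increase_step: float = 0.15      # assumed: fraction of the current limit
    decrease_factor: float = 0.75
    failure_threshold: float = 0.05
    limit: int = 16                  # current in-flight request limit

    def update(self, failure_rate: float, rate_limit_rate: float, p95_latency_s: float) -> int:
        """Adjust the limit from the three online signals and return it."""
        under_pressure = (
            failure_rate > self.failure_threshold
            or rate_limit_rate > self.failure_threshold  # assumed shared threshold
            or p95_latency_s > self.target_latency_s
        )
        if under_pressure:
            new_limit = int(self.limit * self.decrease_factor)
        else:
            new_limit = self.limit + max(1, round(self.limit * self.increase_step))
        # Clamp to the configured bounds.
        self.limit = max(self.min_concurrency, min(self.max_concurrency, new_limit))
        return self.limit
```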

Example (sync API backend):

python -m lmms_eval \
  --model openai \
  --model_args model_version=<model>,num_concurrent=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256

Example (async API backend, recommended):

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=<model>,num_cpus=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256

Performance Snapshot

To make the performance claim auditable, we keep a concrete benchmark trail in this repo:

  • Historical comparison file: logs/openrouter_molmo_throughput/throughput_comparison.csv
  • Latest-vs-previous comparison file: logs/openrouter_molmo_throughput/throughput_comparison_latest_vs_prev.csv

Benchmark setup used for this snapshot:

  • Task: mme
  • Limit: 100
  • Model backend: openai / async_openai API path
  • API endpoint family: OpenRouter-compatible
  • Model: bytedance-seed/seed-1.6-flash
  • Baseline control: static single concurrency (num_concurrent=1)
  • Latest control: adaptive + refill scheduling + prefix-aware queueing + explicit retry backoff

Result summary (requests_per_sec):

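| Run Type | Concurrency | RPS | Wall Time (s) | Relative to Baseline |
| --- | --- | --- | --- | --- |
| baseline | 1 | 0.327836 | 305.030740 | 1.00x |
| static | 24 | 1.926987 | 51.894473 | 5.88x |
| adaptive (v1) | 16 | 2.404706 | 41.585121 | 7.33x |
| adaptive (v2) | 16 | 2.458435 | 40.676279 | 7.50x |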

Interpretation:

  • The latest API control path reaches about 7.5x throughput over baseline on the same LIMIT=100 setup.
  • Compared to the previous adaptive run (v1), the latest adaptive run (v2) still improves (2.4047 -> 2.4584 req/s, +2.23%). This is a small but measurable delta in a noisy environment (shared network + provider-side scheduling), so the right takeaway is not "a new ceiling", but "less overhead and better utilization under the same constraints."
  • The core point: this speedup is not from changing benchmark difficulty. We keep the same task (mme), model (bytedance-seed/seed-1.6-flash), limit (100), and evaluation prompts/settings. The gain comes from changes in the API request scheduling/control path.
  • What adaptive (v2) means in practice:
    • Refill scheduling (no window barrier): maintain a steady pool of in-flight requests and immediately dispatch new work as soon as a request completes. This reduces idle gaps and prevents the slowest request in a window from gating progress.
    • Rolling controller updates: adjust concurrency based on a rolling batch of completions (failure rate, rate-limit hits, and p95 latency vs target) rather than only after fixed windows. This makes the controller more responsive and less sensitive to outliers.
    • Hysteresis for stability: use separate "reduce" vs "increase" conditions (and minimum sample thresholds) to avoid oscillating on a single transient 429 or a brief latency spike.
    • Retry/backoff decoupling: retry_backoff_s is explicitly separate from request timeout, so retries don't sleep for long timeouts and tie up worker slots.
    • Prefix-aware queueing (when enabled): reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (Some routing layers may dilute this benefit; the mechanism is still safe.)
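Refill scheduling, the first item above, is easy to demonstrate with asyncio: a semaphore caps in-flight requests, and each completion frees a slot immediately instead of waiting for a whole window to finish. The sketch below is a generic illustration under that assumption; `run_with_refill` and `worker` are hypothetical names, not lmms-eval APIs.

```python
import asyncio


async def run_with_refill(items, worker, max_in_flight):
    """Run `worker(item)` for every item with at most `max_in_flight`
    coroutines active at once. Results come back in input order."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        # A slot is released the moment this request finishes, so the next
        # queued request starts right away (no full-window barrier).
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```

With a windowed scheduler, one slow request would hold back the next batch; here it only occupies a single slot while the remaining slots keep cycling.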

1.4 Data Layer

Storage Format

| Format | Random Access | Media Handling | Use in v0.6 |
| --- | --- | --- | --- |
| JSON/Files | O(N) scan | External files | Legacy |
| Parquet | Row-group decompress | Binary blobs | Primary format |

Parquet: Stores task metadata (questions, answers, splits). Supports projection pushdown, so only the required columns are read. It is also the underlying on-disk format for HF Datasets.

TODO: Optimize for high-throughput multimodal access (e.g., Lance or similar columnar storage for images/videos).

1.5 Evaluation as a Service

To integrate evaluation into training workflows, v0.6 provides a disaggregated HTTP service architecture. Implementation: lmms-engine#127

┌─────────────────┐          ┌─────────────────┐           ┌─────────────────┐
│  Training Loop  │ ──POST──▶│   Eval Server   │ ──queue──▶│   Job Worker    │
│   (any host)    │◀──poll── │   (FastAPI)     │◀──result──│   (GPU node)    │
└─────────────────┘          └─────────────────┘           └─────────────────┘

Key benefit: Training continues while evaluation runs asynchronously on separate resources.

Server API

| Endpoint | Method | Description |
| --- | --- | --- |
| /evaluate | POST | Submit evaluation job (model, tasks, config) |
| /jobs/{job_id} | GET | Query job status and results |
| /queue | GET | View pending/running/completed jobs |
| /tasks | GET | List available evaluation tasks |
| /models | GET | List supported model backends |

Client Usage

from lmms_eval.entrypoints import EvalClient

client = EvalClient("http://eval-server:8000")

# Submit evaluation (non-blocking)
job = client.evaluate(
    model="qwen2_5_vl",
    tasks=["mmmu_val", "mme"],
    model_args={"pretrained": "Qwen/Qwen2.5-VL-7B-Instruct"},
)

# Continue training...
# Later, retrieve results
result = client.wait_for_job(job["job_id"])

The server uses a JobScheduler that queues requests and processes them sequentially, ensuring proper GPU resource management without conflicts.


2. Statistical Analysis

2.1 Why Statistical Analysis?

Current leaderboards rank models by mean accuracy without uncertainty quantification. As Anthropic's research demonstrates, this is fundamentally flawed:

| Problem | Example | Consequence |
| --- | --- | --- |
| Scores are estimates, not truth | 85% on 1000 questions ≠ 85% true capability | False confidence in rankings |
| Small differences are noise | 85.2% vs 85.5% is statistically insignificant | Wasted effort chasing noise |
| Correlated questions inflate precision | 10 questions per video ≠ 10 independent samples | Underestimated uncertainty |

The fix: Treat evaluation as a sampling experiment. Report confidence intervals. Use clustered standard errors. Compare models with paired tests.

What v0.6 Adds

| Feature | Purpose |
| --- | --- |
| Standard Error (SE) | Quantify uncertainty of single model score |
| Confidence Intervals | Report score ± margin, not point estimate |
| Clustered SE | Correct for correlated questions (same video/image) |
| Paired Comparison | Detect small differences by removing question difficulty variance |
| Model Stability | Measure inherent variance under standard settings |

2.2 Standard Error Estimation

Independent Samples

For binary metrics (pass/fail), SE simplifies to:

SE = \sqrt{\frac{p(1-p)}{n}}

where p = accuracy, n = number of questions.

Key insight: SE ∝ 1/\sqrt{n}. To halve uncertainty -> 4× more questions.

Output format: score ± 1.96 × SE (95% CI)
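As a worked example of the formula, the snippet below computes the SE and 95% interval for 85% accuracy on 1000 questions, the case used in the table in 2.1. The helper name `binomial_ci` is illustrative, not a framework function.

```python
import math


def binomial_ci(accuracy: float, n: int, z: float = 1.96):
    """Return (se, low, high) for a binary metric with n independent questions."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    margin = z * se
    return se, accuracy - margin, accuracy + margin


se, low, high = binomial_ci(0.85, 1000)
# SE is roughly 0.0113, so the 95% CI is roughly 85.0% ± 2.2%.
print(f"SE={se:.4f}  95% CI=[{low:.3f}, {high:.3f}]")
```

With a margin of about ±2.2 points, a gap such as 85.2% vs 85.5% sits well inside the interval, which is why such differences should not be read as a ranking.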

Clustered Samples

Problem: Multiple questions per video/image are correlated, not independent.

Solution: Specify cluster_key -> system applies cluster-robust SE correction.

task: videomme
cluster_key: video_id

Clustered SE can be 3× larger than naive estimates.
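The effect of clustering can be sketched with a standard cluster-robust estimator for the mean, which sums residuals within each cluster before squaring. This is an assumption about the general technique, without small-sample corrections; the correction lmms-eval actually applies may differ in detail, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def naive_se(scores):
    """SE of the mean assuming every question is independent."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores)) / n


def clustered_se(scores, cluster_ids):
    """Cluster-robust SE of the mean: residuals are summed per cluster
    (e.g., per video_id) before squaring, so within-cluster correlation
    widens the estimate."""
    n = len(scores)
    mean = sum(scores) / n
    cluster_sums = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        cluster_sums[cluster] += score - mean
    return math.sqrt(sum(v * v for v in cluster_sums.values())) / n
```

When every question forms its own cluster the two estimates coincide; when all questions about a video are answered the same way, the clustered SE grows by the square root of the cluster size.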

2.3 Model Comparison

Problem: Checking if confidence intervals overlap is low-power.

Solution: Paired test — compute per-question difference d_i = score_A - score_B, then test if mean(d) ≠ 0.

Why: Removes question difficulty variance (dominant noise), isolates model difference signal.
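A minimal version of the paired test looks like this. It uses a normal approximation for the p-value, which is an assumption of the sketch (the framework may use a t-test or a clustered variant), and `paired_difference` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def paired_difference(scores_a, scores_b, z=1.96):
    """Per-question paired comparison of two models on the same questions.

    Returns the mean difference, a 95% CI, and a two-sided p-value
    from a normal approximation.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)   # SE of the mean difference
    if se == 0.0:
        p_value = 1.0 if mean_diff == 0 else 0.0
    else:
        p_value = math.erfc(abs(mean_diff / se) / math.sqrt(2))
    return {
        "mean_diff": mean_diff,
        "ci": (mean_diff - z * se, mean_diff + z * se),
        "p_value": p_value,
    }
```

Questions that both models get right (or both get wrong) contribute a difference of zero, so shared question difficulty drops out and only the disagreements drive the test.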

Baseline-Anchored Evaluation

A practical application of paired comparison: anchor evaluations to a standard baseline model (e.g., Gemini 3.0 Pro).

| Approach | Report | Assessment |
| --- | --- | --- |
| Absolute score | "Our model: 78.3%" | Meaningless without context |
| Leaderboard rank | "#3 on MMMU" | Rank doesn't quantify gap |
| Paired difference | "+2.1% vs Gemini 3.0 Pro (p<0.01)" | Statistically grounded claim |

Benefits:

  • Reproducible claims: "We beat baseline X by Y%" is verifiable
  • Training signal: Track improvement over baseline across checkpoints
  • Publication-ready: Statistical significance replaces hand-waving

# Example: Compare your model against Gemini 3.0 Pro baseline
results = paired_comparison(
    model_a="your_model",
    model_b="gemini-3.0-pro",  # Standard baseline
    tasks=["mmmu_val", "mathvista"],
)
# Output: mean_diff=+2.1%, CI=[+0.8%, +3.4%], p=0.003

2.4 Power Analysis

Purpose: Minimum sample size to detect a given effect (e.g., 2% improvement).

Rule of thumb: Reliable benchmarks need n > 1000 questions.
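For intuition, the textbook sample-size formula for comparing two independent proportions can be evaluated directly. The sketch assumes an unpaired comparison, both models near the same accuracy p, a two-sided test at α = 0.05, and 80% power; `required_n` is a hypothetical helper, not a framework function.

```python
import math
from statistics import NormalDist


def required_n(p: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Questions needed per model to detect an accuracy gap of `delta`
    between two independently evaluated models with accuracy near `p`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2)


# Detecting a 2-point gap around 80% accuracy needs several thousand questions.
print(required_n(p=0.80, delta=0.02))
```

Under these assumptions a 2% effect needs on the order of six thousand questions, and halving the effect size quadruples the requirement. Paired designs need fewer questions when per-question scores are positively correlated, which is one more reason to prefer the paired comparison in 2.3.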

2.5 Model Stability Measurement

Why Measure Variance?

Two models with 80% accuracy can behave differently:

  • Model A: Answers consistently (same questions right/wrong each run)
  • Model B: Answers randomly (different results each run)

Model A is more reliable. Model B's accuracy is "luck."

Goal: Measure model's inherent stability under standard settings (temp=0.7).

Law of Total Variance

Var(Score) = \underbrace{Var_{within}}_{\text{Model instability}} + \underbrace{Var_{between}}_{\text{Question difficulty}}

The first term measures model stability — lower is better.

Protocol

Run N samples per question (temp=0.7), report:

| Metric | Meaning |
| --- | --- |
| Expected Accuracy (EA) | Mean accuracy across all N samples |
| Consensus Accuracy (CA) | Accuracy after majority vote |
| Internal Variance (IV) | Model instability — lower is better |
| Consistency Rate (CR) | % questions with same answer across N runs |
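The four metrics can be computed from N sampled answers per question roughly as follows. The exact definition of IV is not spelled out here, so the sketch assumes it is the within-question variance of correctness, p(1-p), averaged over questions; the function name and input shape are also hypothetical.

```python
from collections import Counter


def stability_metrics(samples):
    """`samples` is a list of (answers, ground_truth) pairs, where
    `answers` holds the N sampled answers for one question."""
    total_correct = total_samples = 0
    consensus_correct = consistent = 0
    variance_sum = 0.0
    for answers, truth in samples:
        n = len(answers)
        correct = sum(a == truth for a in answers)
        p = correct / n
        total_correct += correct
        total_samples += n
        variance_sum += p * (1 - p)                      # within-question variance
        majority, _ = Counter(answers).most_common(1)[0]
        consensus_correct += majority == truth           # majority-vote accuracy
        consistent += len(set(answers)) == 1             # identical answer every run
    q = len(samples)
    return {
        "EA": total_correct / total_samples,
        "CA": consensus_correct / q,
        "IV": variance_sum / q,
        "CR": consistent / q,
    }
```

A question that is always answered wrong contributes zero to IV but lowers both EA and CA, which matches the "confidently wrong" pattern in the question-level diagnostics.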

Example Output

┌─────────────┬─────┬─────┬───────┐
│ Model       │ EA  │ CA  │  IV   │
├─────────────┼─────┼─────┼───────┤
│ Model A     │ 80% │ 82% │ 0.05  │  ← Stable
│ Model B     │ 80% │ 81% │ 0.15  │  ← Unstable
└─────────────┴─────┴─────┴───────┘

Same expected accuracy, but Model A's internal variance is 3× lower, so it is the more stable model.

Question-Level Diagnostics

| Pattern | Possible Cause |
| --- | --- |
| High IV across all models | Ambiguous question |
| High IV for one model | Model-specific weakness |
| Zero IV, always wrong | Confidently wrong knowledge |

3. Evaluating Multimodal Models in 2026 (TODO)

Note: This section outlines planned evaluation features. Implementation is in progress.

3.1 More features are expected

Static image QA benchmarks are saturating. Building multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:

| Capability | Challenge | Current Gap |
| --- | --- | --- |
| Long video understanding | 10min+ videos, 1000+ frames | Most benchmarks use <128 frames |
| High motion | Object motion at 30fps | Sparse sampling loses fine-grained actions |
| Spatial reasoning | 3D world understanding | 2D perception ≠ physical grounding |
| Agentic interaction | Multi-step task execution and feedback | Static QA can't measure planning/tool use well |

Key insight: These capabilities require in-environment evaluation: the model must interact with simulators, receive feedback, and adapt. Static input-output pairs cannot capture this.

3.2 Long Video & High Frame Rate

The Scale Problem

| Scenario | Frames | Tokens (est.) | Challenge |
| --- | --- | --- | --- |
| 1min video @ 1fps | 60 | ~60K | Fits context |
| 10min video @ 1fps | 600 | ~600K | Exceeds most context windows |
| 1min video @ 30fps | 1800 | ~1.8M | Memory explosion |

Streaming metrics:

  • Event detection latency: Time from event occurrence to model detection
  • Memory efficiency: Performance vs. KV cache size
  • Graceful degradation: Accuracy when forced to evict old context

3.3 Spatial Intelligence

Spatial intelligence benchmarks are needed to evaluate the model's ability to reason in real-world scenarios.

3.4 Agentic Evaluation in Simulators

Interaction-based simulators are needed to evaluate agentic capabilities, instead of relying on static benchmarks.

4. Migration Notes

4.1 OpenAI Judge Model Deprecation

Many tasks in lmms-eval use OpenAI models as LLM judges for scoring (e.g., evaluating free-form answers, computing GPT-based metrics). These judge model names are hardcoded as defaults in individual task utils.py files and YAML configs.

As OpenAI deprecates older model versions, users may need to switch to newer models. Below is the official deprecation timeline and our recommended migration mapping.

Official Deprecation Timeline

Source: OpenAI Deprecations (last checked: 2026-02-18)

GPT-3.5 series

| Model | Shutdown Date | Status | Recommended Replacement |
| --- | --- | --- | --- |
| gpt-3.5-turbo-0301 | 2024-09-13 | Already shut down | gpt-3.5-turbo |
| gpt-3.5-turbo-0613 | 2024-09-13 | Already shut down | gpt-3.5-turbo |
| gpt-3.5-turbo-16k-0613 | 2024-09-13 | Already shut down | gpt-3.5-turbo |
| gpt-3.5-turbo-instruct | 2026-09-28 | Deprecated | gpt-5-mini or gpt-4.1-mini |
| gpt-3.5-turbo-1106 | 2026-09-28 | Deprecated | gpt-5-mini or gpt-4.1-mini |

GPT-4 series

| Model | Shutdown Date | Status | Recommended Replacement |
| --- | --- | --- | --- |
| gpt-4-vision-preview | 2024-12-06 | Already shut down | gpt-4o |
| gpt-4-1106-vision-preview | 2024-12-06 | Already shut down | gpt-4o |
| gpt-4-32k / gpt-4-32k-0613 | 2025-06-06 | Already shut down | gpt-4o |
| gpt-4-32k-0314 | 2025-06-06 | Already shut down | gpt-4o |
| gpt-4-0314 | 2026-03-26 | Deprecated | gpt-5 or gpt-4.1 |
| gpt-4-1106-preview | 2026-03-26 | Deprecated | gpt-5 or gpt-4.1 |
| gpt-4-0125-preview | 2026-03-26 | Deprecated | gpt-5 or gpt-4.1 |
| gpt-4.5-preview | 2025-07-14 | Already shut down | gpt-4.1 |

GPT-4o series (audio/realtime variants)

| Model | Shutdown Date | Status | Recommended Replacement |
| --- | --- | --- | --- |
| gpt-4o-audio-preview-2024-10-01 | 2025-10-10 | Already shut down | gpt-audio |
| chatgpt-4o-latest | 2026-02-17 | Deprecated | gpt-5.1-chat-latest |
| gpt-4o-audio-preview | 2026-03-24 | Deprecated | gpt-audio |
| gpt-4o-mini-audio-preview | 2026-03-24 | Deprecated | gpt-audio-mini |
| gpt-4o-realtime-preview | 2026-03-24 | Deprecated | gpt-realtime |
| gpt-4o-mini-realtime-preview | 2026-03-24 | Deprecated | gpt-realtime-mini |

Note: gpt-4o and gpt-4o-mini (the base chat models) do not have announced deprecation dates as of 2026-02-18. They remain current models.

When a judge model used by a task reaches its shutdown date, override it with a replacement:

| Current Default in Code | Recommended Replacement | Notes |
| --- | --- | --- |
| gpt-4o / gpt-4o-2024-* | gpt-5-mini | Cost-efficient GPT-5 variant, direct successor |
| gpt-4o-mini | gpt-5-nano | Cheapest GPT-5 variant |
| gpt-3.5-turbo / gpt-3.5-turbo-* | gpt-5-nano | Legacy model, cheapest replacement |
| gpt-4o-audio-preview | gpt-audio | Dedicated audio model (shutdown 2026-03-24) |
| gpt-4.1 / gpt-4.1-mini | gpt-5-mini / gpt-5-nano | GPT-5 series supersedes 4.1 |

Why we keep the old defaults in code: Task configs retain their original model references to preserve reproducibility of published results. Changing defaults would silently alter scoring behavior for existing benchmarks.

How to override: Most tasks read the judge model from environment variables. Set MODEL_VERSION (or the task-specific variable) to use a newer model:

export MODEL_VERSION="gpt-5-mini"
python -m lmms_eval --model qwen2_5_vl --tasks mme --limit 5

Some tasks use different environment variable names (e.g., BABYVISION_MODEL_NAME, VIESCORE_MODEL_NAME, JUDGE_MODEL_NAME). Check the relevant task's utils.py for the exact variable name.


References

Core: Statistical Evaluation Framework

A Statistical Approach to Model Evaluations. Miller et al., Anthropic, 2024.

The theoretical foundation for Section 2. Key contributions:

  • Treating evaluations as sampling experiments
  • Standard errors and confidence intervals for LLM benchmarks
  • Clustered standard errors for correlated questions
  • Paired difference analysis for model comparison