lmms-eval

Introduction

LMMs-Eval Documentation

This documentation covers every layer of lmms-eval — from running your first evaluation to adding custom models and tasks. The framework evaluates large multimodal models across image, video, and audio benchmarks with a single unified pipeline.


How the Evaluation Pipeline Works

Every evaluation follows the same six-stage pipeline. Each stage has dedicated documentation, and failures at any stage produce clear error messages indicating what went wrong.

User input: --model openai --tasks mmmu_val,video_mmmu,longvideobench_val_v


    ┌─ CLI Parsing ─────────────── commands.md
    │   Parse flags, resolve config, dispatch to evaluator

    ├─ Model Resolution ────────── model_guide.md
    │   Map model name -> Python class, validate chat/simple type

    ├─ Task Loading ────────────── task_guide.md
    │   Discover YAML configs, build prompts, load datasets

    ├─ Model Inference ─────────── run_examples.md
    │   Send prompts to model, collect responses

    ├─ Response Caching ────────── caching.md
    │   Store deterministic responses, skip redundant calls

    └─ Metric Aggregation ──────── throughput_metrics.md
        Score responses, compute task-level metrics, write output

A minimal evaluation runs the entire pipeline with a single command. This example evaluates GPT-4.1-mini on MMMU via the OpenAI-compatible API:

export OPENAI_API_KEY="your-api-key"

python -m lmms_eval \
  --model openai \
  --model_args model_version=gpt-4.1-mini \
  --tasks mmmu_val \
  --batch_size 1 \
  --limit 8

The same pattern works for any OpenAI-compatible endpoint (OpenRouter, Azure, local vLLM/SGLang servers). To evaluate across image and video tasks together:

python -m lmms_eval \
  --model openai \
  --model_args model_version=gpt-4.1-mini \
  --tasks mmmu_val,video_mmmu,longvideobench_val_v \
  --batch_size 1 \
  --limit 8 \
  --log_samples \
  --output_path ./results/

Why it's Efficient and Trustworthy

The pipeline above is designed to be fast enough to iterate on and rigorous enough to trust. This section summarizes the concrete mechanisms that back those claims. Each links to deeper documentation.

Efficient

Evaluation should not be the bottleneck. Four layers of optimization keep GPUs and API endpoints saturated end to end.

| Layer | Mechanism | Impact |
| --- | --- | --- |
| API throughput | Adaptive concurrency control with refill scheduling, prefix-aware queueing, and retry/backoff decoupling. | ~7.5x throughput over v0.5 on fixed benchmarks (v0.6 release notes). |
| Response caching | SQLite + JSONL write-ahead log stores deterministic responses (temperature=0, do_sample=False); subsequent runs skip inference entirely for cached samples. | Zero redundant model calls on repeated or resumed runs (caching guide). |
| Video I/O | TorchCodec multi-threaded decode replaces single-threaded PyAV; Lance-backed blob storage on Hugging Face enables single-IOP random access per video. | Up to 3.58x faster frame decode; eliminates full-table scans (v0.7 release notes). |
| Prefix KV reuse | Requests are clustered by shared media and sorted by length so vLLM/SGLang can maximize KV cache hits across a batch. | Fewer redundant prefill computations on shared-image/video tasks. |
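The prefix-reuse ordering above can be sketched in a few lines. This is an illustrative reconstruction, not lmms-eval's actual scheduler: the `Request` shape and `media_key` field are assumptions for the example.

```python
# Hypothetical sketch of prefix-aware request ordering: cluster requests that
# share the same media (so their prompt prefixes match), then sort each
# cluster by prompt length so the serving engine sees similar-length runs.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    media_key: str   # e.g. a hash of the shared image/video bytes
    prompt: str

def order_for_kv_reuse(requests):
    clusters = defaultdict(list)
    for r in requests:
        clusters[r.media_key].append(r)
    ordered = []
    # Emit one cluster at a time so consecutive requests share a prefix,
    # letting vLLM/SGLang reuse the cached KV entries for that prefix.
    for key in sorted(clusters):
        ordered.extend(sorted(clusters[key], key=lambda r: len(r.prompt)))
    return ordered
```

Grouping before sorting matters: a global length sort would interleave media and evict shared prefixes from the KV cache between requests.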

The async pipeline decouples model inference from metric scoring. Model outputs are persisted immediately, and scoring runs independently - so a crash after inference does not lose results, and judge-stage work (exact match, LLM-as-judge) can run on a separate machine or at a later time.

Trustworthy

A benchmark score is only useful if it is reproducible and statistically grounded. The framework provides four categories of trust guarantees.

Reproducibility. Every run is seeded (--seed controls Python random, NumPy, and PyTorch) and every task config is fingerprinted. The cache key includes a SHA-256 of the schema version, task YAML, generation kwargs, and document ID - so any change to the prompt, parameters, or task definition automatically invalidates stale results. The test suite includes prompt stability snapshots for 8 classic benchmarks to catch unintended prompt regressions.
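The fingerprinting idea can be made concrete with a small sketch. This assumes the key is a SHA-256 over the fields listed above; the exact field names and serialization are assumptions, not the framework's actual schema.

```python
# Illustrative content-addressed cache key: any change to the task YAML,
# generation kwargs, or schema version produces a different key, so stale
# cached results are never reused. Field names here are assumptions.
import hashlib
import json

def cache_key(schema_version: str, task_yaml: str, gen_kwargs: dict, doc_id: str) -> str:
    payload = json.dumps(
        {
            "schema": schema_version,
            "task_yaml": task_yaml,
            "gen_kwargs": gen_kwargs,
            "doc_id": doc_id,
        },
        sort_keys=True,  # stable serialization -> stable key across runs
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys before hashing is the important detail: Python dict order is insertion order, so an unsorted dump could yield different keys for semantically identical kwargs.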

Statistical rigor. Point estimates hide uncertainty. The framework reports confidence intervals and supports clustered standard errors for benchmarks with correlated questions (e.g., multiple questions per video). Paired comparison with a baseline model removes question-difficulty variance, isolating the actual model difference with a p-value - replacing hand-waving with verifiable claims. Power analysis tools help determine the minimum sample size needed to detect a given improvement. See the v0.6 release notes for the full statistical methodology.
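To see why intervals matter, here is a minimal sketch of a 95% normal-approximation confidence interval for a benchmark accuracy. The framework's actual methodology (clustered standard errors, paired tests) is richer; this only illustrates the basic idea.

```python
# Normal-approximation 95% CI for an accuracy estimate: p +/- z * sqrt(p(1-p)/n).
# A simplified illustration, not lmms-eval's full statistical machinery.
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)
```

For example, 430/500 correct gives an estimate of 0.86 with an interval of roughly ±0.03, wide enough to swallow many claimed "improvements" between models evaluated on only a few hundred samples.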

Cache integrity. Responses are validated before storage: None, empty strings, and malformed loglikelihood tuples are rejected. The JSONL write-ahead log is fsynced before the SQLite upsert, so a crash between the two writes is recovered on the next startup. Per-rank files prevent write contention in distributed runs. Details in the caching guide.
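The log-then-upsert ordering can be sketched as follows, under the assumption that each response is appended to a JSONL log and fsynced before the SQLite write. Table and file layouts here are illustrative, not the actual cache schema.

```python
# Sketch of the write-ahead pattern: durable JSONL append first, SQLite upsert
# second. A crash between the two writes is recovered by replaying the log,
# and the upsert makes replays idempotent.
import json
import os
import sqlite3

def store_response(db: sqlite3.Connection, wal_path: str, key: str, response: str):
    # 1) Durable append to the JSONL write-ahead log.
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(json.dumps({"key": key, "response": response}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2) Only then upsert into SQLite.
    db.execute(
        "INSERT INTO cache(key, response) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET response = excluded.response",
        (key, response),
    )
    db.commit()

def recover(db: sqlite3.Connection, wal_path: str):
    # Replay every logged entry on startup; already-present rows are rewritten
    # with identical values, so recovery is safe to run unconditionally.
    if not os.path.exists(wal_path):
        return
    with open(wal_path, encoding="utf-8") as wal:
        for line in wal:
            entry = json.loads(line)
            db.execute(
                "INSERT INTO cache(key, response) VALUES (?, ?) "
                "ON CONFLICT(key) DO UPDATE SET response = excluded.response",
                (entry["key"], entry["response"]),
            )
    db.commit()
```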

Clean scoring. Reasoning models (Qwen3-VL, DeepSeek-R1, QwQ) emit <think>...</think> blocks that must not leak into metric computation. The pipeline strips reasoning tags before scoring and preserves the raw output in a separate resps field for analysis. This is configured globally via --reasoning_tags or per-task in the YAML config. See the commands guide for usage.
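A simplified version of this stripping step looks like the following. The actual tag list is configurable via --reasoning_tags; the single hard-coded tag here is an illustration.

```python
# Remove <think>...</think> spans before scoring, keeping only the final
# answer text. DOTALL lets the reasoning block span multiple lines; the
# non-greedy quantifier handles multiple blocks in one response.
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(raw: str) -> str:
    return THINK_RE.sub("", raw).strip()
```

The raw, unstripped text would still be persisted separately (the resps field described above), so the reasoning trace remains available for analysis.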

Getting Started

Start here if this is your first time using lmms-eval.

| Guide | What You Learn |
| --- | --- |
| Quick Start | Clone, install, and run your first evaluation in 5 minutes. |
| Commands Guide | Every CLI flag explained — model selection, task filtering, batching, caching, output control, and seed management. |
| Run Examples | Copy-paste commands for LLaVA, Qwen, InternVL, VILA, GPT-4o, and other models across image, video, and audio tasks. |

Extending the Framework

These guides walk through adding your own models and tasks.

Adding a Model

The Model Guide covers the full process: subclass lmms_eval.api.model.lmms, implement generate_until, and register a ModelManifest. Chat models (recommended) receive structured messages with roles and typed content. Simple models (legacy) receive plain text with <image> placeholders.

from lmms_eval.api.registry import register_model
from lmms_eval.api.model import lmms

@register_model("my_model")
class MyModel(lmms):
    is_simple = False  # chat model

    def generate_until(self, requests):
        for request in requests:
            doc_to_messages, gen_kwargs, doc_id, task, split = request.args
            messages = doc_to_messages(self.task_dict[task][split][doc_id])
            # ... run inference and store result ...

Adding a Task

The Task Guide explains the YAML configuration format. Each task defines a dataset source, prompt template, generation parameters, and scoring function. The simplest tasks require only a YAML file; complex tasks add a utils.py with custom prompt formatting and metric computation.

task: "my_benchmark"
dataset_path: "my-org/my-dataset"
test_split: test
output_type: generate_until
doc_to_messages: !function utils.my_doc_to_messages
process_results: !function utils.my_process_results
generation_kwargs:
  max_new_tokens: 1024
  temperature: 0
metric_list:
  - metric: acc
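The YAML above delegates prompt building and scoring to a utils.py. A hypothetical sketch of those two functions follows; the dataset field names ("question", "image", "answer") and the message schema are assumptions for illustration, so consult the Task Guide for the exact contract.

```python
# Hypothetical utils.py accompanying the task YAML above.

def my_doc_to_messages(doc):
    # Build a chat-style message list from one dataset row: the image first,
    # then the question text, in a single user turn.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": doc["image"]},
                {"type": "text", "text": doc["question"]},
            ],
        }
    ]

def my_process_results(doc, results):
    # results[0] is the model's generated text; score exact match against
    # the gold answer and report it under the "acc" metric from metric_list.
    prediction = results[0].strip()
    return {"acc": float(prediction == doc["answer"])}
```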

Using lmms-eval as a Library

The External Usage guide covers three access patterns beyond the standard CLI:

CLI task browsing lists registered tasks, groups, and model backends without downloading datasets:

lmms-eval tasks subtasks     # table of all leaf tasks with YAML paths
lmms-eval models --aliases   # all model backends with alias mappings

Web UI provides a browser-based interface for configuring and launching evaluations:

uv run lmms-eval-ui          # opens browser, requires Node.js 18+

Python API gives programmatic access to tasks, datasets, and the evaluator:

from lmms_eval import evaluator

results = evaluator.simple_evaluate(
    model="openai",
    model_args="model_version=gpt-4.1-mini",
    tasks=["mmmu_val", "video_mmmu", "longvideobench_val_v"],
    batch_size=1,
    limit=8,
)

Performance and Caching

| Guide | What You Learn |
| --- | --- |
| Caching | SQLite-backed response cache for deterministic requests. Store, replay, merge shards across distributed ranks, and recover from crashes via the JSONL audit log. |
| Throughput Metrics | Inference timing metrics logged by chat models — end-to-end latency, time to first token, tokens per second, and batch-level summaries. |
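The timing metrics named above can be derived from three per-request timestamps. This is an illustrative computation; the fields lmms-eval actually logs may differ.

```python
# Derive latency, TTFT, and decode throughput from request timestamps.
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float        # request sent
    first_token: float  # first token received
    end: float          # last token received
    n_tokens: int       # generated tokens

    @property
    def e2e_latency(self) -> float:
        return self.end - self.start

    @property
    def ttft(self) -> float:
        return self.first_token - self.start

    @property
    def tokens_per_second(self) -> float:
        # Decode-phase throughput: tokens after the first, over decode time,
        # so TTFT (dominated by prefill) does not dilute the rate.
        decode_time = self.end - self.first_token
        return (self.n_tokens - 1) / decode_time if decode_time > 0 else 0.0
```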

The response cache stores only deterministic requests (temperature=0, do_sample=False). Enable it with --use_cache ./eval_cache to skip redundant model calls on repeated runs:

python -m lmms_eval \
  --model openai \
  --model_args model_version=gpt-4.1-mini \
  --tasks mmmu_val,video_mmmu \
  --use_cache ./eval_cache
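The deterministic-only rule above amounts to a small predicate on the generation kwargs. The function name and defaults here are illustrative, not the framework's actual check.

```python
# A request is cacheable only if its output is deterministic: sampling must be
# off and temperature must be zero. Anything else could legitimately produce
# different outputs on a rerun, so caching it would be incorrect.
def is_cacheable(gen_kwargs: dict) -> bool:
    if gen_kwargs.get("do_sample", False):
        return False
    return gen_kwargs.get("temperature", 0) == 0
```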

Task Catalog

The Current Tasks page lists every registered evaluation task across all modalities. The framework ships with 100+ tasks across five categories:

| Category | Benchmark | Task Name | What It Tests |
| --- | --- | --- | --- |
| Image | MMMU | mmmu_val | College-level multimodal reasoning across 30 subjects. |
| | MME | mme | Perception and cognition across 14 subtasks. |
| | MathVista | mathvista_testmini | Mathematical reasoning with visual context. |
| | MMBench | mmbench_en | Multi-ability benchmark covering 20 fine-grained skills. |
| Video | Video-MMMU | video_mmmu | Knowledge acquisition from multi-discipline professional videos. |
| | VideoMME | videomme | Comprehensive video understanding across diverse content types. |
| | EgoSchema | egoschema | Long-form egocentric video reasoning. |
| Long Video | LongVideoBench | longvideobench_val_v | Extended video content with temporal reasoning. |
| Audio | AIR-Bench | air_bench_chat | Audio understanding across speech, music, and sound. |
| | LibriSpeech | librispeech | Automatic speech recognition accuracy. |
| Agentic | Vending-Bench 2 | vending_bench2 | Multi-step tool use with a deterministic vending simulator (generate_until_agentic). |
| | τ2-Bench | tau2_bench_telecom | Telecom-domain tool use with state tracking and a <tool_call>/<submit> protocol. |

Release Notes

Each release note documents new tasks, models, architectural changes, and migration steps.

| Version | Theme | Highlights |
| --- | --- | --- |
| v0.7 | Operational simplicity | YAML-first config, reasoning-tag stripping, Lance-backed video, skill-based agent workflows. |
| v0.6 | Evaluation as a service | Async HTTP server, adaptive API concurrency (~7.5x throughput), statistical rigor (CI, paired t-test). |
| v0.5 | Audio expansion | Comprehensive audio evaluation, response caching, 50+ benchmark variants. |
| v0.4 | Scale and reasoning | Distributed evaluation, reasoning benchmarks, unified chat interface. |
| v0.3 | Audio foundations | Initial audio model support (Qwen2-Audio, Gemini-Audio). |

Additional Resources