Changelog
v0.7 (2026-02-27)
Highlights
- 25+ new benchmark tasks spanning document understanding, video, math, spatial, AGI, audio, and safety domains. (§1)
- Image/Video I/O throughput upgrade: unified `read_video` with TorchCodec backend (up to 3.58x faster), DALI GPU decode, shared encoding helpers, and LRU caching. (§3)
- Lance-backed video mode: single Lance table on Hugging Face. Local-file, Lance-blob, and YouTube-URL resolution with priority fallback. (§4)
- Safety and red-teaming baseline: `safety_redteam` group with jailbreak ASR, refusal rate, toxicity, and over-refusal metrics. (§5)
- Efficiency metrics coverage: per-sample token counts, run-level throughput, and TTFT/TPOT on vLLM backends. (§6)
- Agentic task evaluation: new `generate_until_agentic` output type with iterative tool-call loop, deterministic simulators, and trace-level metrics. (§8)
- YAML config-driven evaluation: single `--config` YAML file replaces fragile CLI one-liners. Schema validation, env expansion, batch configs, full reproducibility via `resolved_cli_args`. (§9)
- Pipeline-level reasoning tag stripping: `<think>` blocks stripped in `evaluator.py` after filters, before `process_results()`. All backends, configurable via `--reasoning_tags`. (§10)
- Async OpenAI `message_format`: replaces `is_qwen3_vl` flag and `async_openai_qwen3_vl` class. `generate_until()` decomposed into 7 focused methods. (§11)
- Flattened JSONL log output: `generate_until` responses flattened from doubly-nested to singly-nested lists. Multi-choice tasks unchanged. (§12)
New Benchmark Tasks
Document understanding:
Video:
Math & reasoning:
- MathCanvas (#1161), MathKangaroo (#1158), VisuLogic (#1159), LLaVA-OV 1.5 RL reasoning collection (#1208)
Spatial & counting:
Knowledge & QA:
- SimpleVQA (#1184), WorldVQA (#1168), MTVQA (#1167), HiPhO (#1186), MME-CC (#1185), VPCT (#1183), ZeroBench (#1182)
AGI & agentic:
- ARC-AGI-1, ARC-AGI-2, BrowseComp (#1190), Vending-Bench 2, τ2-Bench Telecom
Audio (v0.7 Audio Update, #1124):
- AMI (meeting transcription, train/validation/test splits)
- CN College Listen MCQ (Chinese college listening comprehension)
- DREAM TTS MCQ (dialogue-based listening comprehension)
- EuroPal ASR (European Parliament speech recognition, test/validation splits)
- Song Describer (music captioning, train/validation splits)
Safety:
- JailbreakBench harmful + benign splits (`safety_redteam` group)
New Models
- NanoVLM (SigLIP2 + MLP projector + Qwen3-0.6B): chat-style evaluation with async multi-GPU inference via job-queue dispatch. (#1207)
- Async HF model: generic async multi-GPU worker backend for HuggingFace model families. Loads replicas on N GPUs with independent worker threads. (#1204)
Image/Video I/O Throughput Upgrade
- Unified `read_video` entry point: single function in `load_video.py` with `backend` parameter and `LMMS_VIDEO_DECODE_BACKEND` env var. Supports PyAV (default), TorchCodec, and DALI (GPU). (Release notes §3)
- TorchCodec multi-threaded decode: up to 3.58x faster than PyAV at 8 frames with `LMMS_VIDEO_TORCHCODEC_THREADS=8`. Thread tuning required: default threads (0/1) regress up to +150%.
- DALI GPU decode: optional GPU-accelerated backend via `LMMS_VIDEO_DALI_DEVICE=gpu`.
- Shared image encoding: `encode_image_to_base64` consolidated across protocol and simple adapters with path-metadata keyed cache.
- Media resolver LRU caching: `_candidate_roots_cached` (`maxsize=256`) and `_extension_variants` (`maxsize=4096`) cache deterministic path-derivation work. `Path.exists()` checks remain uncached for correctness.
- PyAV stream fallback: `seek(0)` before packet decode on stream failure. Set-membership lookup in frame selection. Preallocated output array fill replaces `list` + `np.stack`.
- LongVideoBench validation: decode latency `2.79s` -> `1.02s` (-63%, 2.7x speedup) with score reproducibility confirmed.
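The caching split described above (cache the deterministic path derivation, never cache the filesystem check) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the extension list, and the sub-directory layout are all assumptions made for the example.

```python
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=4096)
def extension_variants(name: str) -> tuple:
    """Pure string work: candidate file names for one media id (hypothetical extensions)."""
    stem = Path(name).stem
    return tuple(f"{stem}{ext}" for ext in (".mp4", ".mkv", ".webm"))


@lru_cache(maxsize=256)
def candidate_roots(base_dir: str) -> tuple:
    """Pure path derivation: directories worth probing (hypothetical layout)."""
    base = Path(base_dir)
    return (str(base), str(base / "videos"), str(base / "data"))


def resolve_media(base_dir: str, name: str) -> Optional[Path]:
    """Probe candidates in order; the exists() check is deliberately not cached."""
    for root in candidate_roots(base_dir):
        for variant in extension_variants(name):
            path = Path(root) / variant
            if path.exists():  # filesystem state can change between calls
                return path
    return None
```

Keeping `exists()` outside the cache means a file that appears (or disappears) between two calls is still resolved correctly, while the repeated string and path manipulation is paid only once per input.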
Lance-Backed Video Mode
- Lance table distribution: MINERVA videos stored as `video_blob` in a Lance table on Hugging Face (`lmms-lab-eval/minerva`). Single-package reproducible distribution. (Release notes §4)
- Resolution priority: local file (`MINERVA_VIDEO_DIR`) > Lance blob (`MINERVA_LANCE_VIDEO_URI`) > YouTube URL fallback.
- Build tooling: `tools/minerva_to_lance.py` converts local video files to Lance tables. `tools/bench_minerva_video_resolution.py` and `tools/bench_minerva_pipeline_latency.py` for latency benchmarking.
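The resolution priority above amounts to a three-step fallback chain. The sketch below shows only the ordering logic; the function name, the `<video_id>.mp4` file layout, and the return shape are assumptions for illustration, and the actual Lance blob read is left out.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_video_source(
    video_id: str,
    youtube_url: str,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (source_kind, location) following local > Lance > YouTube priority."""
    env = os.environ if env is None else env

    # 1. Local file wins when the directory is configured and the file is present.
    local_dir = env.get("MINERVA_VIDEO_DIR")
    if local_dir:
        candidate = Path(local_dir) / f"{video_id}.mp4"  # assumed naming scheme
        if candidate.is_file():
            return ("local", str(candidate))

    # 2. Otherwise use the Lance table when a URI is configured.
    lance_uri = env.get("MINERVA_LANCE_VIDEO_URI")
    if lance_uri:
        return ("lance", lance_uri)

    # 3. Last resort: the YouTube URL carried by the dataset row.
    return ("youtube", youtube_url)
```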
Safety and Red-Teaming Baseline
- `safety_redteam` task group: `safety_jailbreakbench_harmful` and `safety_jailbreakbench_benign` from `JailbreakBench/JBB-Behaviors`. (Release notes §5)
- Harmful split metrics: jailbreak ASR, refusal rate, toxicity score, content filter rejection rate, demographic/non-demographic refusal rates.
- Benign split metrics: over-refusal rate, benign toxicity score, content filter rejection rate, demographic/non-demographic refusal rates.
- Dual toxicity backends: Perspective API when `PERSPECTIVE_API_KEY` is set; offline keyword heuristic fallback.
Efficiency Metrics Coverage
- Per-sample token counts: `input_tokens`, `output_tokens`, `reasoning_tokens` (when backend metadata exists) in evaluation outputs. (Release notes §6) (#1125)
- Run-level throughput: `total_gen_tokens`, `total_elapsed_time`, `avg_speed` in results JSON.
- TTFT/TPOT: available on `vllm` chat backends via native runtime metrics. Not available on SGLang, OpenAI API, or HuggingFace local backends (wall-clock throughput only).
- Token-based efficiency output: `efficiency.overall.tokens_per_correct_answer`, `efficiency.overall.avg_output_tokens_per_sample`, and per-task breakdown. No provider-specific pricing dependencies.
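As a rough illustration of the token-based efficiency fields, the sketch below derives them from per-sample records. The exact definitions are not spelled out here, so this assumes `tokens_per_correct_answer` is total output tokens divided by the number of correct samples; the `correct` field and the function name are hypothetical.

```python
from typing import Dict, List, Optional


def efficiency_summary(samples: List[dict]) -> Dict[str, Optional[float]]:
    """Aggregate output-token efficiency over per-sample records.

    Each record is assumed to carry `output_tokens` (int) and `correct` (bool).
    """
    total_output = sum(s["output_tokens"] for s in samples)
    n_correct = sum(1 for s in samples if s["correct"])
    return {
        "avg_output_tokens_per_sample": total_output / len(samples) if samples else None,
        # Undefined when nothing is correct, so report None instead of dividing by zero.
        "tokens_per_correct_answer": total_output / n_correct if n_correct else None,
    }
```

Under this assumed definition, a model that spends many tokens on wrong answers is penalized, since those tokens still count toward the numerator.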
Skill-Based Agent Workflows
- Repository skill: `skills/lmms-eval-guide/SKILL.md` with reference files for models, tasks, and API server integration. Turns lmms-eval into a reusable operational skill for coding agents. (Release notes §7)
- Dispatch routing: model/task extension via skill references; training integration via HTTP service; quick validation via `--limit` smoke tests.
Agentic Task Evaluation
- `generate_until_agentic` output type: iterative evaluator loop where the model emits `<tool_call>` or `<submit>` tags and `doc_to_text` executes tools against deterministic Python simulators. Configurable via `max_agentic_steps`. (Release notes §8)
- Agentic tasks: `vending_bench2` (vending machine operation, 4 tools, 10 steps) and `tau2_bench_telecom` (telecom support, 4 tools, 8 steps). No external sandbox required.
- Trace metrics: success rate, trace step validity, state progress, termination quality, and composite trace quality.
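The iterative loop can be pictured as below. This is a simplified sketch under stated assumptions, not the evaluator's implementation: the JSON payload inside `<tool_call>` (`name` / `arguments`), the `tools` dictionary, and the trace record shape are all hypothetical; only the tag names and the `max_agentic_steps` bound come from the notes above.

```python
import json
import re
from typing import Callable, Dict

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
SUBMIT = re.compile(r"<submit>(.*?)</submit>", re.DOTALL)


def run_agentic_episode(
    model: Callable[[str], str],
    tools: Dict[str, Callable],
    prompt: str,
    max_agentic_steps: int = 10,
) -> dict:
    """Alternate model turns and simulated tool results until <submit> or the step cap."""
    transcript, trace = prompt, []
    for step in range(max_agentic_steps):
        output = model(transcript)
        submitted = SUBMIT.search(output)
        if submitted:
            return {"answer": submitted.group(1).strip(), "trace": trace, "terminated": True}
        call = TOOL_CALL.search(output)
        if call is None:
            trace.append({"step": step, "valid": False})
            transcript += "\n[error] expected <tool_call> or <submit>"
            continue
        try:
            request = json.loads(call.group(1))  # assumed JSON payload
            result = tools[request["name"]](**request.get("arguments", {}))
            valid = True
        except (ValueError, KeyError, TypeError, AttributeError) as exc:
            result, valid = f"error: {exc}", False
        trace.append({"step": step, "valid": valid})
        transcript += f"\n{output}\n[tool_result] {result}"
    # Step budget exhausted without a submission.
    return {"answer": None, "trace": trace, "terminated": False}
```

Because the tools are plain Python callables over in-memory state, a scripted model (for example, a function that replays fixed outputs) is enough to exercise the loop without any sandbox.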
Better One-Line Evaluation Support
- `--config` flag: one YAML file replaces long CLI commands. Maps directly to CLI arguments plus an `env` section for credentials and paths. (Release notes §9)
- Environment variable expansion: `${VAR}` expands from shell; `${VAR:-default}` provides fallback. Keys containing `KEY`, `TOKEN`, `SECRET`, or `PASSWORD` are auto-masked in log output.
- Override priority: defaults < YAML < CLI. CLI arguments always win, enabling a canonical config with per-run overrides.
- Schema validation: unknown keys in YAML now raise an error listing valid keys. Typos like `modle` no longer silently pass.
- Batch evaluation: YAML accepts a list of configs for multi-model runs in a single invocation.
- Full reproducibility: results JSON includes `resolved_cli_args`, the complete merged configuration. Reconstruct the exact experiment from any results file.
- Web UI integration: export/import YAML configs from the Web UI. Round-trip is lossless.
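The override priority, schema validation, and secret masking can be sketched together as follows. This illustrates the behaviour described above rather than the project's code: `VALID_KEYS` is an illustrative subset, and treating a CLI value of `None` as "not provided" is an assumption of this sketch.

```python
VALID_KEYS = {"model", "model_args", "tasks", "batch_size", "limit", "env"}  # illustrative subset
SENSITIVE_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def validate(yaml_config: dict) -> None:
    """Reject unknown keys so typos such as `modle` fail loudly."""
    unknown = sorted(set(yaml_config) - VALID_KEYS)
    if unknown:
        raise ValueError(f"Unknown config keys {unknown}; valid keys: {sorted(VALID_KEYS)}")


def merge(defaults: dict, yaml_config: dict, cli_args: dict) -> dict:
    """Apply defaults < YAML < CLI; CLI values of None count as 'not provided'."""
    validate(yaml_config)
    merged = dict(defaults)
    merged.update(yaml_config)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


def mask_env(env: dict) -> dict:
    """Hide values whose key name suggests a credential before logging."""
    return {
        k: "***" if any(marker in k.upper() for marker in SENSITIVE_MARKERS) else v
        for k, v in env.items()
    }
```

The merged dictionary plays the role of `resolved_cli_args`: storing it alongside the results is what makes a run reconstructible from its output file.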
Pipeline-Level Reasoning Tag Stripping
- Pipeline-level stripping (`lmms_eval/api/reasoning.py`): `strip_reasoning_tags()` runs in `evaluator.py` after the filter pipeline and before `process_results()`. All model backends benefit without per-model changes. (Release notes §10)
- Dual output preservation: `resps` stores raw output (with `<think>` blocks) for chain-of-thought analysis; `filtered_resps` stores the clean scored text. `resps` omitted when identical to `filtered_resps`.
- CLI control: `--reasoning_tags` accepts `none` (disable), default `<think>...</think>`, or custom JSON pairs like `'[["<think>", "</think>"], ["<reasoning>", "</reasoning>"]]'`.
- Per-task override: set `reasoning_tags` in task YAML to override the CLI flag per-task.
- Removed ad-hoc handling: deleted `parse_reasoning_model_answer` calls from 6 model files.
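A regex-based version of the stripping step might look like the sketch below. It reuses the name `strip_reasoning_tags` for readability, but the signature and body are assumptions and not the function in `lmms_eval/api/reasoning.py`; `build_log_entry` is a hypothetical helper that mirrors the `resps` / `filtered_resps` split.

```python
import re
from typing import Iterable, Tuple

DEFAULT_TAGS = (("<think>", "</think>"),)


def strip_reasoning_tags(text: str, tags: Iterable[Tuple[str, str]] = DEFAULT_TAGS) -> str:
    """Remove every open...close span (non-greedy, across newlines) and trim whitespace."""
    for open_tag, close_tag in tags:
        pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()


def build_log_entry(raw: str) -> dict:
    """Keep the scored text always; keep the raw text only when stripping changed it."""
    filtered = strip_reasoning_tags(raw)
    entry = {"filtered_resps": filtered}
    if raw != filtered:
        entry["resps"] = raw
    return entry
```

Custom pairs such as `[("<reasoning>", "</reasoning>")]` pass through the same loop, which is why a list of tag pairs is a convenient configuration shape.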
Support customized message_format in async_openai
- `message_format` parameter: replaces `is_qwen3_vl` flag and the separate `async_openai_qwen3_vl` model class. Currently supports `openai` (default) and `qwen3_vl` (per-frame timestamps for video). New formats require only an `elif` in `prepare_messages()`. (Release notes §11)
- Decomposed `generate_until()`: 130-line monolith split into `_build_video_kwargs()`, `prepare_messages()`, `_get_initial_concurrency()`, `_compute_dispatch_order()`, `_process_with_retry()`, `_update_concurrency()`, `_run_scheduling_loop()`. Main method is now 8 lines.
- `_AdaptiveConcurrencyTracker`: concurrency tracking state moved from scattered `nonlocal` closures to a dedicated dataclass.
Flattened JSONL Log Output
- Single-instance flattening: `generate_until` responses flattened from `[["text"]]` to `["text"]` at serialization time. Multi-choice `loglikelihood` tasks with N instances per doc remain `[["a"], ["b"], ...]`. (Release notes §12)
- Dedup preserved: `resps` omitted when identical to `filtered_resps` after flattening.
- In-memory format unchanged: `logged_samples` retains nested format for existing consumers (wandb logger, etc.).
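The flattening rule is small enough to show directly. The helper below is a hypothetical sketch of the serialization-time step: a single-instance response loses one level of nesting, while multi-instance responses are returned untouched.

```python
def flatten_resps(resps: list) -> list:
    """Collapse [["text"]] to ["text"]; leave multi-instance lists like [["a"], ["b"]] alone."""
    if len(resps) == 1 and isinstance(resps[0], list):
        return list(resps[0])
    return resps
```

Since the function returns a new list for the single-instance case, the in-memory nested structure stays intact for consumers that still expect it.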
CLI & UX
- Subcommand architecture (`lmms_eval/cli/`): modular dispatch with `eval`, `tasks`, `models`, `ui`, `serve`, `power`, `version` subcommands. Interactive wizard when `eval` is invoked without arguments. (#1202)
- Branded evaluation banner: startup banner showing version metadata and environment info.
- External usage guide: new `docs/external_usage.md` covering CLI subcommands and Python library access.
Bug Fixes
- Fix image vs video file path detection in `auto_doc_to_messages` fallback
- Align `osworld_g` polygon scoring with osworld-verified annotations (#1165)
- Harden `mmsi-bench` utils parsing against malformed model responses (#1162)
- Fix Whisper `world_size` initialization for single-process runtime (#1124)
- Fix Audio Flamingo 3 parameter handling (#1124)
- Restore `task_input_specs/redundancy_refactor.yaml` accidentally deleted by test commit
Refactoring
- Test suite modernization: migrated all test files from `unittest.TestCase` to pure pytest style (`test_protocol`, `test_construct_requests`, `test_evaluator`, `test_task_pipeline`, CLI tests). Added prompt stability tests and benchmark registration coverage.
- Video loader cleanup: renamed `read_video_pyav` -> `read_video` with backward-compat alias; removed dead `read_video_pyav_pil`, `read_video_pyav_base64`, and `_resize_image` functions; updated all 12 caller files.
- Reasoning utils deduplication: consolidated repeated reasoning utility patterns into factory functions across task modules.
- Removed `use_custom_video_loader` dead code from 5 model files (qwen2_5_vl, qwen3_vl, qwen3_omni, llava_onevision1_5, huggingface).
- Async OpenAI internal decomposition: `generate_until()` split into 7 focused methods with `_AdaptiveConcurrencyTracker` dataclass.
Infrastructure & Documentation
- Comprehensive test suite README with concrete code examples for each test category.
- Rewritten `docs/README.md` with pipeline diagram, code examples, and categorized table of contents.
- External usage guide (`docs/external_usage.md`) for CLI subcommands and Python library API.
- Updated example scripts: switched to OpenAI API model, promoted MMMU/VideoMMU/LongVideoBench as recommended benchmarks.
- SpatialTreeBench naming variant documentation (#1189).
- Agent skill files (`skills/lmms-eval-guide/`) for model, task, and API server integration workflows.
- Example YAML configs (`configs/`) for local, API, and batch evaluation patterns.
v0.6 (2026-02-16)
Highlights
- ~7.5x API throughput improvement via adaptive concurrency control, refill scheduling, prefix-aware queueing, and retry/backoff decoupling
- Statistical analysis framework: confidence intervals, clustered standard errors, paired comparison, power analysis, and model stability metrics
- Evaluation as a Service: HTTP eval server for async job submission decoupled from training loops
- Model Registry V2: manifest-driven unified model resolution with backward-compatible aliasing
- 50+ new evaluation tasks and 10+ new model integrations
- Minimum Python version: raised to 3.10
Architecture & Performance
- Adaptive concurrency control for API-backed evaluation (`async_openai`, `openai`). The controller adjusts in-flight request count using three online signals: failure rate, rate-limit hit rate (429/throttling), and p95 latency against a target budget. Measured ~7.5x throughput gain over static single-concurrency baseline on `mme` with `LIMIT=100`. (#1080, #1082)
- Refill-style scheduling: completed requests immediately release slots for new work, eliminating the full-window barrier where the slowest request gates the entire batch.
- Prefix-aware queueing: reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (#1080)
- Retry/backoff decoupling: `retry_backoff_s` is explicitly separate from request timeout, so retries don't sleep for long timeouts and tie up worker slots.
- Throughput metrics in results table: final output now includes requests/sec and wall time for each task. (#1078)
- `--offset` option: skip the first N samples in a dataset, useful for resuming partial runs or debugging specific subsets. (#1042)
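To make the three-signal controller concrete, here is a minimal sketch. The update rule is not specified in these notes, so this example swaps in a plain additive-increase / multiplicative-decrease (AIMD) policy; the class name, thresholds, and window semantics are all assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


@dataclass
class ConcurrencyController:
    current: int = 4
    min_limit: int = 1
    max_limit: int = 64
    target_p95_s: float = 10.0         # latency budget (assumed value)
    max_failure_rate: float = 0.05     # assumed threshold
    max_rate_limit_rate: float = 0.02  # assumed threshold

    def update(self, latencies: List[float], failures: int, rate_limited: int) -> int:
        """Feed one observation window; return the new in-flight request limit."""
        total = len(latencies) + failures + rate_limited
        if total == 0:
            return self.current
        unhealthy = (
            failures / total > self.max_failure_rate
            or rate_limited / total > self.max_rate_limit_rate
            or (bool(latencies) and p95(latencies) > self.target_p95_s)
        )
        if unhealthy:
            self.current = max(self.min_limit, self.current // 2)  # back off quickly
        else:
            self.current = min(self.max_limit, self.current + 1)   # probe upward slowly
        return self.current
```

Halving on any bad signal and growing by one otherwise keeps the controller conservative near a provider's rate limit while still recovering after transient errors.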
Model Registry V2
- Manifest-driven model resolution (`ModelRegistryV2`): all model names resolve through a single path. Two dicts in `models/__init__.py` (`AVAILABLE_SIMPLE_MODELS`, `AVAILABLE_CHAT_TEMPLATE_MODELS`) declare available models, merged into `ModelManifest` objects at startup. Chat is always preferred over simple unless `force_simple=True`. (#1070)
- Unified OpenAI model naming: canonical names shortened from `openai_compatible` / `async_openai_compatible` to `openai` / `async_openai`. Old names continue to work as aliases via `MODEL_ALIASES`. File renames: `chat/openai_compatible.py` -> `chat/openai.py`, `simple/openai_compatible.py` -> `simple/openai.py`. (#1083, #1084)
- Simple mode deprecation: the `doc_to_visual` + `doc_to_text` interface for API models is deprecated. New integrations should use `doc_to_messages` + `ChatMessages`.
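The resolution order (alias, then chat over simple unless forced) can be sketched as below. The three dictionary names mirror those mentioned above, but their contents, the class-name strings, and the `resolve_model` function are illustrative assumptions; the real registry builds `ModelManifest` objects rather than tuples.

```python
MODEL_ALIASES = {
    "openai_compatible": "openai",
    "async_openai_compatible": "async_openai",
}
# Illustrative entries only; values stand in for model classes.
AVAILABLE_SIMPLE_MODELS = {"openai": "OpenAISimple"}
AVAILABLE_CHAT_TEMPLATE_MODELS = {"openai": "OpenAIChat", "async_openai": "AsyncOpenAIChat"}


def resolve_model(name: str, force_simple: bool = False) -> tuple:
    """Map a user-facing name to (mode, implementation), preferring chat."""
    canonical = MODEL_ALIASES.get(name, name)
    chat = AVAILABLE_CHAT_TEMPLATE_MODELS.get(canonical)
    simple = AVAILABLE_SIMPLE_MODELS.get(canonical)
    if force_simple:
        if simple is None:
            raise KeyError(f"No simple implementation registered for {canonical!r}")
        return ("simple", simple)
    if chat is not None:
        return ("chat", chat)
    if simple is not None:
        return ("simple", simple)
    raise KeyError(f"Unknown model {name!r}")
```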
Statistical Analysis
- CLT and clustered standard error estimation: for benchmarks with correlated questions (e.g., multiple questions per video), specify `cluster_key` in task YAML to apply cluster-robust SE correction. Clustered SE can be 3x larger than naive estimates. Output format: `score +/- 1.96 x SE` (95% CI). (#989)
- Baseline-anchored paired comparison: paired t-test on per-question differences `d_i = score_A - score_B`, removing question difficulty variance to isolate model difference signal. Reports `mean_diff`, CI, and p-value. (#1006)
- Power analysis: compute minimum sample size to detect a given effect size (e.g., 2% improvement) before running an evaluation. Rule of thumb: reliable benchmarks need `n > 1000`. (#1007)
- Model stability metrics: run N samples per question (temp=0.7), report expected accuracy (EA), consensus accuracy (CA), internal variance (IV), and consistency rate (CR).
- Decontamination probing: settings for detecting potential data contamination in video benchmarks. (#990)
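The gap between naive and cluster-robust standard errors can be seen with a small worked sketch. It uses the standard cluster-robust variance of a mean with a G/(G-1) small-sample correction, which may differ in detail from the framework's estimator; the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Tuple


def naive_se(scores: List[float]) -> float:
    """CLT standard error of the mean, treating every question as independent."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)


def clustered_se(scores: List[float], clusters: List[Hashable]) -> float:
    """Cluster-robust SE: residuals are summed within each cluster before squaring."""
    n = len(scores)
    mean = sum(scores) / n
    residual_sums = defaultdict(float)
    for score, cluster in zip(scores, clusters):
        residual_sums[cluster] += score - mean
    g = len(residual_sums)
    variance = (g / (g - 1)) * sum(r * r for r in residual_sums.values()) / n**2
    return math.sqrt(variance)


def ci95(scores: List[float], se: float) -> Tuple[float, float]:
    """score +/- 1.96 x SE."""
    mean = sum(scores) / len(scores)
    return (mean - 1.96 * se, mean + 1.96 * se)
```

When every question is its own cluster the two estimators coincide; when questions about the same video are answered all-right or all-wrong together, the clustered estimate grows because the effective number of independent observations is closer to the number of videos than to the number of questions.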
Evaluation as a Service
- HTTP eval server: FastAPI-based server with endpoints for job submission (`/evaluate`), status polling (`/jobs/{id}`), queue management (`/queue`), and resource discovery (`/tasks`, `/models`). Includes `JobScheduler` for sequential GPU resource management. (#972)
- Client libraries: `EvalClient` (sync) and `AsyncEvalClient` (async) for programmatic job submission from training loops.
- Web UI: React + FastAPI web interface replacing the terminal TUI, with model/task selection, real-time command preview, and live output streaming. (#1001)
New Tasks
Spatial & 3D reasoning:
- 3DSR (#1072), Spatial457 (#1031), SpatialTreeBench (#994), ViewSpatial (#983), OmniSpatial (#896)
- SiteBench (#984, multi-image #996), VSIBench (debiased & pruned #975, multi-image #993)
- Blink, CV_Bench, Embspatial, ERQA (#927), RefSpatial, Where2Place (#940)
- SpatialViz (#894)
Knowledge & reasoning:
- CoreCognition (#1064), MMSU (#1058), Uni-MMMU (#1029), Geometry3K (#1030)
- AuxSolidMath (#1034), MindCube (#876), MMVP (#1028), RealUnify (#1033)
- IllusionBench (#1035), MME-SCI (#878), VLMs are Blind (#931), VLMs are Biased (#928)
- Reasoning task versions for multiple benchmarks (#926, #1038)
- VLMEvalKit-compatible Qwen task variants for MMMU and MMStar (#1021)
Video & streaming:
- MMSI-Video-Bench (#1053), OVOBench (#957), Mantis-Eval (#978)
- LongVT for long video with tool calling (#944), SciVideoBench (#875)
Multimodal & other:
- PRISMM-Bench (#1063), OSI-bench (#1068), mmar (#1057), PAIBench-U (#1050)
- SPAR-bench (#1011), BabyVision Gen (#1010) + Und (#1015)
- AV-SpeakerBench (#943), imgedit bench (#941), MMSearch-Plus (#1054)
- CaptionQA (#1004), StructEditBench (#1016), kris_bench (#1017)
- FALCON-Bench (#942), UEval (#890), SeePhys (#903), SNSBench (#930)
- STARE (#893), GroundingMe (#949), GEditBench (#939), JMMMU-Pro (#937)
- WenetSpeech test_net split (#1027)
New Models
- GLM4V, LLaMA 4 (#1056)
- OmniVinci, MiniCPM-o-2_6 (#1060)
- Uni-MoE-2.0-Omni, Baichuan-Omni-1d5 (#1059)
- Audio Flamingo 3, Kimi Audio (#1055)
- InternVL-HF (#1039), InternVL3, InternVL3.5 (#963)
- Bagel UMM (#1012), Cambrian-S (#977)
- Qwen3-VL (#883), Qwen3-Omni, Video-Salmonn-2 (#955)
- LLaVA-OneVision-1.5 chat interface (#887)
- Multi-round generation (`generate_until_multi_round`) for Qwen2.5-VL and Qwen2-VL (#960)
Bug Fixes
- Raise minimum supported Python version to 3.10 (#1079)
- Fix video loader memory leaks via resource cleanup (#1026)
- Replace hardcoded `.cuda()` with `.to(self._device)` for multi-GPU support (#1024)
- Fix Qwen2.5-VL nframes edge case (#992, #987)
- Fix multi-image token insertion for Cambrians model (#1075)
- Add dynamic `max_num` calculation to InternVL3 (#1069)
- Fix `partial` support in VSIBench metric calculation (#1041)
- Fix Qwen2-Audio parameter name error (#1081)
- Fix InternVL3 duplicate `<image>` token issue (#999)
- Fix hallusionbench processing for distributed eval (#885)
- Fix COCO Karpathy test data loading (#884)
- Fix nested dictionary input for vLLM `mm_processor_kwargs` (#915)
- Fix log_samples missing fields in doc (#731)
- Fix catastrophic backtracking in Charades eval regex
- Filter multimodal content from log samples while preserving metadata (#962)
- Fix Qwen2.5-VL batch size > 1 visual alignment (#971)
Infrastructure & Documentation
- Developer guidance for AI agents and contributors (AGENTS.md) (#1085)
- Restructured v0.6 release notes: top-down architecture overview
- README: reordered sections by user journey, simplified header
- Added CITATION.cff, FAQ, quickstart guide
- i18n README translations for 18 languages (#979)
- Scalable choice selection for evaluation (#1005)
- Use dependency lower bounds for broader compatibility (#969)