lmms-eval

Current Tasks

The name in parentheses after each task is its task name in lmms_eval. Use this name to specify the dataset in the configuration file.

Note: This documentation is manually maintained. For the most up-to-date and complete list of supported tasks, please run:

python -m lmms_eval --tasks list

To see the number of questions in each task:

python -m lmms_eval --tasks list_with_num

(Note: list_with_num will download all datasets and may require significant time and storage)
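
The listing commands above can also be assembled from a script. The following is a minimal sketch, assuming lmms_eval is installed in the current Python environment. `build_list_command` is a hypothetical helper name, not part of lmms_eval, and it uses only the `--tasks list` and `--tasks list_with_num` arguments shown above.

```python
import sys
from typing import List


def build_list_command(with_counts: bool = False) -> List[str]:
    """Return the argv for the task-listing commands documented above.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    selector = "list_with_num" if with_counts else "list"
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "lmms_eval", "--tasks", selector]


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(...) to execute it.
    print(" ".join(build_list_command()))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later passed to `subprocess.run`.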


Summary Statistics

| Modality | Task Count |
| --- | --- |
| Image Understanding & VQA | 60+ |
| Multi-image Tasks | 15+ |
| Video Understanding | 25+ |
| Long Video & Temporal | 10+ |
| Audio & Speech | 20+ |
| Document Understanding | 20+ |
| Mathematical Reasoning | 12+ |
| Spatial & Grounding | 10+ |
| Text-only Language Tasks | 15+ |
| Total | 198+ |

1. Image Tasks

Core VQA & Understanding Benchmarks

MMBench Family

  • MMBench (mmbench)
    • MMBench English (mmbench_en)
      • MMBench English Dev (mmbench_en_dev)
      • MMBench English Test (mmbench_en_test)
    • MMBench Chinese (mmbench_cn)
      • MMBench Chinese Dev (mmbench_cn_dev)
      • MMBench Chinese Test (mmbench_cn_test)
  • MME (mme)
  • MME-CoT (mme_cot)
  • MME-RealWorld (mmerealworld)
    • MME-RealWorld English (mmerealworld)
    • MME-RealWorld Mini (mmerealworld_lite)
    • MME-RealWorld Chinese (mmerealworld_cn)
  • MME-SCI (mme_sci)
    • MME-SCI (mme_sci)
    • MME-SCI-Image (mme_sci_images)
  • MMRefine (mmrefine)
  • MMStar (mmstar)
  • MMUPD (mmupd)
    • MMUPD Base (mmupd_base)
      • MMAAD Base (mmaad_base)
      • MMIASD Base (mmiasd_base)
      • MMIVQD Base (mmivqd_base)
    • MMUPD Option (mmupd_option)
      • MMAAD Option (mmaad_option)
      • MMIASD Option (mmiasd_option)
      • MMIVQD Option (mmivqd_option)
    • MMUPD Instruction (mmupd_instruction)
      • MMAAD Instruction (mmaad_instruction)
      • MMIASD Instruction (mmiasd_instruction)
      • MMIVQD Instruction (mmivqd_instruction)
  • MMVet (mmvet)
  • MMVet v2 (mmvetv2)
  • MMVU (mmvu)
  • MMWorld (mmworld)
  • MTVQA (mtvqa)
  • MMSI-Bench (mmsi_bench)
  • MMSearch (mmsearch)
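
The indentation in the list above can be read as groups that contain leaf tasks. The sketch below, using only the MMBench and MMUPD entries as written above, stores that nesting as a dictionary and flattens a group into its leaf task names. `TASK_TREE` and `leaf_tasks` are hypothetical names for illustration, not lmms_eval APIs, and whether lmms_eval resolves group names the same way internally is an assumption.

```python
from typing import Dict, List, Optional

# Nesting copied from the MMBench and MMUPD entries above.
# A non-empty dict is a group; an empty dict marks a leaf task.
TASK_TREE: Dict[str, dict] = {
    "mmbench": {
        "mmbench_en": {"mmbench_en_dev": {}, "mmbench_en_test": {}},
        "mmbench_cn": {"mmbench_cn_dev": {}, "mmbench_cn_test": {}},
    },
    "mmupd": {
        "mmupd_base": {"mmaad_base": {}, "mmiasd_base": {}, "mmivqd_base": {}},
        "mmupd_option": {"mmaad_option": {}, "mmiasd_option": {}, "mmivqd_option": {}},
        "mmupd_instruction": {
            "mmaad_instruction": {},
            "mmiasd_instruction": {},
            "mmivqd_instruction": {},
        },
    },
}


def find_subtree(name: str, tree: Dict[str, dict]) -> Optional[dict]:
    """Depth-first search for `name`; returns its children, or None if absent."""
    for key, children in tree.items():
        if key == name:
            return children
        found = find_subtree(name, children)
        if found is not None:
            return found
    return None


def leaf_tasks(name: str, tree: Dict[str, dict] = TASK_TREE) -> List[str]:
    """Flatten a group name into the leaf task names listed beneath it."""
    subtree = find_subtree(name, tree)
    if subtree is None:
        raise KeyError(f"unknown task or group: {name}")
    if not subtree:  # an empty dict means `name` is itself a leaf
        return [name]
    leaves: List[str] = []
    for child in subtree:
        leaves.extend(leaf_tasks(child, subtree))
    return leaves


if __name__ == "__main__":
    print(leaf_tasks("mmbench_en"))
```

Under this reading, `leaf_tasks("mmupd")` yields the nine `mmaad_*`, `mmiasd_*`, and `mmivqd_*` names, and a leaf such as `mmbench_cn_dev` maps to itself.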

Hallucination & Bias Evaluation

Safety & Red-Teaming

  • JailbreakBench Behaviors (safety_redteam)
    • Harmful split (safety_jailbreakbench_harmful)
    • Benign split (safety_jailbreakbench_benign)

Multilingual Benchmarks

  • Multilingual LLaVA Bench
    • llava_in_the_wild_arabic
    • llava_in_the_wild_bengali
    • llava_in_the_wild_chinese
    • llava_in_the_wild_french
    • llava_in_the_wild_hindi
    • llava_in_the_wild_japanese
    • llava_in_the_wild_russian
    • llava_in_the_wild_spanish
    • llava_in_the_wild_urdu
  • VCR-Wiki
    • VCR-Wiki English
      • VCR-Wiki English easy 100 (vcr_wiki_en_easy_100)
      • VCR-Wiki English easy 500 (vcr_wiki_en_easy_500)
      • VCR-Wiki English easy (vcr_wiki_en_easy)
      • VCR-Wiki English hard 100 (vcr_wiki_en_hard_100)
      • VCR-Wiki English hard 500 (vcr_wiki_en_hard_500)
      • VCR-Wiki English hard (vcr_wiki_en_hard)
    • VCR-Wiki Chinese
      • VCR-Wiki Chinese easy 100 (vcr_wiki_zh_easy_100)
      • VCR-Wiki Chinese easy 500 (vcr_wiki_zh_easy_500)
      • VCR-Wiki Chinese easy (vcr_wiki_zh_easy)
      • VCR-Wiki Chinese hard 100 (vcr_wiki_zh_hard_100)
      • VCR-Wiki Chinese hard 500 (vcr_wiki_zh_hard_500)
      • VCR-Wiki Chinese hard (vcr_wiki_zh_hard)
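
The VCR-Wiki task names above follow a regular pattern: language, difficulty, and an optional numeric suffix. As a minimal sketch, the names can be generated rather than typed by hand. `vcr_wiki_task_names` is a hypothetical helper, not an lmms_eval function, and the pattern is inferred only from the twelve names listed above.

```python
from itertools import product
from typing import List

LANGUAGES = ("en", "zh")
DIFFICULTIES = ("easy", "hard")
# "" stands for the entry without a numeric suffix in the list above.
SIZES = ("100", "500", "")


def vcr_wiki_task_names() -> List[str]:
    """Build the VCR-Wiki task names in the same order as the list above."""
    names: List[str] = []
    for lang, difficulty, size in product(LANGUAGES, DIFFICULTIES, SIZES):
        suffix = f"_{size}" if size else ""
        names.append(f"vcr_wiki_{lang}_{difficulty}{suffix}")
    return names


if __name__ == "__main__":
    # A comma-separated string of task names, e.g. for reuse in a script.
    print(",".join(vcr_wiki_task_names()))
```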

Quality & Low-level Vision

  • Q-Bench (qbenchs_dev)
    • Q-Bench2-HF (qbench2_dev)
    • Q-Bench-HF (qbench_dev)
    • A-Bench-HF (abench_dev)
  • SalBench (salbench)
    • p3, p3_box, p3_box_img
    • o3, o3_box, o3_box_img
  • LV-Bench (lvbench)
  • MMVP (mmvp)
  • VSI-Bench (vsibench)
  • HR-Bench (hrbench)

Specialized Vision Tasks


2. Multi-image Tasks

  • CMMMU (cmmmu)
    • CMMMU Validation (cmmmu_val)
    • CMMMU Test (cmmmu_test)
  • HallusionBench (hallusion_bench_image)
  • ICON-QA (iconqa)
    • ICON-QA Validation (iconqa_val)
    • ICON-QA Test (iconqa_test)
  • JMMMU (jmmmu)
  • JMMMU-Pro (jmmmu_pro)
  • LLaVA-NeXT-Interleave-Bench (llava_interleave_bench)
    • llava_interleave_bench_in_domain
    • llava_interleave_bench_out_domain
    • llava_interleave_bench_multi_view
  • MIRB (mirb)
  • MMLongBench (mmlongbench)
  • MMMU (mmmu)
    • MMMU Validation (mmmu_val)
    • MMMU Test (mmmu_test)
  • MMMU_Pro (mmmu_pro)
    • MMMU Pro (mmmu_pro)
      • MMMU Pro Original (mmmu_pro_original)
      • MMMU Pro Vision (mmmu_pro_vision)
    • MMMU Pro COT (mmmu_pro_cot)
      • MMMU Pro Original COT (mmmu_pro_original_cot)
      • MMMU Pro Vision COT (mmmu_pro_vision_cot)
      • MMMU Pro Composite COT (mmmu_pro_composite_cot)
  • MMT Multiple Image (mmt_mi)
    • MMT Multiple Image Validation (mmt_mi_val)
    • MMT Multiple Image Test (mmt_mi_test)
  • MuirBench (muirbench)
  • MP-DocVQA (multidocvqa)
    • MP-DocVQA Validation (multidocvqa_val)
    • MP-DocVQA Test (multidocvqa_test)
  • OlympiadBench (olympiadbench)
    • OlympiadBench Test English (olympiadbench_test_en)
    • OlympiadBench Test Chinese (olympiadbench_test_cn)
  • OlympiadBench MIMO (olympiadbench_mimo)
  • MEGA-Bench (megabench)
    • MEGA-Bench Core (megabench_core)
    • MEGA-Bench Open (megabench_open)
    • MEGA-Bench Core single-image subset (megabench_core_si)
    • MEGA-Bench Open single-image subset (megabench_open_si)

3. Video Tasks

General Video Understanding

  • ActivityNet-QA (activitynetqa_generation)
  • CVRR-ES (cvrr)
    • cvrr_continuity_and_object_instance_count
    • cvrr_fine_grained_action_understanding
    • cvrr_interpretation_of_social_context
    • cvrr_interpretation_of_visual_context
    • cvrr_multiple_actions_in_a_single_video
    • cvrr_non_existent_actions_with_existent_scene_depictions
    • cvrr_non_existent_actions_with_non_existent_scene_depictions
    • cvrr_partial_actions
    • cvrr_time_order_understanding
    • cvrr_understanding_emotional_context
    • cvrr_unusual_and_physically_anomalous_activities
  • CinePile (cinepile)
  • EgoSchema (egoschema)
    • egoschema_mcppl
    • egoschema_subset_mcppl
    • egoschema_subset
  • EgoPlan (egoplan)
  • EgoTempo (egotempo)
  • EgoThink (egothink)
  • MLVU (mlvu)
  • MMT-Bench (mmt)
    • MMT Validation (mmt_val)
    • MMT Test (mmt_test)
  • MVBench (mvbench)
    • mvbench_action_sequence
    • mvbench_moving_count
    • mvbench_action_prediction
    • mvbench_episodic_reasoning
    • mvbench_action_antonym
    • mvbench_action_count
    • mvbench_scene_transition
    • mvbench_object_shuffle
    • mvbench_object_existence
    • mvbench_fine_grained_pose
    • mvbench_unexpected_action
    • mvbench_moving_direction
    • mvbench_state_change
    • mvbench_object_interaction
    • mvbench_character_order
    • mvbench_action_localization
    • mvbench_counterfactual_inference
    • mvbench_fine_grained_action
    • mvbench_moving_attribute
    • mvbench_egocentric_navigation
  • TVBench (tvbench)
    • tvbench_action_antonym
    • tvbench_action_count
    • tvbench_action_localization
    • tvbench_action_sequence
    • tvbench_egocentric_sequence
    • tvbench_moving_direction
    • tvbench_object_count
    • tvbench_object_shuffle
    • tvbench_scene_transition
    • tvbench_unexpected_action
  • MotionBench (motionbench)
    • motionbench_full
  • NExT-QA (nextqa)
    • NExT-QA Multiple Choice Test (nextqa_mc_test)
    • NExT-QA Open Ended Validation (nextqa_oe_val)
    • NExT-QA Open Ended Test (nextqa_oe_test)
  • PerceptionTest (perceptiontest)
    • PerceptionTest Test
      • perceptiontest_test_mc
      • perceptiontest_test_mcppl
    • PerceptionTest Validation
      • perceptiontest_val_mc
      • perceptiontest_val_mcppl
  • PLM VideoBench (plm_videobench)
  • SciVideoBench (scivideobench)
  • MINERVA (minerva)
  • Video-ChatGPT (videochatgpt)
    • Video-ChatGPT Generic (videochatgpt_gen)
    • Video-ChatGPT Temporal (videochatgpt_temporal)
    • Video-ChatGPT Consistency (videochatgpt_consistency)
  • Video-MME (videomme)
  • Video-MMMU (videommmu)
  • VideoEval-Pro (videoevalpro)
  • VideoMathQA (videomathqa)
  • Vinoground (vinoground)
  • WorldQA (worldqa)
    • WorldQA Generation (worldqa_gen)
    • WorldQA Multiple Choice (worldqa_mc)
  • WorldVQA (worldvqa)
    • WorldQA Compatibility Generation (worldvqa_gen)
    • WorldQA Compatibility Multiple Choice (worldvqa_mc)
  • YouCook2 (youcook2_val)

Long Video & Temporal Understanding

  • Charades-STA (charades_sta)
  • FALCON-Bench (FALCONBench) - One-hour-long video understanding
  • LEMONADE (lemonade)
  • LongTimescope (longtimescope)
  • LongVT (longvt) - Tool-based long video understanding
  • LongVideoBench (longvideobench)
  • NEPTUNE (neptune)
    • Video-path subsets: neptune_full_v, neptune_mma_v, neptune_mmh_v
    • Frame-sampled subsets: neptune_full_i, neptune_mma_i, neptune_mmh_i
    • Example: python -m lmms_eval --model qwen2_5_vl --tasks neptune_full_v --limit 5 --batch_size 1
  • MovieChat (moviechat)
    • Global Mode for entire video (moviechat_global)
    • Breakpoint Mode for specific moments (moviechat_breakpoint)
  • TempCompass (tempcompass)
    • tempcompass_multi_choice
    • tempcompass_yes_no
    • tempcompass_caption_matching
    • tempcompass_captioning
  • TemporalBench (temporalbench)
    • temporalbench_short_qa
    • temporalbench_long_qa
    • temporalbench_short_caption
  • Timescope (timescope)

Video Captioning & Description

  • Vatex (vatex)
    • Vatex Chinese (vatex_val_zh)
    • Vatex Test (vatex_test)
  • VDC (vdc)
    • VDC Detailed Caption (detailed_test)
    • VDC Camera Caption (camera_test)
    • VDC Short Caption (short_test)
    • VDC Background Caption (background_test)
    • VDC Main Object Caption (main_object_test)
  • VideoDetailDescription (video_dc499)
  • Video-TT (video-tt)
  • VITATECS (vitatecs)
    • VITATECS Direction (vitatecs_direction)
    • VITATECS Intensity (vitatecs_intensity)
    • VITATECS Sequence (vitatecs_sequence)
    • VITATECS Compositionality (vitatecs_compositionality)
    • VITATECS Localization (vitatecs_localization)
    • VITATECS Type (vitatecs_type)

4. Audio & Speech Tasks

Speech Recognition

Speech Translation

Audio Understanding & QA


5. Document Understanding Tasks

  • DOCVQA (docvqa)
    • DOCVQA Validation (docvqa_val)
    • DOCVQA Test (docvqa_test)
  • DUDE (dude)
  • MMLongBench-Doc (mmlongbench_doc)
  • GEdit-Bench (gedit_bench)
  • Infographic VQA (infovqa)
    • Infographic VQA Validation (infovqa_val)
    • Infographic VQA Test (infovqa_test)
  • OfficeQA (officeqa)
  • OCRBench (ocrbench)
  • OCRBench v2 (ocrbench_v2)
  • OmniDocBench (omnidocbench)
  • PRISMM-Bench (prismm_bench)
    • PRISMM-Bench Identification (prismm_bench_identification)
    • PRISMM-Bench Identification Whole Page Context (prismm_bench_identification_whole_page)
    • PRISMM-Bench Identification Whole Document Context (prismm_bench_identification_whole_doc)
    • PRISMM-Bench Remedy (prismm_bench_remedy)
    • PRISMM-Bench Remedy Whole Page Context (prismm_bench_remedy_whole_page)
    • PRISMM-Bench Remedy Whole Document Context (prismm_bench_remedy_whole_doc)
    • PRISMM-Bench Pair Match (prismm_bench_pair_match)
  • ScreenSpot (screenspot)
    • ScreenSpot REC / Grounding (screenspot_rec)
    • ScreenSpot REG / Instruction Generation (screenspot_reg)
  • ST-VQA (stvqa)
  • SynthDog (synthdog)
    • SynthDog English (synthdog_en)
    • SynthDog Chinese (synthdog_zh)
  • TextCaps (textcaps)
    • TextCaps Validation (textcaps_val)
    • TextCaps Test (textcaps_test)
  • TextVQA (textvqa)
    • TextVQA Validation (textvqa_val)
    • TextVQA Test (textvqa_test)
  • WebSRC (websrc)
    • WebSRC Validation (websrc_val)
    • WebSRC Test (websrc_test)

6. Mathematical Reasoning Tasks

  • AIME (aime)
  • DynaMath (dynamath)
  • GSM8K (gsm8k)
  • MathCanvas (mathcanvas)
    • MathCanvas Algebra (mathcanvas_algebra)
    • MathCanvas Analytic Geometry (mathcanvas_analytic_geometry)
    • MathCanvas Calculus and Vector (mathcanvas_calculus_and_vector)
    • MathCanvas Plane Geometry (mathcanvas_plane_geometry)
    • MathCanvas Solid Geometry (mathcanvas_solid_geometry)
    • MathCanvas Statistics (mathcanvas_statistics)
    • MathCanvas Transformational Geometry (mathcanvas_transformational_geometry)
    • MathCanvas Trigonometry (mathcanvas_trigonometry)
  • MathKangaroo (mathkangaroo)
  • MathVerse (mathverse)
    • MathVerse Text Dominant (mathverse_testmini_text_dominant)
    • MathVerse Text Only (mathverse_testmini_text_only)
    • MathVerse Text Lite (mathverse_testmini_text_lite)
    • MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
    • MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
    • MathVerse Vision Only (mathverse_testmini_vision_only)
  • MathVision (mathvision)
    • MathVision TestMini (mathvision_testmini)
    • MathVision Test (mathvision_test)
    • MathVision Reason TestMini (mathvision_reason_testmini)
    • MathVision Reason Test (mathvision_reason_test)
  • MathVista (mathvista)
    • MathVista Validation (mathvista_testmini)
    • MathVista Test (mathvista_test)
  • OpenAI Math (openai_math)
  • SciBench (scibench)
  • WeMath (wemath)

7. Spatial & Grounding Tasks

Referring Expression Comprehension

  • Ferret (ferret)
  • OSWorld-Verified (OSWorld-G) (osworld_g)
  • RefCOCO (refcoco)
    • refcoco_seg_test, refcoco_seg_val
    • refcoco_seg_testA, refcoco_seg_testB
    • refcoco_bbox_test, refcoco_bbox_val
    • refcoco_bbox_testA, refcoco_bbox_testB
  • RefCOCO+ (refcoco+)
    • refcoco+_seg_val, refcoco+_seg_testA, refcoco+_seg_testB
    • refcoco+_bbox_val, refcoco+_bbox_testA, refcoco+_bbox_testB
  • RefCOCOg (refcocog)
    • refcocog_seg_test, refcocog_seg_val
    • refcocog_bbox_test, refcocog_bbox_val
  • RefSpatial (refspatial)

Spatial Reasoning


8. Text-only Language Tasks


9. Multimodal Evaluation & Meta-benchmarks


10. Supported Models

| Model Name | Class | Description |
| --- | --- | --- |
| qwen2_5_vl | Qwen2_5_VL | Qwen2.5-VL vision-language model |
| qwen3_vl | Qwen3_VL | Qwen3-VL vision-language model |
| llava_hf | LlavaHf | LLaVA via Hugging Face |
| llava_onevision1_5 | Llava_OneVision1_5 | LLaVA-OneVision 1.5 |
| thyme | Thyme | Thyme multimodal model |
| longvila | LongVila | Long video understanding model |
| bagel_lmms_engine | BagelLmmsEngine | Bagel LMMS engine |
| vllm | VLLM | vLLM backend |
| vllm_generate | VLLMGenerate | vLLM generation mode |
| sglang | Sglang | SGLang serving backend |
| huggingface | Huggingface | Generic HuggingFace models |
| openai | OpenAICompatible | OpenAI-compatible APIs (aliases: openai_compatible, openai_compatible_chat) |
| async_openai | AsyncOpenAIChat | Async OpenAI chat (alias: async_openai_compatible_chat) |
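
The table above lists aliases for two of the API-backed models. The sketch below shows one way to normalize an alias to the model name in the first column. `MODEL_ALIASES` and `canonical_model_name` are hypothetical names for illustration; only the alias strings themselves come from the table, and how lmms_eval resolves aliases internally is not shown here.

```python
from typing import Dict

# Alias -> model name, copied from the table above.
MODEL_ALIASES: Dict[str, str] = {
    "openai_compatible": "openai",
    "openai_compatible_chat": "openai",
    "async_openai_compatible_chat": "async_openai",
}


def canonical_model_name(name: str) -> str:
    """Map a known alias to its model name; other names pass through unchanged."""
    return MODEL_ALIASES.get(name, name)


if __name__ == "__main__":
    for candidate in ("openai_compatible_chat", "qwen2_5_vl"):
        print(candidate, "->", canonical_model_name(candidate))
```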

Simple/Legacy Models

| Model Name | Class | Modality |
| --- | --- | --- |
| aero | Aero | Image, Audio |
| aria | Aria | Image, Video |
| auroracap | AuroraCap | Image, Video captioning |
| bagel | Bagel | Image |
| batch_gpt4 | BatchGPT4 | API |
| claude | Claude | Image, Video |
| cogvlm2 | CogVLM2 | Image |
| egogpt | EgoGPT | Video |
| from_log | FromLog | Utility |
| fuyu | Fuyu | Image |
| gemini_api | GeminiAPI | Image, Audio |
| gemma3 | Gemma3 | Image |
| gpt4o_audio | GPT4OAudio | Audio, Vision API |
| gpt4v | GPT4V | Image, Video API |
| idefics2 | Idefics2 | Image |
| instructblip | InstructBLIP | Image |
| internvideo2 | InternVideo2 | Video |
| internvideo2_5 | InternVideo2_5 | Video |
| internvl | InternVLChat | Image |
| internvl2 | InternVL2 | Image |
| llama_vid | LLaMAVid | Video |
| llama_vision | LlamaVision | Image |
| llava | Llava | Image |
| llava_onevision | Llava_OneVision | Image, Video |
| llava_onevision_moviechat | Llava_OneVision_MovieChat | Long Video |
| llava_sglang | LlavaSglang | Image (vLLM) |
| llava_vid | LlavaVid | Video |
| longva | LongVA | Long Video |
| mantis | Mantis | Multi-image |
| minicpm_v | MiniCPM_V | Image |
| minimonkey | MiniMonkey | Image |
| moviechat | MovieChat | Long Video |
| mplug_owl_video | mplug_Owl | Video |
| ola | Ola | Multimodal |
| oryx | Oryx | Multimodal |
| phi3v | Phi3v | Image |
| phi4_multimodal | Phi4 | Multimodal |
| plm | PerceptionLM | Image |
| qwen_vl | Qwen_VL | Image |
| qwen_vl_api | Qwen_VL_API | API |
| qwen2_5_omni | Qwen2_5_Omni | Image, Video, Audio |
| qwen2_audio | Qwen2_Audio | Audio |
| qwen2_vl | Qwen2_VL | Image, Video |
| reka | Reka | Multimodal API |
| ross | Ross | Multimodal |
| slime | Slime | Multimodal |
| srt_api | SRT_API | API |
| tinyllava | TinyLlava | Image |
| videoChatGPT | VideoChatGPT | Video |
| video_llava | VideoLLaVA | Video |
| videochat2 | VideoChat2 | Video |
| videochat_flash | VideoChat_Flash | Video |
| videollama3 | VideoLLaMA3 | Video |
| vila | VILA | Image, Video |
| vita | VITA | Multimodal |
| vora | VoRA | Multimodal |
| whisper | Whisper | Audio |
| whisper_vllm | WhisperVllm | Audio |
| xcomposer2_4KHD | XComposer2_4KHD | High-resolution Image |
| xcomposer2d5 | XComposer2D5 | Image |

Modality Support Summary

| Modality | Model Count |
| --- | --- |
| Image | 50+ |
| Video | 20+ |
| Audio | 5+ |
| Multimodal (2+ modalities) | 15+ |
| API-based | 8+ |
| Total Unique Models | 70+ |