Current Tasks
Parentheses indicate the task name in lmms_eval. The task name is used to specify the dataset in the configuration file or on the command line.
Note: This documentation is manually maintained. For the most up-to-date and complete list of supported tasks, please run:

```shell
python -m lmms_eval --tasks list
```

To see the number of questions in each task:

```shell
python -m lmms_eval --tasks list_with_num
```

(Note: `list_with_num` will download all datasets and may require significant time and storage.)
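Once a task name is known, it can be passed directly to the evaluation entry point. A minimal invocation sketch follows; the model name `llava_hf` appears in the model tables below, while `--limit` and `--batch_size` match the flags used in the NEPTUNE example later in this document (the exact flag set for your installed version may differ):

```shell
# Smoke-test one model on two tasks; task names come from the lists below.
# --limit caps the number of examples per task; drop it for a full evaluation.
python -m lmms_eval \
    --model llava_hf \
    --tasks mme,mmbench_en_dev \
    --batch_size 1 \
    --limit 8
```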
Summary Statistics
| Modality | Task Count |
|---|---|
| Image Understanding & VQA | 60+ |
| Multi-image Tasks | 15+ |
| Video Understanding | 25+ |
| Long Video & Temporal | 10+ |
| Audio & Speech | 20+ |
| Document Understanding | 20+ |
| Mathematical Reasoning | 12+ |
| Spatial & Grounding | 10+ |
| Text-only Language Tasks | 15+ |
| Total | 198+ |
1. Image Tasks
Core VQA & Understanding Benchmarks
- AI2D (ai2d)
- BLINK (blink)
- ChartQA (chartqa)
- CharXiv (charxiv)
- COCO Caption (coco_cap)
- COCO 2014 Caption (coco2014_cap)
- COCO 2014 Caption Validation (coco2014_cap_val)
- COCO 2014 Caption Test (coco2014_cap_test)
- COCO 2017 Caption (coco2017_cap)
- COCO 2017 Caption MiniVal (coco2017_cap_val)
- COCO 2017 Caption MiniTest (coco2017_cap_test)
- ConBench (conbench)
- CountBench (countbench)
- CV-Bench (cv_bench)
- DetailCaps-4870 (detailcaps)
- FSC-147 (fsc147)
- Flickr30K (flickr30k)
- Flickr30K Test (flickr30k_test)
- GQA (gqa)
- GQA-ru (gqa_ru)
- II-Bench (ii_bench)
- IllusionVQA (illusionvqa)
- LiveBench (live_bench)
- LiveBench 06/2024 (live_bench_2406)
- LiveBench 07/2024 (live_bench_2407)
- LLaVA-Bench-Wilder (llava_wilder_small)
- LLaVA-Bench-COCO (llava_bench_coco)
- LLaVA-Bench (llava_in_the_wild)
- NaturalBench (naturalbench)
- NoCaps (nocaps)
- NoCaps Validation (nocaps_val)
- NoCaps Test (nocaps_test)
- OKVQA (ok_vqa)
- OKVQA Validation 2014 (ok_vqa_val2014)
- POPE (pope)
- RealWorldQA (realworldqa)
- ScienceQA (scienceqa_full)
- ScienceQA Full (scienceqa)
- ScienceQA IMG (scienceqa_img)
- SeedBench (seedbench)
- SeedBench 2 (seedbench_2)
- SeedBench 2 Plus (seedbench_2_plus)
- VibeEval (vibe_eval)
- VisuLogic (visulogic)
- VizWizVQA (vizwiz_vqa)
- VizWizVQA Validation (vizwiz_vqa_val)
- VizWizVQA Test (vizwiz_vqa_test)
- VQAv2 (vqav2)
- VQAv2 Validation (vqav2_val)
- VQAv2 Test (vqav2_test)
- WildVision-Bench (wildvision)
- WildVision 0617 (wildvision_0617)
- WildVision 0630 (wildvision_0630)
MMBench Family
- MMBench (mmbench)
- MMBench English (mmbench_en)
- MMBench English Dev (mmbench_en_dev)
- MMBench English Test (mmbench_en_test)
- MMBench Chinese (mmbench_cn)
- MMBench Chinese Dev (mmbench_cn_dev)
- MMBench Chinese Test (mmbench_cn_test)
- MME (mme)
- MME-CoT (mme_cot)
- MME-RealWorld (mmerealworld)
- MME-RealWorld Mini (mmerealworld_lite)
- MME-RealWorld Chinese (mmerealworld_cn)
- MME-SCI (mme_sci)
- MME-SCI-Image (mme_sci_images)
- MMRefine (mmrefine)
- MMStar (mmstar)
- MMUPD (mmupd)
- MMUPD Base (mmupd_base)
- MMAAD Base (mmaad_base)
- MMIASD Base (mmiasd_base)
- MMIVQD Base (mmivqd_base)
- MMUPD Option (mmupd_option)
- MMAAD Option (mmaad_option)
- MMIASD Option (mmiasd_option)
- MMIVQD Option (mmivqd_option)
- MMUPD Instruction (mmupd_instruction)
- MMAAD Instruction (mmaad_instruction)
- MMIASD Instruction (mmiasd_instruction)
- MMIVQD Instruction (mmivqd_instruction)
- MMVet (mmvet)
- MMVet v2 (mmvetv2)
- MMVU (mmvu)
- MMWorld (mmworld)
- MTVQA (mtvqa)
- MMSI-Bench (mmsi_bench)
- MMSearch (mmsearch)
Hallucination & Bias Evaluation
- HallusionBench (hallusion_bench_image)
- VLMs Are Biased (vlms_are_biased)
- VLMs Are Blind (vlmsareblind)
Safety & Red-Teaming
- JailbreakBench Behaviors (safety_redteam)
- Harmful split (safety_jailbreakbench_harmful)
- Benign split (safety_jailbreakbench_benign)
Multilingual Benchmarks
- Multilingual LLaVA Bench
- llava_in_the_wild_arabic
- llava_in_the_wild_bengali
- llava_in_the_wild_chinese
- llava_in_the_wild_french
- llava_in_the_wild_hindi
- llava_in_the_wild_japanese
- llava_in_the_wild_russian
- llava_in_the_wild_spanish
- llava_in_the_wild_urdu
- VCR-Wiki
- VCR-Wiki English
- VCR-Wiki English easy 100 (vcr_wiki_en_easy_100)
- VCR-Wiki English easy 500 (vcr_wiki_en_easy_500)
- VCR-Wiki English easy (vcr_wiki_en_easy)
- VCR-Wiki English hard 100 (vcr_wiki_en_hard_100)
- VCR-Wiki English hard 500 (vcr_wiki_en_hard_500)
- VCR-Wiki English hard (vcr_wiki_en_hard)
- VCR-Wiki Chinese
- VCR-Wiki Chinese easy 100 (vcr_wiki_zh_easy_100)
- VCR-Wiki Chinese easy 500 (vcr_wiki_zh_easy_500)
- VCR-Wiki Chinese easy (vcr_wiki_zh_easy)
- VCR-Wiki Chinese hard 100 (vcr_wiki_zh_hard_100)
- VCR-Wiki Chinese hard 500 (vcr_wiki_zh_hard_500)
- VCR-Wiki Chinese hard (vcr_wiki_zh_hard)
Quality & Low-level Vision
- Q-Bench (qbenchs_dev)
- Q-Bench2-HF (qbench2_dev)
- Q-Bench-HF (qbench_dev)
- A-Bench-HF (abench_dev)
- SalBench (salbench)
- p3, p3_box, p3_box_img
- o3, o3_box, o3_box_img
- LV-Bench (lvbench)
- MMVP (mmvp)
- VSI-Bench (vsibench)
- HR-Bench (hrbench)
Specialized Vision Tasks
- CameraBench VQA (camerabench_vqa)
- CUVA (cuva)
- DTC-Bench (dtcbench)
- FunQA (funqa)
- LiveXiv VQA (livexiv_vqa)
- LiveXiv TQA (livexiv_tqa)
- LSD-Bench (lsdbench)
- MIA-Bench (mia_bench)
- SNS-Bench (snsbench)
- TOMATO (tomato)
- VMC-Bench (vmcbench)
- ViVerBench (viverbench)
- Visual Puzzles (VisualPuzzles)
- VisualWebBench (visualwebbench)
- V*-Bench (vstar_bench)
- WorldSense (worldsense)
- AV-SpeakerBench (av-speakerbench)
2. Multi-image Tasks
- CMMMU (cmmmu)
- CMMMU Validation (cmmmu_val)
- CMMMU Test (cmmmu_test)
- HallusionBench (hallusion_bench_image)
- ICON-QA (iconqa)
- ICON-QA Validation (iconqa_val)
- ICON-QA Test (iconqa_test)
- JMMMU (jmmmu)
- JMMMU-Pro (jmmmu_pro)
- LLaVA-NeXT-Interleave-Bench (llava_interleave_bench)
- llava_interleave_bench_in_domain
- llava_interleave_bench_out_domain
- llava_interleave_bench_multi_view
- MIRB (mirb)
- MMLongBench (mmlongbench)
- MMMU (mmmu)
- MMMU Validation (mmmu_val)
- MMMU Test (mmmu_test)
- MMMU Pro (mmmu_pro)
- MMMU Pro Original (mmmu_pro_original)
- MMMU Pro Vision (mmmu_pro_vision)
- MMMU Pro COT (mmmu_pro_cot)
- MMMU Pro Original COT (mmmu_pro_original_cot)
- MMMU Pro Vision COT (mmmu_pro_vision_cot)
- MMMU Pro Composite COT (mmmu_pro_composite_cot)
- MMT Multiple Image (mmt_mi)
- MMT Multiple Image Validation (mmt_mi_val)
- MMT Multiple Image Test (mmt_mi_test)
- MuirBench (muirbench)
- MP-DocVQA (multidocvqa)
- MP-DocVQA Validation (multidocvqa_val)
- MP-DocVQA Test (multidocvqa_test)
- OlympiadBench (olympiadbench)
- OlympiadBench Test English (olympiadbench_test_en)
- OlympiadBench Test Chinese (olympiadbench_test_cn)
- OlympiadBench MIMO (olympiadbench_mimo)
- MEGA-Bench (megabench)
- MEGA-Bench Core (megabench_core)
- MEGA-Bench Open (megabench_open)
- MEGA-Bench Core single-image subset (megabench_core_si)
- MEGA-Bench Open single-image subset (megabench_open_si)
3. Video Tasks
General Video Understanding
- ActivityNet-QA (activitynetqa_generation)
- CVRR-ES (cvrr)
- cvrr_continuity_and_object_instance_count
- cvrr_fine_grained_action_understanding
- cvrr_interpretation_of_social_context
- cvrr_interpretation_of_visual_context
- cvrr_multiple_actions_in_a_single_video
- cvrr_non_existent_actions_with_existent_scene_depictions
- cvrr_non_existent_actions_with_non_existent_scene_depictions
- cvrr_partial_actions
- cvrr_time_order_understanding
- cvrr_understanding_emotional_context
- cvrr_unusual_and_physically_anomalous_activities
- CinePile (cinepile)
- EgoSchema (egoschema)
- egoschema_mcppl
- egoschema_subset_mcppl
- egoschema_subset
- EgoPlan (egoplan)
- EgoTempo (egotempo)
- EgoThink (egothink)
- MLVU (mlvu)
- MMT-Bench (mmt)
- MMT Validation (mmt_val)
- MMT Test (mmt_test)
- MVBench (mvbench)
- mvbench_action_sequence
- mvbench_moving_count
- mvbench_action_prediction
- mvbench_episodic_reasoning
- mvbench_action_antonym
- mvbench_action_count
- mvbench_scene_transition
- mvbench_object_shuffle
- mvbench_object_existence
- mvbench_fine_grained_pose
- mvbench_unexpected_action
- mvbench_moving_direction
- mvbench_state_change
- mvbench_object_interaction
- mvbench_character_order
- mvbench_action_localization
- mvbench_counterfactual_inference
- mvbench_fine_grained_action
- mvbench_moving_attribute
- mvbench_egocentric_navigation
- TVBench (tvbench)
- tvbench_action_antonym
- tvbench_action_count
- tvbench_action_localization
- tvbench_action_sequence
- tvbench_egocentric_sequence
- tvbench_moving_direction
- tvbench_object_count
- tvbench_object_shuffle
- tvbench_scene_transition
- tvbench_unexpected_action
- MotionBench (motionbench)
- motionbench_full
- NExT-QA (nextqa)
- NExT-QA Multiple Choice Test (nextqa_mc_test)
- NExT-QA Open Ended Validation (nextqa_oe_val)
- NExT-QA Open Ended Test (nextqa_oe_test)
- PerceptionTest (perceptiontest)
- PerceptionTest Test
- perceptiontest_test_mc
- perceptiontest_test_mcppl
- PerceptionTest Validation
- perceptiontest_val_mc
- perceptiontest_val_mcppl
- PLM VideoBench (plm_videobench)
- SciVideoBench (scivideobench)
- MINERVA (minerva)
- Video-ChatGPT (videochatgpt)
- Video-ChatGPT Generic (videochatgpt_gen)
- Video-ChatGPT Temporal (videochatgpt_temporal)
- Video-ChatGPT Consistency (videochatgpt_consistency)
- Video-MME (videomme)
- Video-MMMU (videommmu)
- VideoEval-Pro (videoevalpro)
- VideoMathQA (videomathqa)
- Vinoground (vinoground)
- WorldQA (worldqa)
- WorldQA Generation (worldqa_gen)
- WorldQA Multiple Choice (worldqa_mc)
- WorldVQA (worldvqa)
- WorldQA Compatibility Generation (worldvqa_gen)
- WorldQA Compatibility Multiple Choice (worldvqa_mc)
- YouCook2 (youcook2_val)
Long Video & Temporal Understanding
- Charades-STA (charades_sta)
- FALCON-Bench (FALCONBench) - One-hour-long video understanding
- LEMONADE (lemonade)
- LongTimescope (longtimescope)
- LongVT (longvt) - Tool-based long video understanding
- LongVideoBench (longvideobench)
- NEPTUNE (neptune)
- Video-path subsets: neptune_full_v, neptune_mma_v, neptune_mmh_v
- Frame-sampled subsets: neptune_full_i, neptune_mma_i, neptune_mmh_i
- Example:

```shell
python -m lmms_eval --model qwen2_5_vl --tasks neptune_full_v --limit 5 --batch_size 1
```
- MovieChat (moviechat)
- Global Mode for entire video (moviechat_global)
- Breakpoint Mode for specific moments (moviechat_breakpoint)
- TempCompass (tempcompass)
- tempcompass_multi_choice
- tempcompass_yes_no
- tempcompass_caption_matching
- tempcompass_captioning
- TemporalBench (temporalbench)
- temporalbench_short_qa
- temporalbench_long_qa
- temporalbench_short_caption
- Timescope (timescope)
Video Captioning & Description
- Vatex (vatex)
- Vatex Chinese (vatex_val_zh)
- Vatex Test (vatex_test)
- VDC (vdc)
- VDC Detailed Caption (detailed_test)
- VDC Camera Caption (camera_test)
- VDC Short Caption (short_test)
- VDC Background Caption (background_test)
- VDC Main Object Caption (main_object_test)
- VideoDetailDescription (video_dc499)
- Video-TT (video-tt)
- VITATECS (vitatecs)
- VITATECS Direction (vitatecs_direction)
- VITATECS Intensity (vitatecs_intensity)
- VITATECS Sequence (vitatecs_sequence)
- VITATECS Compositionality (vitatecs_compositionality)
- VITATECS Localization (vitatecs_localization)
- VITATECS Type (vitatecs_type)
4. Audio & Speech Tasks
Speech Recognition
- Common Voice 15 (common_voice_15)
- FLEURS (fleurs)
- GigaSpeech (gigaspeech)
- LibriSpeech (librispeech)
- Open ASR (open_asr)
- People Speech (people_speech)
- TEDLium (tedlium)
- WenetSpeech (wenet_speech)
- XLRS (xlrs)
Speech Translation
- CoVoST2 (covost2)
Audio Understanding & QA
- AIR-Bench (air_bench)
- air_bench_chat (chat-based audio understanding)
- air_bench_foundation (foundation audio tasks)
- Alpaca Audio (alpaca_audio)
- AV-Odyssey (av_odyssey)
- Clotho-AQA (clotho_aqa)
- MuchoMusic (muchomusic)
- MMAU (mmau)
- Step2 Audio Paralinguistic (step2_audio_paralinguistic)
- VocalSound (vocalsound)
- VoiceBench (voicebench)
- WavCaps (wavcaps)
5. Document Understanding Tasks
- DOCVQA (docvqa)
- DOCVQA Validation (docvqa_val)
- DOCVQA Test (docvqa_test)
- DUDE (dude)
- MMLongBench-Doc (mmlongbench_doc)
- GEdit-Bench (gedit_bench)
- Infographic VQA (infovqa)
- Infographic VQA Validation (infovqa_val)
- Infographic VQA Test (infovqa_test)
- OfficeQA (officeqa)
- OCRBench (ocrbench)
- OCRBench v2 (ocrbench_v2)
- OmniDocBench (omnidocbench)
- PRISMM-Bench (prismm_bench)
- PRISMM-Bench Identification (prismm_bench_identification)
- PRISMM-Bench Identification Whole Page Context (prismm_bench_identification_whole_page)
- PRISMM-Bench Identification Whole Document Context (prismm_bench_identification_whole_doc)
- PRISMM-Bench Remedy (prismm_bench_remedy)
- PRISMM-Bench Remedy Whole Page Context (prismm_bench_remedy_whole_page)
- PRISMM-Bench Remedy Whole Document Context (prismm_bench_remedy_whole_doc)
- PRISMM-Bench Pair Match (prismm_bench_pair_match)
- ScreenSpot (screenspot)
- ScreenSpot REC / Grounding (screenspot_rec)
- ScreenSpot REG / Instruction Generation (screenspot_reg)
- ST-VQA (stvqa)
- SynthDog (synthdog)
- SynthDog English (synthdog_en)
- SynthDog Chinese (synthdog_zh)
- TextCaps (textcaps)
- TextCaps Validation (textcaps_val)
- TextCaps Test (textcaps_test)
- TextVQA (textvqa)
- TextVQA Validation (textvqa_val)
- TextVQA Test (textvqa_test)
- WebSRC (websrc)
- WebSRC Validation (websrc_val)
- WebSRC Test (websrc_test)
6. Mathematical Reasoning Tasks
- AIME (aime)
- DynaMath (dynamath)
- GSM8K (gsm8k)
- MathCanvas (mathcanvas)
- MathCanvas Algebra (mathcanvas_algebra)
- MathCanvas Analytic Geometry (mathcanvas_analytic_geometry)
- MathCanvas Calculus and Vector (mathcanvas_calculus_and_vector)
- MathCanvas Plane Geometry (mathcanvas_plane_geometry)
- MathCanvas Solid Geometry (mathcanvas_solid_geometry)
- MathCanvas Statistics (mathcanvas_statistics)
- MathCanvas Transformational Geometry (mathcanvas_transformational_geometry)
- MathCanvas Trigonometry (mathcanvas_trigonometry)
- MathKangaroo (mathkangaroo)
- MathVerse (mathverse)
- MathVerse Text Dominant (mathverse_testmini_text_dominant)
- MathVerse Text Only (mathverse_testmini_text_only)
- MathVerse Text Lite (mathverse_testmini_text_lite)
- MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
- MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
- MathVerse Vision Only (mathverse_testmini_vision_only)
- MathVision (mathvision)
- MathVision TestMini (mathvision_testmini)
- MathVision Test (mathvision_test)
- MathVision Reason TestMini (mathvision_reason_testmini)
- MathVision Reason Test (mathvision_reason_test)
- MathVista (mathvista)
- MathVista Validation (mathvista_testmini)
- MathVista Test (mathvista_test)
- OpenAI Math (openai_math)
- SciBench (scibench)
- WeMath (wemath)
7. Spatial & Grounding Tasks
Referring Expression Comprehension
- Ferret (ferret)
- OSWorld-Verified / OSWorld-G (osworld_g)
- RefCOCO (refcoco)
- refcoco_seg_test, refcoco_seg_val
- refcoco_seg_testA, refcoco_seg_testB
- refcoco_bbox_test, refcoco_bbox_val
- refcoco_bbox_testA, refcoco_bbox_testB
- RefCOCO+ (refcoco+)
- refcoco+_seg_val, refcoco+_seg_testA, refcoco+_seg_testB
- refcoco+_bbox_val, refcoco+_bbox_testA, refcoco+_bbox_testB
- RefCOCOg (refcocog)
- refcocog_seg_test, refcocog_seg_val
- refcocog_bbox_test, refcocog_bbox_val
- RefSpatial (refspatial)
Spatial Reasoning
- CS-Bench (csbench)
- EmbSpatial (embspatial)
- ERQA (erqa)
- OmniSpatial (omnispatial)
- Point-Bench (pointbench)
- Where2Place (where2place)
8. Text-only Language Tasks
- ARC (arc)
- GPQA (gpqa)
- GSM8K (gsm8k)
- HellaSwag (hellaswag)
- IFEval (ifeval)
- K12 (k12)
- LogicVista (logicvista)
- MedQA (medqa)
- MMLU (mmlu)
- MMLU_Pro (mmlu_pro)
- OpenHermes (openhermes)
- Super GPQA (super_gpqa)
9. Multimodal Evaluation & Meta-benchmarks
- Capability (capability)
- EMMA (emma)
- MindCube (mindcube)
- Mix Evals (mix_evals)
- Multimodal RewardBench (multimodal_rewardbench)
- Omni-Bench (omni_bench)
- PhyX (phyx) - Physics grounded reasoning
- UEval (ueval)
- VL-RewardBench (vl_rewardbench)
10. Supported Models
Chat Template Models (Recommended)
| Model Name | Class | Description |
|---|---|---|
| qwen2_5_vl | Qwen2_5_VL | Qwen2.5-VL vision-language model |
| qwen3_vl | Qwen3_VL | Qwen3-VL vision-language model |
| llava_hf | LlavaHf | LLaVA via Hugging Face |
| llava_onevision1_5 | Llava_OneVision1_5 | LLaVA-OneVision 1.5 |
| thyme | Thyme | Thyme multimodal model |
| longvila | LongVila | Long video understanding model |
| bagel_lmms_engine | BagelLmmsEngine | Bagel LMMS engine |
| vllm | VLLM | vLLM backend |
| vllm_generate | VLLMGenerate | vLLM generation mode |
| sglang | Sglang | SGLang serving backend |
| huggingface | Huggingface | Generic HuggingFace models |
| openai | OpenAICompatible | OpenAI-compatible APIs (aliases: openai_compatible, openai_compatible_chat) |
| async_openai | AsyncOpenAIChat | Async OpenAI chat (alias: async_openai_compatible_chat) |
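A chat-template model is selected with `--model`; per-model options such as the checkpoint are typically passed through `--model_args` as comma-separated `key=value` pairs. The `pretrained=` key and the checkpoint path below are illustrative assumptions; check your installed version for the exact keys each model class accepts:

```shell
# Sketch: evaluate a chat-template model on a single image benchmark.
python -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks mmstar \
    --batch_size 1
```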
Simple/Legacy Models
| Model Name | Class | Modality |
|---|---|---|
| aero | Aero | Image, Audio |
| aria | Aria | Image, Video |
| auroracap | AuroraCap | Image, Video captioning |
| bagel | Bagel | Image |
| batch_gpt4 | BatchGPT4 | API |
| claude | Claude | Image, Video |
| cogvlm2 | CogVLM2 | Image |
| egogpt | EgoGPT | Video |
| from_log | FromLog | Utility |
| fuyu | Fuyu | Image |
| gemini_api | GeminiAPI | Image, Audio |
| gemma3 | Gemma3 | Image |
| gpt4o_audio | GPT4OAudio | Audio, Vision API |
| gpt4v | GPT4V | Image, Video API |
| idefics2 | Idefics2 | Image |
| instructblip | InstructBLIP | Image |
| internvideo2 | InternVideo2 | Video |
| internvideo2_5 | InternVideo2_5 | Video |
| internvl | InternVLChat | Image |
| internvl2 | InternVL2 | Image |
| llama_vid | LLaMAVid | Video |
| llama_vision | LlamaVision | Image |
| llava | Llava | Image |
| llava_onevision | Llava_OneVision | Image, Video |
| llava_onevision_moviechat | Llava_OneVision_MovieChat | Long Video |
| llava_sglang | LlavaSglang | Image (SGLang) |
| llava_vid | LlavaVid | Video |
| longva | LongVA | Long Video |
| mantis | Mantis | Multi-image |
| minicpm_v | MiniCPM_V | Image |
| minimonkey | MiniMonkey | Image |
| moviechat | MovieChat | Long Video |
| mplug_owl_video | mplug_Owl | Video |
| ola | Ola | Multimodal |
| oryx | Oryx | Multimodal |
| phi3v | Phi3v | Image |
| phi4_multimodal | Phi4 | Multimodal |
| plm | PerceptionLM | Image |
| qwen_vl | Qwen_VL | Image |
| qwen_vl_api | Qwen_VL_API | API |
| qwen2_5_omni | Qwen2_5_Omni | Image, Video, Audio |
| qwen2_audio | Qwen2_Audio | Audio |
| qwen2_vl | Qwen2_VL | Image, Video |
| reka | Reka | Multimodal API |
| ross | Ross | Multimodal |
| slime | Slime | Multimodal |
| srt_api | SRT_API | API |
| tinyllava | TinyLlava | Image |
| videoChatGPT | VideoChatGPT | Video |
| video_llava | VideoLLaVA | Video |
| videochat2 | VideoChat2 | Video |
| videochat_flash | VideoChat_Flash | Video |
| videollama3 | VideoLLaMA3 | Video |
| vila | VILA | Image, Video |
| vita | VITA | Multimodal |
| vora | VoRA | Multimodal |
| whisper | Whisper | Audio |
| whisper_vllm | WhisperVllm | Audio |
| xcomposer2_4KHD | XComposer2_4KHD | High-resolution Image |
| xcomposer2d5 | XComposer2D5 | Image |
Modality Support Summary
| Modality | Model Count |
|---|---|
| Image | 50+ |
| Video | 20+ |
| Audio | 5+ |
| Multimodal (2+ modalities) | 15+ |
| API-based | 8+ |
| Total Unique Models | 70+ |