lmms-eval

Model Guide

New Model Guide

To evaluate a model with lmms_eval, you implement a wrapper class that subclasses lmms_eval.api.model.lmms. This guide walks through the full process.

Architecture Overview

    ╭──────────────╮                              ╭─────────────╮
    │  Model Dev   │                              │  Task Dev   │
    ╰──────┬───────╯                              ╰──────┬──────╯
           │                                             │
           ▼                                             ▼
 ┌─────────────────────┐                    ┌─────────────────────┐
 │                     │                    │                     │
 │  Implement lmms     │                    │  Create task YAML   │
 │  wrapper            │                    │  + utils.py         │
 │                     │                    │                     │
 │  Core methods:      │                    │  Preferred:         │
 │  · generate_until   │                    │  · doc_to_messages  │
 │  · loglikelihood    │                    │                     │
 │                     │                    │  Legacy:            │
 │  Register           │                    │  · doc_to_visual    │
 │  ModelManifest in   │                    │  · doc_to_text      │
 │  ModelRegistryV2    │                    │                     │
 └─────────┬───────────┘                    └──────────┬──────────┘
           │                                           │
           │           ╭ ─ ─ ─ ─ ─ ─ ─ ╮              │
           ╰──────────▶   Evaluator     ◀──────────────╯
                       │   contract     │
                        ╰ ─ ─ ─ ┬ ─ ─ ─╯


                  ┌───────────────────────┐
                  │                       │
                  │  Unified Instance     │
                  │  requests             │
                  │                       │
                  │  Model inference      │
                  │                       │
                  │  process_results      │
                  │                       │
                  │  metrics aggregation  │
                  │                       │
                  └───────────────────────┘

Model Dev implements the left side; Task Dev implements the right side. The evaluator runtime wires them together - your model never needs to know which task is calling it, and vice versa.

Model Types

TypeLocationInput methodRecommendation
Chatmodels/chat/doc_to_messages - structured messages with roles and content typesUse this
Simple (legacy)models/simple/doc_to_visual + doc_to_text - plain text with \<image\> placeholdersLegacy only

Setup

git clone https://github.com/<YOUR-USERNAME>/lmms-eval.git
cd lmms-eval
git checkout -b <model-type>
pip install -e .

# Create your model file
touch lmms_eval/models/chat/<my_model>.py     # recommended
touch lmms_eval/models/simple/<my_model>.py   # legacy

Reference implementations: lmms_eval/models/chat/qwen2_5_vl.py (chat) and lmms_eval/models/simple/instructblip.py (simple).

Core Methods

All models must subclass lmms_eval.api.model.lmms and implement two methods. Each receives a list of Instance objects (defined in lmms_eval.api.instance) whose .args carry the request payload.

generate_until

Open-ended generation. The model produces text given an input prompt + media.

Instance.args for chat models (5 elements):

ElementTypeDescription
doc_to_messagesCallableFunction that converts a doc into structured ChatMessages
gen_kwargsdictGeneration config: max_new_tokens, temperature, until, etc.
doc_idintIndex into the dataset split
taskstrTask name (used to look up the dataset via self.task_dict)
splitstrDataset split name

Instance.args for simple models (6 elements):

ElementTypeDescription
contextsstrFormatted question text (may contain \<image\> tokens)
gen_kwargsdictGeneration config
doc_to_visualCallableFunction that returns a list of media (PIL images, video paths, etc.)
doc_idintIndex into the dataset split
taskstrTask name
splitstrDataset split name

Returns list[str] - one generated string per request.

loglikelihood

Scoring for multiple-choice tasks. The model computes the log-probability of a target continuation given a context.

Instance.args (6 elements):

ElementTypeDescription
contextsstrFormatted question text
doc_to_targetCallableFunction that extracts the answer continuation from the doc
doc_to_visualCallableFunction that returns media
doc_idintIndex into the dataset split
taskstrTask name
splitstrDataset split name

Returns list[tuple[float, bool]] - (log_prob, is_greedy) per request, where is_greedy is True if the target would be produced by greedy decoding.

Registration

Register your model so lmms_eval can find it via --model \<name\>.

from lmms_eval.api.registry import register_model

@register_model("my_model")
class MyModel(lmms):
    is_simple = False  # chat model (recommended)
    # is_simple = True  # simple model (legacy, default)

Then add the entry in lmms_eval/models/__init__.py:

# Recommended (ModelRegistryV2 manifest)
from lmms_eval.models.registry_v2 import ModelManifest

MODEL_REGISTRY_V2.register_manifest(
    ModelManifest(
        model_id="my_model",
        chat_class_path="lmms_eval.models.chat.my_model.MyModel",
    )
)

# Legacy (still supported)
AVAILABLE_CHAT_TEMPLATE_MODELS["my_model"] = "MyModel"

For external plugin packages, prefer Python entry-points (lmms_eval.models) over LMMS_EVAL_PLUGINS.

Complete Example (Chat Model)

from lmms_eval.api.registry import register_model
from lmms_eval.api.model import lmms
from lmms_eval.api.instance import Instance
from lmms_eval.protocol import ChatMessages
import torch


@register_model("my_image_model")
class MyImageModel(lmms):
    is_simple = False

    def __init__(self, pretrained: str, device: str = "cuda", **kwargs):
        super().__init__()
        self.device = device
        self.model = load_your_model(pretrained)
        self.processor = load_your_processor(pretrained)

    def generate_until(self, requests: list[Instance]) -> list[str]:
        results = []
        for request in requests:
            doc_to_messages, gen_kwargs, doc_id, task, split = request.args

            # Build structured messages from the doc
            doc = self.task_dict[task][split][doc_id]
            raw_messages = doc_to_messages(doc)
            messages = ChatMessages(messages=raw_messages)

            # Extract media and format prompt
            images, videos, audios = messages.extract_media()
            hf_messages = messages.to_hf_messages()
            text = self.processor.apply_chat_template(hf_messages)

            # Run inference
            inputs = self.processor(
                text=text, images=images, return_tensors="pt"
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=gen_kwargs.get("max_new_tokens", 128),
                    temperature=gen_kwargs.get("temperature", 0.0),
                    do_sample=gen_kwargs.get("do_sample", False),
                )

            response = self.processor.decode(
                outputs[0], skip_special_tokens=True
            )
            results.append(response)
        return results

    def loglikelihood(
        self, requests: list[Instance]
    ) -> list[tuple[float, bool]]:
        results = []
        for request in requests:
            contexts, doc_to_target, doc_to_visual, doc_id, task, split = (
                request.args
            )
            # Compute log-probability of the target continuation
            # given the context + visual inputs.
            # ...
        return results

For video and audio models the pattern is identical - the only difference is which media you extract from messages.extract_media(). See lmms_eval/models/chat/qwen2_5_vl.py for a production-quality reference.

Key Notes

  • Implement both generate_until and loglikelihood if your model supports generation and multiple-choice tasks
  • Handle different modalities (image, video, audio) via the ChatMessages protocol
  • Follow existing implementations in lmms_eval/models/chat/ for patterns around batching, device management, and error handling