New Model Guide
To evaluate a model with lmms_eval, you implement a wrapper class that subclasses lmms_eval.api.model.lmms. This guide walks through the full process.
Architecture Overview
╭──────────────╮ ╭─────────────╮
│ Model Dev │ │ Task Dev │
╰──────┬───────╯ ╰──────┬──────╯
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ │ │ │
│ Implement lmms │ │ Create task YAML │
│ wrapper │ │ + utils.py │
│ │ │ │
│ Core methods: │ │ Preferred: │
│ · generate_until │ │ · doc_to_messages │
│ · loglikelihood │ │ │
│ │ │ Legacy: │
│ Register │ │ · doc_to_visual │
│ ModelManifest in │ │ · doc_to_text │
│ ModelRegistryV2 │ │ │
└─────────┬───────────┘ └──────────┬──────────┘
│ │
│ ╭ ─ ─ ─ ─ ─ ─ ─ ╮ │
╰──────────▶ Evaluator ◀──────────────╯
│ contract │
╰ ─ ─ ─ ┬ ─ ─ ─╯
│
▼
┌───────────────────────┐
│ │
│ Unified Instance │
│ requests │
│ │
│ Model inference │
│ │
│ process_results │
│ │
│ metrics aggregation │
│ │
└───────────────────────┘

Model Dev implements the left side; Task Dev implements the right side. The evaluator runtime wires them together - your model never needs to know which task is calling it, and vice versa.
Model Types
| Type | Location | Input method | Recommendation |
|---|---|---|---|
| Chat | models/chat/ | doc_to_messages - structured messages with roles and content types | Use this |
| Simple (legacy) | models/simple/ | doc_to_visual + doc_to_text - plain text with \<image\> placeholders | Legacy only |
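To make the two input methods concrete, here is an illustrative comparison for one hypothetical doc. The exact ChatMessages content schema may differ from this sketch; see lmms_eval.protocol.ChatMessages for the canonical types:

# Hypothetical dataset row, for illustration only
doc = {"question": "What is shown?", "image": pil_image}

# Chat path: doc_to_messages(doc) yields role-tagged, typed content
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": pil_image},
        {"type": "text", "text": "What is shown?"},
    ]}
]

# Simple path: media and text arrive separately
visuals = doc_to_visual(doc)  # -> [pil_image]
context = doc_to_text(doc)    # -> "<image>\nWhat is shown?"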
Setup
git clone https://github.com/<YOUR-USERNAME>/lmms-eval.git
cd lmms-eval
git checkout -b <model-type>
pip install -e .
# Create your model file
touch lmms_eval/models/chat/<my_model>.py # recommended
touch lmms_eval/models/simple/<my_model>.py # legacy

Reference implementations: lmms_eval/models/chat/qwen2_5_vl.py (chat) and lmms_eval/models/simple/instructblip.py (simple).
Core Methods
All models must subclass lmms_eval.api.model.lmms and implement two methods. Each receives a list of Instance objects (defined in lmms_eval.api.instance) whose .args carry the request payload.
generate_until
Open-ended generation. The model produces text given an input prompt + media.
Instance.args for chat models (5 elements):
| Element | Type | Description |
|---|---|---|
| doc_to_messages | Callable | Function that converts a doc into structured ChatMessages |
| gen_kwargs | dict | Generation config: max_new_tokens, temperature, until, etc. |
| doc_id | int | Index into the dataset split |
| task | str | Task name (used to look up the dataset via self.task_dict) |
| split | str | Dataset split name |
Instance.args for simple models (6 elements):
| Element | Type | Description |
|---|---|---|
| contexts | str | Formatted question text (may contain \<image\> tokens) |
| gen_kwargs | dict | Generation config |
| doc_to_visual | Callable | Function that returns a list of media (PIL images, video paths, etc.) |
| doc_id | int | Index into the dataset split |
| task | str | Task name |
| split | str | Dataset split name |
Returns list[str] - one generated string per request.
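A minimal sketch of the simple (legacy) unpacking follows, since the complete example later in this guide covers only the chat layout. run_inference is a placeholder for your actual model call, not a real helper:

def generate_until(self, requests: list[Instance]) -> list[str]:
    results = []
    for request in requests:
        # Simple-model layout: 6 elements, media arrive via doc_to_visual
        contexts, gen_kwargs, doc_to_visual, doc_id, task, split = request.args
        doc = self.task_dict[task][split][doc_id]
        visuals = doc_to_visual(doc)  # list of PIL images, video paths, etc.
        response = run_inference(contexts, visuals, **gen_kwargs)  # placeholder
        results.append(response)
    return results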
loglikelihood
Scoring for multiple-choice tasks. The model computes the log-probability of a target continuation given a context.
Instance.args (6 elements):
| Element | Type | Description |
|---|---|---|
| contexts | str | Formatted question text |
| doc_to_target | Callable | Function that extracts the answer continuation from the doc |
| doc_to_visual | Callable | Function that returns media |
| doc_id | int | Index into the dataset split |
| task | str | Task name |
| split | str | Dataset split name |
Returns list[tuple[float, bool]] - (log_prob, is_greedy) per request, where is_greedy is True if the target would be produced by greedy decoding.
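How you obtain this pair is model-specific; below is a minimal sketch for an HF-style causal LM. ctx_ids, tgt_ids, and model are hypothetical (tokenized context/target tensors and your loaded model); a multimodal model would additionally feed its media features:

import torch
import torch.nn.functional as F

# ctx_ids / tgt_ids: hypothetical 1-D LongTensors of context and target tokens
input_ids = torch.cat([ctx_ids, tgt_ids], dim=-1).unsqueeze(0)
with torch.no_grad():
    logits = model(input_ids).logits              # (1, seq_len, vocab)
# Logits at position i predict token i + 1, so shift by one
tgt_logits = logits[0, ctx_ids.shape[-1] - 1 : -1, :]
log_probs = F.log_softmax(tgt_logits, dim=-1)
token_lps = log_probs.gather(-1, tgt_ids.unsqueeze(-1)).squeeze(-1)
log_prob = token_lps.sum().item()
# is_greedy: would greedy decoding reproduce the target exactly?
is_greedy = bool((tgt_logits.argmax(dim=-1) == tgt_ids).all())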
Registration
Register your model so lmms_eval can find it via --model \<name\>.
from lmms_eval.api.registry import register_model
@register_model("my_model")
class MyModel(lmms):
is_simple = False # chat model (recommended)
    # is_simple = True  # simple model (legacy, default)

Then add the entry in lmms_eval/models/__init__.py:
# Recommended (ModelRegistryV2 manifest)
from lmms_eval.models.registry_v2 import MODEL_REGISTRY_V2, ModelManifest
MODEL_REGISTRY_V2.register_manifest(
ModelManifest(
model_id="my_model",
chat_class_path="lmms_eval.models.chat.my_model.MyModel",
)
)
# Legacy (still supported)
AVAILABLE_CHAT_TEMPLATE_MODELS["my_model"] = "MyModel"

For external plugin packages, prefer Python entry-points (lmms_eval.models) over LMMS_EVAL_PLUGINS.
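Once registered, a quick smoke test from the CLI confirms the wiring. The flags shown follow the harness conventions, and mme is just an example task; check lmms_eval --help for the exact set:

python -m lmms_eval --model my_model \
    --model_args pretrained=<checkpoint> \
    --tasks mme --batch_size 1 --limit 8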
Complete Example (Chat Model)
from lmms_eval.api.registry import register_model
from lmms_eval.api.model import lmms
from lmms_eval.api.instance import Instance
from lmms_eval.protocol import ChatMessages
import torch
@register_model("my_image_model")
class MyImageModel(lmms):
is_simple = False
def __init__(self, pretrained: str, device: str = "cuda", **kwargs):
super().__init__()
self.device = device
self.model = load_your_model(pretrained)
self.processor = load_your_processor(pretrained)
def generate_until(self, requests: list[Instance]) -> list[str]:
results = []
for request in requests:
doc_to_messages, gen_kwargs, doc_id, task, split = request.args
# Build structured messages from the doc
doc = self.task_dict[task][split][doc_id]
raw_messages = doc_to_messages(doc)
messages = ChatMessages(messages=raw_messages)
# Extract media and format prompt
images, videos, audios = messages.extract_media()
hf_messages = messages.to_hf_messages()
text = self.processor.apply_chat_template(hf_messages)
# Run inference
inputs = self.processor(
text=text, images=images, return_tensors="pt"
).to(self.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=gen_kwargs.get("max_new_tokens", 128),
temperature=gen_kwargs.get("temperature", 0.0),
do_sample=gen_kwargs.get("do_sample", False),
)
            # Note: outputs[0] includes the prompt tokens; slice them off
            # (e.g. outputs[0][inputs["input_ids"].shape[1]:]) to keep only the completion
            response = self.processor.decode(
outputs[0], skip_special_tokens=True
)
results.append(response)
return results
def loglikelihood(
self, requests: list[Instance]
) -> list[tuple[float, bool]]:
results = []
for request in requests:
contexts, doc_to_target, doc_to_visual, doc_id, task, split = (
request.args
)
# Compute log-probability of the target continuation
# given the context + visual inputs.
# ...
        return results

For video and audio models the pattern is identical - the only difference is which media you extract from messages.extract_media(). See lmms_eval/models/chat/qwen2_5_vl.py for a production-quality reference.
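For example, a video model would consume the second element of the extract_media() tuple. The videos keyword below is an assumption; the exact argument name depends on your processor's API:

images, videos, audios = messages.extract_media()
inputs = self.processor(text=text, videos=videos, return_tensors="pt")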
Key Notes
- Implement both generate_until and loglikelihood if your model supports generation and multiple-choice tasks
- Handle different modalities (image, video, audio) via the ChatMessages protocol
- Follow existing implementations in lmms_eval/models/chat/ for patterns around batching, device management, and error handling