# Throughput Metrics
This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.
## Overview
LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.
## Metrics Explained

### Core Timing Metrics
- E2E (End-to-End Latency): Total time from request submission to response completion (in seconds)
- TTFT (Time to First Token): Time from request submission until the first token is generated (in seconds)
- TPOT (Time Per Output Token): Average time to generate each output token after the first (in seconds)
- Speed (Inference Speed): Token generation rate calculated as 1/TPOT (tokens per second)
- Output Tokens: Number of tokens generated in the response
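These quantities follow directly from three timestamps and a token count. A minimal sketch (illustrative only, not LMMS-Eval's actual implementation):

```python
def compute_core_metrics(start: float, first_token: float, end: float,
                         output_tokens: int) -> dict:
    """Derive E2E, TTFT, TPOT, and Speed from wall-clock timestamps (seconds)."""
    e2e = end - start
    ttft = first_token - start
    # TPOT averages over the tokens generated after the first one.
    tpot = (e2e - ttft) / (output_tokens - 1) if output_tokens > 1 else 0.0
    speed = 1.0 / tpot if tpot > 0 else 0.0
    return {"E2E": e2e, "TTFT": ttft, "TPOT": tpot,
            "Speed": speed, "Output tokens": output_tokens}
```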
### Batch Metrics

For models that process multiple requests in batches:
- Batch Summary: Aggregated metrics across all outputs in a batch
- Total Time: Total batch processing time
- Total Tokens: Sum of all output tokens in the batch
- Avg Speed: Average throughput across the entire batch (tokens/s)
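As a hedged sketch, the batch summary can be derived from the per-output metrics above (names here are illustrative, not LMMS-Eval's internals):

```python
def summarize_batch(per_output: list[dict], total_time: float) -> dict:
    """Aggregate per-output metrics into the batch summary fields."""
    total_tokens = sum(m["Output tokens"] for m in per_output)
    # Avg speed spans the whole batch, so it can exceed any single
    # output's speed when outputs are generated concurrently.
    avg_speed = total_tokens / total_time if total_time > 0 else 0.0
    return {"Total time": total_time, "Total tokens": total_tokens,
            "Avg speed": avg_speed}
```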
## Log Format Examples

### Individual Output Metrics

```text
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
```

### Batch Summary Metrics

```text
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
```

### Single Request Metrics (Non-batched)

```text
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
```

## Backend Coverage
All chat backends listed below log throughput-oriented metrics (`total_gen_tokens`, `total_elapsed_time`, `avg_speed`):

- `vllm` (`/lmms_eval/models/chat/vllm.py`)
- `vllm_generate` (`/lmms_eval/models/chat/vllm_generate.py`)
- `sglang` (`/lmms_eval/models/chat/sglang.py`)
- `openai` (`/lmms_eval/models/chat/openai.py`)
- `async_openai` (`/lmms_eval/models/chat/async_openai.py`)
- `huggingface` (`/lmms_eval/models/chat/huggingface.py`)
- `qwen2_5_vl` (`/lmms_eval/models/chat/qwen2_5_vl.py`)
- `qwen3_vl` (`/lmms_eval/models/chat/qwen3_vl.py`)
- `llava_hf` (`/lmms_eval/models/chat/llava_hf.py`)
- `internvl_hf` (`/lmms_eval/models/chat/internvl_hf.py`)
- `llava_onevision1_5` (`/lmms_eval/models/chat/llava_onevision1_5.py`)
- `thyme` (`/lmms_eval/models/chat/thyme.py`)
TTFT/TPOT coverage is narrower:

- Native TTFT/TPOT in run summary: `vllm`, `vllm_generate`
- Throughput-only (no native TTFT/TPOT in summary): `sglang`, `openai`, `async_openai`, `huggingface`, `qwen2_5_vl`, `qwen3_vl`, `llava_hf`, `internvl_hf`, `llava_onevision1_5`, `thyme`
## Usage

Throughput metrics are logged automatically during evaluation; no additional configuration is required. To view the metrics:
- Command Line Output: Metrics appear in real-time during evaluation
- Log Files: Metrics are written to log files if logging is configured
- Log Level: Ensure logging level is set to INFO or lower to see metrics
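For example, assuming loguru is the active logging backend, the default sink can be reset so that INFO-level metric lines are not filtered out; a minimal sketch:

```python
import sys
from loguru import logger

# Reset the default sink to INFO so metric lines (logged at INFO)
# are not filtered out by a higher threshold such as WARNING.
logger.remove()
logger.add(sys.stderr, level="INFO")
```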
### Example Evaluation Command

```bash
python -m lmms_eval \
    --model sglang_runtime \
    --model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks mme \
    --batch_size 4 \
    --log_samples \
    --output_path ./results
```

## Metric Calculation Details
### TTFT Calculation

- Available from the model runtime: uses the actual first-token timing when the backend exposes it (currently the vLLM paths)
- Unavailable: backends without first-token timing expose throughput metrics only
### TPOT Calculation

- Native formula: `(E2E_latency - TTFT) / (output_tokens - 1)`, used when TTFT and token-level timings are available
- Throughput proxy: `total_elapsed_time / total_gen_tokens`, derivable from the summary metrics as a coarse decode-time estimate
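Both variants in sketch form (function names are illustrative):

```python
def tpot_native(e2e_latency: float, ttft: float, output_tokens: int) -> float:
    """Native TPOT: decode time averaged over tokens after the first."""
    if output_tokens <= 1:
        return 0.0
    return (e2e_latency - ttft) / (output_tokens - 1)


def tpot_proxy(total_elapsed_time: float, total_gen_tokens: int) -> float:
    """Coarse proxy without first-token timing; includes prefill time."""
    if total_gen_tokens <= 0:
        return 0.0
    return total_elapsed_time / total_gen_tokens
```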
### Speed Calculation

- Formula: `1 / TPOT` (when TPOT > 0)
- Edge cases: set to 0 for single-token responses or zero TPOT
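As a worked check against the individual-output log example above:

```python
tpot = 0.048  # seconds per output token, from the example log line
speed = 1 / tpot if tpot > 0 else 0.0
print(f"{speed:.1f} tokens/s")  # 20.8 tokens/s, matching the logged Speed
```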
## Performance Analysis

### Interpreting Metrics
- High TTFT: May indicate model loading, prompt processing, or scheduling delays
- High TPOT: Suggests slower token generation, possibly due to model size or hardware limitations
- Low Speed: Indicates throughput bottlenecks in token generation
- E2E vs TTFT+TPOT: Large differences may suggest batching overhead or system delays (see the consistency check sketched below)
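A rough consistency check (illustrative only): E2E should approximately decompose into TTFT plus decode time, so a large residual points to time spent outside token generation.

```python
def timing_residual(e2e: float, ttft: float, tpot: float,
                    output_tokens: int) -> float:
    """Time not explained by TTFT plus per-token decode.

    A large positive residual suggests batching, scheduling, or other
    system delays outside token generation itself.
    """
    decode_time = tpot * max(output_tokens - 1, 0)
    return e2e - (ttft + decode_time)
```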
### Optimization Insights
- Reduce TTFT: Optimize prompt processing, use model caching, improve scheduling
- Reduce TPOT: Use faster hardware, optimize model inference, adjust batch sizes
- Batch Efficiency: Compare individual vs batch metrics to assess batching benefits
## Troubleshooting

### Missing Metrics
- Ensure the model supports throughput logging (see the Backend Coverage list above)
- Check log level is set to INFO or lower
- Verify model implementation includes timing instrumentation
### Incomplete Metrics
- TTFT is backend-dependent and may be unavailable for non-vLLM paths
- Batch metrics average across multiple outputs, so individual request variance is not captured
- API-backed latency includes network overhead
## Implementation Notes

Throughput metrics are implemented across chat models using:
- wall-clock timing for batch/request elapsed time
- backend-specific metadata where available (for example, vLLM runtime metrics)
- structured logging via `log_metrics()` and aggregation via `summarize_logged_metrics()`
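Putting these together, a minimal sketch of the wall-clock timing pattern (the generate call and output shape are placeholders; see the model sources for the real `log_metrics()` / `summarize_logged_metrics()` interfaces):

```python
import time
from loguru import logger

def timed_generate(model, batch):
    """Illustrative wall-clock timing around a batch generation call."""
    start = time.perf_counter()
    outputs = model.generate(batch)  # placeholder generate call
    elapsed = time.perf_counter() - start

    total_tokens = sum(len(o.token_ids) for o in outputs)  # assumed output shape
    avg_speed = total_tokens / elapsed if elapsed > 0 else 0.0
    logger.info(
        f"Batch summary - Total time: {elapsed:.3f}s, "
        f"Total tokens: {total_tokens}, Avg speed: {avg_speed:.1f} tokens/s"
    )
    return outputs
```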