Multimodal Benchmarking
GuideLLM provides robust support for benchmarking multimodal models, allowing evaluation of performance across vision, audio, and video tasks. This section contains guides for setting up and running benchmarks for different modalities using OpenAI-compatible endpoints, such as those provided by vLLM.
Prerequisites
To run multimodal benchmarks, you must install GuideLLM with the appropriate extras:
# For all multimodal features
pip install guidellm[vision,audio]
# For specific modalities
pip install guidellm[vision] # Images and Video
pip install guidellm[audio] # Audio
Ensure you have a running inference server and model compatible with the OpenAI API that supports the specific modality you intend to test. Refer to the individual guides below for instructions on benchmarking each modality.
Available Guides
- Images
Benchmark Vision-Language Models (VLMs) with image inputs using the Chat Completions API. Covers visual question answering and image captioning.
- Video
Evaluate video understanding models by processing video inputs. Includes configuration for video request formatting and encoding options.
- Audio
Benchmark audio transcription (ASR), translation, and audio-native chat models.