Audio Benchmarking
This guide demonstrates how to benchmark audio models for tasks like Automatic Speech Recognition (ASR) (audio/transcriptions), Translation (audio/translations), and Audio Chat (chat/completions).
Setup
First, ensure you have a running inference server and model compatible with the desired audio APIs. GuideLLM supports any OpenAI-compatible server that can handle audio inputs through chat/completions, audio/transcriptions, or audio/translations endpoints. For the benchmarking examples below, we’ll use vLLM serving a Whisper model for transcription and translation tasks and Ultravox for chat. Here are sample commands to start each of these servers:
# Whisper ASR/Translation
vllm serve openai/whisper-small
# Ultravox Audio Chat
vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
Next, install GuideLLM with audio support, either on the same instance or (recommended) on another machine that can reach your server:
Finally, ensure you have a dataset with supported audio files for benchmarking. GuideLLM can handle audio data from Hugging Face datasets, local files, URLs, etc. For the examples below, we’ll use the openslr/librispeech_asr dataset.
Processing Options
All of the standard arguments for benchmarking apply to audio tasks as well, such as --profile, profile rate parameters, and --constraint kind=max_requests,count=<n>, among others. There are a few additional options that help control audio-specific data handling and request formatting.
Data Loading
GuideLLM supports multiple methods for loading audio data. First, the overall data source must be deserializable by GuideLLM into a Hugging Face dataset. This includes local files, Hugging Face datasets, JSON files, etc.
Next, the desired audio column within the deserializable data source must be supported by GuideLLM’s audio data decoder/encoder. Supported formats include:
- Hugging Face Audio feature (preferred)
- Local file paths (e.g., .wav, .mp3, .flac)
- URLs pointing to audio files
- Base64-encoded audio data
- Numpy or PyTorch arrays with raw audio samples
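To make the last two formats concrete, the sketch below turns raw audio samples into a base64-encoded WAV string using only the Python standard library. It is an illustration of the data shapes involved, not GuideLLM's own decoder/encoder; the helper name samples_to_base64_wav and the 440 Hz test tone are assumptions made for this example.

```python
import base64
import io
import math
import struct
import wave


def samples_to_base64_wav(samples, sample_rate=16000):
    """Encode mono float samples in [-1.0, 1.0] as a base64 16-bit PCM WAV string."""
    # Clamp each sample and scale it to a signed 16-bit integer.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return base64.b64encode(buffer.getvalue()).decode("ascii")


# One second of a 440 Hz tone as stand-in audio (hypothetical test data).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
encoded = samples_to_base64_wav(tone)
print(len(encoded), "base64 characters")
```

A string like this could be stored in a dataset column as base64 audio, while the tone list itself corresponds to the raw-samples case, where the sample rate cannot be inferred and has to be supplied separately.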
Data Column Mapping
When specifying a dataset, you will generally want to map its audio column to GuideLLM’s expected audio_column so it knows which data to process as audio. If no mapping is specified, GuideLLM attempts to auto-detect an audio column based on commonly used names such as audio, speech, wav, etc.
To specify the mapping, use the --data-column-mapper argument with a JSON string that specifies an existing column name for audio_column. For example, if your dataset has an audio column named speech_data, you would use:
--data-column-mapper '{"kind":"generative_column_mapper","column_mappings": {"audio_column": "speech_data"}}'
If you are combining multiple datasets (e.g., for prompts and audio), prepend the column name with the dataset index (starting at 0) or the dataset alias followed by a dot. For example, if the audio column is in the second dataset (index 1):
--data-column-mapper '{"kind":"generative_column_mapper","column_mappings": {"audio_column": "1.speech_data"}}'
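Hand-writing JSON inside shell quotes is easy to get wrong. As a convenience, the following Python sketch generates the argument value from a plain dictionary; the helper name column_mapper_arg is hypothetical and not part of GuideLLM, and the second call reuses the index-prefixed column names from the Audio Chat example later in this guide.

```python
import json
import shlex


def column_mapper_arg(column_mappings):
    """Return a shell-quoted JSON value for --data-column-mapper."""
    payload = {
        "kind": "generative_column_mapper",
        "column_mappings": column_mappings,
    }
    # Compact separators keep the argument short; shlex.quote makes it shell-safe.
    return shlex.quote(json.dumps(payload, separators=(",", ":")))


# Single dataset: map the speech_data column to audio_column.
print("--data-column-mapper", column_mapper_arg({"audio_column": "speech_data"}))

# Two datasets: audio from index 1, text prompts from index 0.
print(
    "--data-column-mapper",
    column_mapper_arg({"audio_column": "1.audio", "text_column": "0.prompt"}),
)
```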
Request Formatting
Across the supported audio endpoints, a request formatter encodes audio data and formats the request payload. This uses reasonable defaults out of the box, but can be customized as needed. The following options are available for audio request formatting via the --request-formatter-kwargs argument, provided as a JSON string.
"encode_kwargs"
A dictionary of arguments passed to the audio encoder that controls how audio data is preprocessed before being included in the request.
Note on Nesting:
- For Chat Completions (chat_completions), audio arguments must be nested under an audio key within encode_kwargs.
- For Transcription/Translation (audio_transcriptions, audio_translations), arguments are provided at the top level of encode_kwargs.
Supported arguments include:
- "sample_rate": The sample rate of the input audio data. Only required if it cannot be inferred (e.g., for raw numpy/torch arrays).
- "encode_sample_rate": Target sample rate for the audio sent to the API (default: 16000 Hz).
- "audio_format": File format for the payload. Supported formats are "wav", "mp3", and "flac". If not specified, the format is auto-detected from the source audio codec. When detection is not possible (e.g., raw arrays), defaults to "wav".
- "bitrate": Bitrate for lossy formats like mp3 (default: "64k").
- "max_duration": If specified, audio longer than this duration (in seconds) will be truncated.
- "mono": Whether to convert audio to mono (default: True).
- "file_name": Optional file name to include in the request metadata (useful for endpoints that rely on filename extensions). Default is "audio.wav".
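To make the effect of "mono" and "max_duration" concrete, here is a small Python sketch of the general idea: average the channels, then keep only the first max_duration seconds of samples. It is an illustration only. GuideLLM's encoder may downmix or truncate differently, and the function names below are hypothetical.

```python
def downmix_to_mono(channels):
    """Average equally long per-channel sample lists into one mono list."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]


def truncate(samples, sample_rate, max_duration):
    """Keep at most max_duration seconds of audio."""
    max_samples = int(max_duration * sample_rate)
    return samples[:max_samples]


# Tiny stereo clip: 4 samples per channel at a toy sample rate of 2 Hz.
left = [0.25, 0.5, 0.75, 1.0]
right = [0.0, 0.0, 0.25, 0.5]

mono = downmix_to_mono([left, right])
clipped = truncate(mono, sample_rate=2, max_duration=1.5)
print(mono)     # [0.125, 0.25, 0.5, 0.75]
print(clipped)  # [0.125, 0.25, 0.5]
```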
Examples:
For Audio Transcription (flat structure), converting to 16kHz WAV:
--request-formatter-kwargs '{"encode_kwargs": {"audio_format": "wav", "encode_sample_rate": 16000}}'
For Audio Chat (nested structure), truncating to 30 seconds:
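Based on the nesting note above, the two shapes differ only in whether the arguments sit under an audio key. The following Python sketch builds both values. The helper name formatter_kwargs_arg is hypothetical and not part of GuideLLM, and the nested form is inferred from the note rather than copied from a verified command, so check it against your GuideLLM version.

```python
import json
import shlex

# Request formats that expect audio arguments nested under "audio",
# per the nesting note in this guide.
NESTED_FORMATS = {"chat_completions"}


def formatter_kwargs_arg(request_format, audio_args):
    """Build a shell-quoted value for --request-formatter-kwargs."""
    if request_format in NESTED_FORMATS:
        encode_kwargs = {"audio": audio_args}  # chat: nest under "audio"
    else:
        encode_kwargs = audio_args  # transcription/translation: flat
    return shlex.quote(json.dumps({"encode_kwargs": encode_kwargs}))


flat = formatter_kwargs_arg(
    "audio_transcriptions", {"audio_format": "wav", "encode_sample_rate": 16000}
)
nested = formatter_kwargs_arg("chat_completions", {"max_duration": 30})

print("--request-formatter-kwargs", flat)
print("--request-formatter-kwargs", nested)
```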
"extras"
A dictionary of extra arguments to include directly in the request, enabling direct control over endpoint-specific parameters, such as language for Whisper models. Within extras, you can specify where to include the extra arguments:
- "headers": Include in request headers
- "params" / "body": Include in request parameters or body (auto-detected based on endpoint)
For example, to specify French as the target language for an audio translation request:
"stream"
Turn streaming responses on or off (if supported by the backend) using a boolean value. By default, streaming is enabled. To disable it, pass stream=false in the backend configuration:
Expected Results
GuideLLM captures comprehensive metrics across the entire request lifecycle, stored in GenerativeRequestStats and aggregated into GenerativeMetrics. Results are displayed in the console and saved to local files for further analysis.
Output Files
- benchmarks.json: The complete hierarchical statistics object containing scheduler timings, request distributions, and detailed metric summaries for text and audio.
- benchmarks.csv: A row-per-request export of GenerativeRequestStats, useful for analyzing individual request performance, latency, and specific input/output configurations.
- benchmarks.html: A visual report summarizing performance.
Captured Metrics
In addition to standard performance metrics like Latency, Time to First Token (TTFT), and Inter-Token Latency (ITL), audio benchmarks track specific usage metrics across Input and Output:
- Audio Tokens: Number of audio tokens processed (if supported by the model).
- Audio Samples: Count of raw audio samples.
- Audio Seconds: Total duration of audio content in seconds.
- Audio Bytes: Size of the audio payload in bytes.
- Text Metrics: Standard counts for Tokens, Words, and Characters are also tracked for transcriptions or chat responses.
Statistical Analysis
For each metric above, GuideLLM calculates statistical distributions including:
- Values: Mean, Median, P95, P99, Min, Max.
- Rates: Throughput per second (e.g., audio_seconds/sec).
- Concurrency: Measures of concurrent active processing.
These use the StatusDistributionSummary structure to track Successful, Incomplete, and Errored requests separately.
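As a rough illustration of what these summaries mean, the Python sketch below computes value statistics and a per-second rate for a handful of per-request audio durations. It is not GuideLLM's implementation: the function names are hypothetical, the input numbers are made up, and the nearest-rank percentile used here is only one of several common definitions, so GuideLLM's reported percentiles may differ slightly.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(audio_seconds, wall_clock_seconds):
    """Summarize per-request audio durations over a benchmark of the given wall-clock length."""
    return {
        "mean": statistics.mean(audio_seconds),
        "median": statistics.median(audio_seconds),
        "p95": percentile(audio_seconds, 95),
        "min": min(audio_seconds),
        "max": max(audio_seconds),
        # Rate: total audio processed divided by elapsed benchmark time.
        "audio_seconds_per_sec": sum(audio_seconds) / wall_clock_seconds,
    }


# Made-up durations (seconds of audio) for five completed requests.
durations = [2.0, 4.0, 6.0, 8.0, 10.0]
summary = summarize(durations, wall_clock_seconds=0.5)
print(summary)
```

In this toy run, 30 seconds of audio processed in 0.5 seconds of wall-clock time gives a rate of 60 audio seconds per second.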
Examples
1. Audio Transcription (ASR)
This benchmark tests Automatic Speech Recognition (ASR) models, such as Whisper, for converting audio to text. Use the Whisper vLLM serving command above or a similar model that supports the audio transcription endpoint. For this example, we use only the audio data from the LibriSpeech dataset; however, a prompt can also be provided if desired, as shown in the Audio Chat example.
Command:
guidellm run \
--backend kind=openai_http,target=http://localhost:8000,request_format=/v1/audio/transcriptions \
--profile kind=synchronous \
--constraint kind=max_requests,count=20 \
--data '{"kind":"huggingface","source":"openslr/librispeech_asr","load_kwargs":{"name":"clean","split":"test"}}' \
--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"audio_column":"audio"}}'
Key Parameters
- --backend: Server URL and request_format=/v1/audio/transcriptions for ASR
- --profile kind=synchronous: Run requests sequentially
- --constraint kind=max_requests,count=20: Limits the benchmark to 20 total requests
- --data: HuggingFace dataset with load_kwargs selecting the "clean" config and "test" split
- --data-column-mapper: Maps the dataset's audio column to GuideLLM's audio_column
The above command benchmarks the audio/transcriptions endpoint on the target server using audio from the LibriSpeech dataset for ASR. It will result in an output similar to the following:
✔ OpenAIHTTPBackend backend validated with model openai/whisper-small
......
......
✔ Setup complete, starting benchmarks...
......
......
ℹ Audio Metrics Statistics (Completed Requests)
|=============|=======|========|========|========|=========|=========|==========|==========|======|=======|======|======|=========|==========|==========|==========|
| Benchmark | Input Tokens |||| Input Samples |||| Input Seconds |||| Input Bytes ||||
| Strategy | Per Request || Per Second || Per Request || Per Second || Per Request || Per Second || Per Request || Per Second ||
| | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean |
|-------------|-------|--------|--------|--------|---------|---------|----------|----------|------|-------|------|------|---------|----------|----------|----------|
| synchronous | 642.0 | 1688.0 | 7565.1 | 7329.1 | 16000.0 | 16000.0 | 129722.1 | 141848.5 | 6.4 | 16.8 | 75.3 | 72.9 | 52172.0 | 135692.0 | 610195.0 | 592749.4 |
|=============|=======|========|========|========|=========|=========|==========|==========|======|=======|======|======|=========|==========|==========|==========|
......
......
✔ Benchmarking complete, generated 1 benchmark(s)
… json : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.json
… csv : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.csv
… html : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.html
2. Audio Translation
This benchmark tests audio translation models like Whisper at converting audio in one language to text in another. Use the Whisper vLLM serving command above or a similar model that supports the audio translation endpoint. For this example, we use only the audio data from the LibriSpeech dataset; however, a prompt can also be provided if desired, as shown in the Audio Chat example.
Command:
guidellm run \
--backend '{"kind":"openai_http","target":"http://localhost:8000","request_format":"/v1/audio/translations","extras":{"body":{"language":"fr"}}}' \
--profile kind=synchronous \
--constraint kind=max_requests,count=20 \
--data '{"kind":"huggingface","source":"openslr/librispeech_asr","load_kwargs":{"name":"clean","split":"test"}}' \
--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"audio_column":"audio"}}'
Key Parameters
- --backend: Server URL, translation endpoint, and target language via extras.body
- --profile kind=synchronous: Sequential execution mode
- --constraint kind=max_requests,count=20: Limits the test to 20 requests
- --data: HuggingFace dataset with load_kwargs for the "clean" config and "test" split. See datasets.load_dataset for the full list of valid options.
- --data-column-mapper: Identifies the audio column for audio processing
The above command benchmarks the audio/translations endpoint on the target server using audio from the LibriSpeech dataset and requesting translations to French. It will result in an output similar to the following:
✔ OpenAIHTTPBackend backend validated with model openai/whisper-small
......
......
✔ Setup complete, starting benchmarks...
......
......
ℹ Audio Metrics Statistics (Completed Requests)
|=============|=======|========|========|========|=========|=========|==========|==========|======|=======|======|======|=========|==========|==========|==========|
| Benchmark | Input Tokens |||| Input Samples |||| Input Seconds |||| Input Bytes ||||
| Strategy | Per Request || Per Second || Per Request || Per Second || Per Request || Per Second || Per Request || Per Second ||
| | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean |
|-------------|-------|--------|--------|--------|---------|---------|----------|----------|------|-------|------|------|---------|----------|----------|----------|
| synchronous | 642.0 | 1688.0 | 7483.6 | 7563.5 | 16000.0 | 16000.0 | 133404.0 | 146385.0 | 6.4 | 16.8 | 74.5 | 75.2 | 52172.0 | 135692.0 | 603620.5 | 611706.4 |
|=============|=======|========|========|========|=========|=========|==========|==========|======|=======|======|======|=========|==========|==========|==========|
......
......
✔ Benchmarking complete, generated 1 benchmark(s)
… json : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.json
… csv : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.csv
… html : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.html
3. Audio Chat Completions
This benchmark tests models that can handle audio inputs in a conversational format, such as Ultravox. Use the Ultravox vLLM serving command above, or a similar model that supports audio formats in chat-completion pathways. In addition to the LibriSpeech dataset, the following example adds a synthetic dataset for text prompts. Replace the datasets and column mappings as needed for your use case.
Command:
guidellm run \
--backend kind=openai_http,target=http://localhost:8000,request_format=/v1/chat/completions \
--profile kind=synchronous \
--constraint kind=max_requests,count=20 \
--data kind=synthetic_text,prompt_tokens=256,output_tokens=128 \
--data '{"kind":"huggingface","source":"openslr/librispeech_asr","load_kwargs":{"name":"clean","split":"test"}}' \
--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"audio_column":"1.audio","text_column":"0.prompt"}}'
Key Parameters
- --backend: Server URL and chat completions endpoint for multimodal inputs
- --profile kind=synchronous: Sequential execution
- --constraint kind=max_requests,count=20: Limits to 20 requests
- --data: Specified twice: first for synthetic prompts (kind=synthetic_text), second for real audio from openslr/librispeech_asr (kind=huggingface with load_kwargs for dataset config)
- --data-column-mapper: Maps audio from dataset index 1 ("1.audio", LibriSpeech) and text from dataset index 0 ("0.prompt", synthetic prompts) into each request.
The above command benchmarks the chat/completions endpoint on the target server using the prompt text from the synthetic dataset and audio from the LibriSpeech dataset. It will result in an output similar to the following:
✔ OpenAIHTTPBackend backend validated with model fixie-ai/ultravox-v0_5-llama-3_2-1b
......
......
✔ Setup complete, starting benchmarks...
......
......
ℹ Audio Metrics Statistics (Completed Requests)
|=============|=======|========|========|========|=========|=========|==========|==========|======|=======|======|======|=========|==========|==========|==========|
| Benchmark | Input Tokens |||| Input Samples |||| Input Seconds |||| Input Bytes ||||
| Strategy | Per Request || Per Second || Per Request || Per Second || Per Request || Per Second || Per Request || Per Second ||
| | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean | Mdn | p95 | Mdn | Mean |
|-------------|-------|--------|--------|--------|---------|---------|----------|----------|------|-------|------|------|---------|----------|----------|----------|
| synchronous | 642.0 | 1688.0 | 7565.1 | 7329.1 | 16000.0 | 16000.0 | 129722.1 | 141848.5 | 6.4 | 16.8 | 75.3 | 72.9 | 52172.0 | 135692.0 | 610195.0 | 592749.4 |
|=============|=======|========|========|========|=========|=========|==========|==========|======|=======|======|======|=========|==========|==========|==========|
ℹ GuideLLM Request Metrics Statistics (Completed Requests)
|=============|=======|=======|=======|=======|======|=====|=======|=======|=======|=======|======|=====|=======|=======|=======|=======|======|=====|
| Benchmark | Request Latency (ms) ||||| Output Tokens / Sec ||||| Time to First Token (ms) ||||| Time per Output Token (ms) |||||
| Strategy | Mdn | Mean | p50 | p90 | p95 | p99 | Mdn | Mean | p50 | p90 | p95 | p99 | Mdn | Mean | p50 | p90 | p95 | p99 | Mdn | Mean | p50 | p90 | p95 | p99 |
|-------------|-------|-------|-------|-------|------|-----|-------|-------|-------|-------|------|-----|-------|-------|-------|-------|------|-----|-------|-------|-------|-------|------|-----|
| synchronous | 125.4 | 130.2 | 125.4 | 145.1 | 150.2| 160.5| 45.2 | 44.8 | 45.2 | 42.1 | 41.5 | 40.2| 25.1 | 26.5 | 25.1 | 30.2 | 32.5 | 35.1| 22.1 | 22.3 | 22.1 | 23.7 | 24.1 | 24.8|
|=============|=======|=======|=======|=======|======|=====|=======|=======|=======|=======|======|=====|=======|=======|=======|=======|======|=====|
✔ Benchmarking complete, generated 1 benchmark(s)
… json : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.json
… csv : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.csv
… html : /Users/markkurtz/code/github/vllm-project/guidellm/benchmarks.html