Skip to content

Embeddings Benchmarking Guide

GuideLLM supports benchmarking OpenAI-compatible embeddings endpoints to measure performance characteristics like throughput, latency, and concurrency.

Overview

Embeddings models convert text into dense vector representations used for semantic search, retrieval, and similarity tasks. Unlike generative models that produce text output, embeddings models:

Process input text and return vector embeddings
Do not support streaming (single response per request)
Track only input tokens (no output tokens)
Measure request latency and throughput

Quick Start

# Start vLLM server with an embeddings model
vllm serve BAAI/bge-small-en-v1.5 --port 8000

# Run benchmark
guidellm run \
  --backend kind=openai_http,target=http://localhost:8000/v1,model=BAAI/bge-small-en-v1.5,request_format=/v1/embeddings \
  --data kind=synthetic_text,prompt_tokens=128 \
  --constraint kind=max_requests,count=100

Key Differences from Generative Benchmarks

Feature	Embeddings	Generative
Output	Vector embeddings	Generated text
Streaming	No	Yes
Output tokens	Not applicable	Variable
TTFT	Not applicable	Measured
Token latency	Not applicable	Measured
Primary metrics	Latency, throughput	TTFT, ITL, throughput

See Also

Benchmark Profiles - Detailed explanation of all profile types
Datasets Guide - Creating and using custom datasets
Metrics Guide - Understanding performance metrics