vLLM Python Backend

The vLLM Python backend (vllm_python) runs inference in the same process as GuideLLM using vLLM's AsyncLLMEngine. No HTTP server is involved, which eliminates network overhead. This is useful for isolating performance bottlenecks or simplifying your benchmark setup. You do not pass a target; instead, you must set model in the backend configuration. vLLM then downloads and loads that model.

For all engine options and supported models, see vLLM's Engine Arguments and the vLLM documentation.

Installation

  • Official GuideLLM + vLLM image: Build and run the image that uses the vLLM base image (e.g. Containerfile.vllm). It is based on vllm/vllm-openai and installs GuideLLM on top, giving a known-good vLLM + GuideLLM stack with the hardware support provided by the base image.

Note: With this method, vLLM's dependency requirements take precedence over GuideLLM's. Because vLLM is the more complex project, this is the recommended configuration. However, it may pin an older Python or dependency version, which can lead to sub-optimal GuideLLM performance or behavior in some scenarios.

  • Existing vLLM installation: Install vLLM first for your environment (GPU/CPU, CUDA, etc.), then install GuideLLM in the same environment (e.g. pip install guidellm or with extras). You avoid a duplicate vLLM install and reuse your existing acceleration setup.

Note: Installing from the lockfile for the vLLM Python backend may not install the correct dependencies for hardware acceleration.

Basic example

Run a benchmark with the vLLM Python backend:

guidellm run \
  --backend kind=vllm_python,model=Qwen/Qwen3-0.6B \
  --data kind=synthetic_text,prompt_tokens=256,output_tokens=128 \
  --profile kind=constant,rate=3 \
  --constraint kind=max_duration,seconds=20

Engine behavior (device, memory, etc.) follows vLLM defaults unless you override it via vllm_config in the backend configuration. When running without a GPU (e.g. the GuideLLM + vLLM container without GPU access), the backend automatically uses the CPU device unless you set device in vllm_config. For engine configuration options, see vLLM's Engine Arguments.

Request format and backend options

  • request_format: Controls how chat prompts are built. Options: plain (no chat template; message content is concatenated as plain text), default-template (use the tokenizer's default chat template), or a file path / single-line template string per vLLM's supported options. The value is passed through to vLLM's chat template handling. For details, see vLLM's Chat templates documentation.

  • vllm_config: Passes backend-specific engine options to vLLM. Its value is a nested dict in the backend config that maps engine option names to values.

Using Engine Arguments in vllm_config: The Engine Arguments documentation describes options in CLI form (e.g. --gpu-memory-utilization, --max-model-len). For vllm_config you must use the Python parameter names instead: strip the leading -- and replace dashes with underscores (e.g. gpu_memory_utilization, max_model_len). The keys are the same as the field names on vLLM's EngineArgs and AsyncEngineArgs dataclasses; for the exact list of allowed keys and types, see the vLLM source: vllm/engine/arg_utils.py (search for class EngineArgs).
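The renaming rule above is mechanical, so it can be sketched as a small helper. This is only an illustration: cli_flag_to_config_key is a hypothetical name, not part of GuideLLM or vLLM, and the resulting key should still be checked against the EngineArgs fields.

```python
def cli_flag_to_config_key(flag: str) -> str:
    """Convert a CLI-style engine flag to a vllm_config key (hypothetical helper)."""
    if not flag.startswith("--"):
        raise ValueError(f"expected a flag starting with '--', got {flag!r}")
    # Strip the leading "--", then replace dashes with underscores.
    return flag[2:].replace("-", "_")


print(cli_flag_to_config_key("--gpu-memory-utilization"))  # gpu_memory_utilization
print(cli_flag_to_config_key("--max-model-len"))  # max_model_len
```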

Example — limit GPU memory use and context length:

--backend '{"kind":"vllm_python","model":"Qwen/Qwen3-0.6B","vllm_config":{"gpu_memory_utilization":0.8,"max_model_len":4096}}'
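Hand-writing nested JSON inside shell quotes is error-prone. One option is to generate the argument with Python's standard library. The sketch below reproduces the value shown above; backend_arg is a hypothetical helper name, and the keys are only those used in this example.

```python
import json
import shlex


def backend_arg(model: str, **vllm_config) -> str:
    """Return a shell-quoted JSON value for --backend (hypothetical helper)."""
    config = {"kind": "vllm_python", "model": model}
    if vllm_config:
        config["vllm_config"] = vllm_config
    # Compact separators keep the JSON free of spaces; shlex.quote makes
    # the result safe to paste into a POSIX shell command line.
    return shlex.quote(json.dumps(config, separators=(",", ":")))


print("--backend", backend_arg("Qwen/Qwen3-0.6B", gpu_memory_utilization=0.8, max_model_len=4096))
```

The printed text can be pasted into the command line in place of the hand-written --backend option.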

For the full list of options and their types, see vLLM's Engine Arguments (CLI form) and the EngineArgs source (Python field names for vllm_config).

Important: The model field in the backend configuration is required for vllm_python. If model is also set inside vllm_config, the top-level model field takes precedence.
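The precedence rule can be pictured as a dict merge in which the top-level value is applied last. The sketch below only illustrates that rule. resolve_engine_options is a hypothetical function and not GuideLLM's actual implementation.

```python
def resolve_engine_options(backend_config: dict) -> dict:
    """Illustrate the documented precedence (hypothetical, not GuideLLM's code)."""
    if "model" not in backend_config:
        raise ValueError("model is required for vllm_python")
    # Start from a copy of the nested engine options, if any.
    options = dict(backend_config.get("vllm_config", {}))
    # Applied last, so the top-level model overrides vllm_config["model"].
    options["model"] = backend_config["model"]
    return options


resolved = resolve_engine_options(
    {
        "kind": "vllm_python",
        "model": "Qwen/Qwen3-0.6B",
        "vllm_config": {"model": "some/other-model", "max_model_len": 4096},
    }
)
print(resolved["model"])  # Qwen/Qwen3-0.6B
```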

See also