Supported Models
The vLLM Spyre plugin relies on model code implemented by the Foundation Model Stack.
Verified Deployment Configurations
The following models have been verified to run on vLLM Spyre with the listed configurations. These tables are automatically generated from the model configuration file.
Generative Models
The following models support continuous batching for text generation tasks.
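As an illustration of how one row of the tables below maps onto vLLM engine arguments, here is a minimal sketch that builds a `vllm serve` command line. The helper name `serve_command` is hypothetical; the three flags are standard vLLM engine arguments, and the sketch assumes no Spyre-specific flags are needed beyond them.

```python
import shlex


def serve_command(model: str, max_model_len: int, max_num_seqs: int,
                  tensor_parallel_size: int) -> list[str]:
    """Build a `vllm serve` argument list from one table row (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# First verified row for ibm-granite/granite-3.3-8b-instruct.
cmd = serve_command("ibm-granite/granite-3.3-8b-instruct", 3072, 16, 1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a host where vLLM and the Spyre plugin are installed.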
ibm-granite/granite-3.3-8b-instruct
| Max Model Len | Max Num Seqs | Tensor Parallel Size |
|---|---|---|
| 3072 | 16 | 1 |
| 8192 | 4 | 1 |
| 8192 | 4 | 2 |
| 32768 | 32 | 4 |
ibm-granite/granite-3.3-8b-instruct-FP8
| Max Model Len | Max Num Seqs | Tensor Parallel Size |
|---|---|---|
| 3072 | 16 | 1 |
| 16384 | 4 | 4 |
| 32768 | 32 | 4 |
meta-llama/Llama-3.1-8B-Instruct
| Max Model Len | Max Num Seqs | Tensor Parallel Size |
|---|---|---|
| 3072 | 16 | 1 |
| 16384 | 4 | 4 |
| 32768 | 32 | 4 |
ibm-granite/granite-4-8b-dense
| Max Model Len | Max Num Seqs | Tensor Parallel Size |
|---|---|---|
| 3072 | 16 | 1 |
| 8192 | 4 | 1 |
| 8192 | 4 | 2 |
| 32768 | 32 | 4 |
mistralai/Mistral-Small-3.2-24B-Instruct-2506
| Max Model Len | Max Num Seqs | Tensor Parallel Size |
|---|---|---|
| 8192 | 32 | 2 |
| 32768 | 32 | 4 |
Pooling Models
The following models support static batching for embedding and scoring tasks.
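For static batching, the column headers in the tables below are environment variable names. The sketch below shows one way to prepare a process environment from a table row before launching the server. The helper name `warmup_env` is hypothetical, and whether other variables are also required is outside the scope of this sketch.

```python
import os


def warmup_env(batch_size: int, prompt_len: int) -> dict[str, str]:
    """Map one table row to the two warmup variables (hypothetical helper)."""
    return {
        "VLLM_SPYRE_WARMUP_BATCH_SIZES": str(batch_size),
        "VLLM_SPYRE_WARMUP_PROMPT_LENS": str(prompt_len),
    }


# Row verified for ibm-granite/granite-embedding-125m-english.
env = {**os.environ, **warmup_env(64, 512)}
# `env` can then be passed as `env=env` to subprocess.run(...) when starting the server.
```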
ibm-granite/granite-embedding-125m-english
| VLLM_SPYRE_WARMUP_BATCH_SIZES | VLLM_SPYRE_WARMUP_PROMPT_LENS | Tensor Parallel Size |
|---|---|---|
| 64 | 512 | 1 |
ibm-granite/granite-embedding-278m-multilingual
| VLLM_SPYRE_WARMUP_BATCH_SIZES | VLLM_SPYRE_WARMUP_PROMPT_LENS | Tensor Parallel Size |
|---|---|---|
| 64 | 512 | 1 |
intfloat/multilingual-e5-large
| VLLM_SPYRE_WARMUP_BATCH_SIZES | VLLM_SPYRE_WARMUP_PROMPT_LENS | Tensor Parallel Size |
|---|---|---|
| 64 | 512 | 1 |
| VLLM_SPYRE_WARMUP_BATCH_SIZES | VLLM_SPYRE_WARMUP_PROMPT_LENS | Tensor Parallel Size |
|---|---|---|
| 1 | 8192 | 1 |
| VLLM_SPYRE_WARMUP_BATCH_SIZES | VLLM_SPYRE_WARMUP_PROMPT_LENS | Tensor Parallel Size |
|---|---|---|
| 64 | 512 | 1 |
sentence-transformers/all-roberta-large-v1
| VLLM_SPYRE_WARMUP_BATCH_SIZES | VLLM_SPYRE_WARMUP_PROMPT_LENS | Tensor Parallel Size |
|---|---|---|
| 8 | 128 | 1 |
Model Configuration
The Spyre engine uses a model registry to manage model-specific configurations. Model configurations are defined in vllm_spyre/config/model_configs.yaml and include:
- Architecture patterns for model matching
- Device-specific configurations (environment variables, GPU block overrides)
- Supported runtime configurations (static batching warmup shapes, continuous batching parameters)
When a model is loaded, the registry automatically matches it to the appropriate configuration and applies model-specific settings.
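The matching step can be pictured with a small sketch. Everything here is an assumption made for illustration: the `RegistryEntry` class, its fields, the example entries, and the matching rule (every pattern field must equal the model's config value) are hypothetical and do not reflect the actual schema of model_configs.yaml.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RegistryEntry:
    name: str
    architecture: dict                        # hypothetical pattern: config fields that must match
    env: dict = field(default_factory=dict)   # hypothetical model-specific settings


def match_model(hf_config: dict, registry: list[RegistryEntry]) -> Optional[RegistryEntry]:
    """Return the first entry whose pattern fields all equal the model's config values."""
    for entry in registry:
        if all(hf_config.get(key) == value for key, value in entry.architecture.items()):
            return entry
    return None


# Illustrative entries only; names and values are made up.
registry = [
    RegistryEntry("example-decoder", {"model_type": "granite", "num_hidden_layers": 40},
                  {"EXAMPLE_FLAG": "1"}),
    RegistryEntry("example-embedding", {"model_type": "roberta", "num_hidden_layers": 12}),
]

entry = match_model({"model_type": "granite", "num_hidden_layers": 40, "vocab_size": 1000},
                    registry)
print(entry.name if entry else "no match")
```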
Configuration Validation
By default, the Spyre engine logs a warning if a requested model or configuration is not found in the registry. To enforce strict validation and fail when an unknown configuration is requested, set the environment variable:
When this flag is enabled, the engine raises a RuntimeError if:
- The model cannot be matched to a known configuration
- The requested runtime parameters are not in the supported configurations list
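The difference between the default and strict behaviors can be sketched as follows. This is not the plugin's implementation: the `validate` function and its `strict` parameter are hypothetical stand-ins for the engine's check and the strict-validation flag, and the supported set simply copies the verified rows for ibm-granite/granite-3.3-8b-instruct.

```python
import logging

logger = logging.getLogger("spyre_config_sketch")

# (max_model_len, max_num_seqs, tensor_parallel_size) rows from the verified table.
SUPPORTED = {(3072, 16, 1), (8192, 4, 1), (8192, 4, 2), (32768, 32, 4)}


def validate(requested: tuple, supported: set, strict: bool) -> bool:
    """Warn on an unknown configuration by default; raise when strict is set."""
    if requested in supported:
        return True
    message = f"Configuration {requested} is not in the supported list"
    if strict:
        raise RuntimeError(message)
    logger.warning(message)
    return False


validate((3072, 16, 1), SUPPORTED, strict=True)    # known configuration, passes
validate((4096, 8, 1), SUPPORTED, strict=False)    # unknown, logs a warning only
```

With `strict=True`, the second call would raise a RuntimeError instead of logging.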
See the Configuration Guide for more details on model configuration.