Over-Saturation Stopping
GuideLLM provides over-saturation detection (OSD) to automatically stop benchmarks when a model becomes over-saturated. This feature helps prevent wasted compute resources and ensures that benchmark results remain valid by detecting when the response rate can no longer keep up with the request rate.
What is Over-Saturation?
Over-saturation occurs when an LLM inference server receives requests faster than it can process them, causing a queue to build up. As the queue grows, the server takes progressively longer to start handling each request, leading to degraded performance metrics. When a performance benchmarking tool over-saturates an LLM inference server, the metrics it measures become significantly skewed, rendering them useless.
Think of a cashier during a sudden rush: customers arrive faster than they can be served, so the line keeps growing and each new customer waits longer than the last. Continuing a benchmark in this state wastes costly machine time, which GuideLLM avoids by detecting over-saturation and stopping the benchmark automatically.
How It Works
GuideLLM's Over-Saturation Detection (OSD) algorithm uses statistical slope detection to identify when a model becomes over-saturated. The algorithm tracks two key metrics over time:
- Concurrent Requests: The number of requests being processed simultaneously
- Time-to-First-Token (TTFT): The latency for the first token of each response
For each metric, the algorithm:
- Maintains a sliding window of recent data points
- Calculates the linear regression slope using online statistics
- Computes the margin of error (MOE) using t-distribution confidence intervals
- Detects positive slopes with low MOE, indicating degradation
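The per-metric steps above can be sketched in plain Python. This is a minimal illustration, not GuideLLM's implementation: the class name `SlopeWindow`, the use of timestamps as the x-axis, and the numerically integrated t-quantile are all assumptions made to keep the example dependency-free.

```python
import math
from collections import deque


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 1000) -> float:
    """CDF for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(confidence: float, df: int) -> float:
    """Two-sided critical value: the (1 + confidence) / 2 quantile, by bisection."""
    target = (1 + confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # grow the bracket until it contains the quantile
        lo, hi = hi, hi * 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


class SlopeWindow:
    """Hypothetical sliding window of (time, value) points with running sums."""

    def __init__(self, max_window_seconds: float = 120.0):
        self.max_window_seconds = max_window_seconds
        self.points = deque()
        self.sx = self.sy = self.sxx = self.sxy = self.syy = 0.0

    def _update(self, x: float, y: float, sign: int) -> None:
        # Add (sign=+1) or remove (sign=-1) one point from the running sums.
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y
        self.syy += sign * y * y

    def add(self, t: float, value: float) -> None:
        self.points.append((t, value))
        self._update(t, value, +1)
        # Prune points that have fallen out of the time window.
        while t - self.points[0][0] > self.max_window_seconds:
            self._update(*self.points.popleft(), -1)

    def slope_and_moe(self, confidence: float = 0.95):
        """Return (slope, margin_of_error), or None if there is too little data."""
        n = len(self.points)
        if n < 3:
            return None
        sxx_c = self.sxx - self.sx ** 2 / n
        sxy_c = self.sxy - self.sx * self.sy / n
        syy_c = self.syy - self.sy ** 2 / n
        if sxx_c <= 0:
            return None
        slope = sxy_c / sxx_c
        rss = max(0.0, syy_c - slope * sxy_c)  # residual sum of squares
        std_err = math.sqrt(rss / (n - 2) / sxx_c)
        return slope, t_critical(confidence, n - 2) * std_err
```

Running sums let a point be added or pruned in constant time, at the cost of some floating-point drift on long runs; a production implementation would likely use a more numerically stable update.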
Over-saturation is detected when:
- Both concurrent requests and TTFT show statistically significant positive slopes
- The minimum duration threshold has been met
- Sufficient data points are available for reliable slope estimation
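The three conditions can be combined into a single decision, sketched below. The names `SlopeEstimate`, `slope_is_significant`, and `is_over_saturated` are hypothetical, and the rule "slope > 0 and MOE below `moe_threshold`" is an assumption based on the description above rather than GuideLLM's exact test.

```python
from dataclasses import dataclass


@dataclass
class SlopeEstimate:
    slope: float  # regression slope of the metric over time
    moe: float    # margin of error of that slope
    n: int        # number of data points behind the estimate


def slope_is_significant(
    est: SlopeEstimate,
    moe_threshold: float = 2.0,
    minimum_window_size: int = 5,
) -> bool:
    # Assumed rule: enough points, a positive slope, and a small margin of error.
    return (
        est.n >= minimum_window_size
        and est.slope > 0
        and est.moe < moe_threshold
    )


def is_over_saturated(
    concurrent: SlopeEstimate,
    ttft: SlopeEstimate,
    elapsed_seconds: float,
    min_seconds: float = 30.0,
    moe_threshold: float = 2.0,
    minimum_window_size: int = 5,
) -> bool:
    # Skip detection entirely during the warm-up period.
    if elapsed_seconds < min_seconds:
        return False
    # Both metrics must be rising; one alone is not treated as over-saturation.
    return slope_is_significant(
        concurrent, moe_threshold, minimum_window_size
    ) and slope_is_significant(ttft, moe_threshold, minimum_window_size)
```

Requiring both metrics to rise guards against flagging a run where only one of them drifts, for example TTFT increasing because prompts got longer while concurrency stays flat.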
When over-saturation is detected, the constraint automatically stops request queuing and optionally stops processing of existing requests, preventing further resource waste.
Usage
Over-saturation detection is enabled by adding a constraint with kind=over_saturation to the benchmark command.
Basic Usage
Enable over-saturation detection with default settings:
guidellm run \
--backend kind=openai_http,target=http://localhost:8000 \
--profile kind=throughput,max_concurrency=10 \
--constraint kind=over_saturation
Advanced Configuration
Configure detection parameters in the constraint config string:
guidellm run \
--backend kind=openai_http,target=http://localhost:8000 \
--profile kind=concurrent,streams=16 \
--constraint kind=over_saturation,mode=enforce,min_seconds=60,max_window_seconds=300,moe_threshold=1.5
Or with JSON:
--constraint '{"kind":"over_saturation","mode":"enforce","min_seconds":60,"max_window_seconds":300,"moe_threshold":1.5}'
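When the command is assembled by a script, generating the JSON form programmatically avoids quoting mistakes. The snippet below is a small sketch using only the Python standard library; the option values are the ones from the example above.

```python
import json
import shlex

# Same options as the key=value example above.
config = {
    "kind": "over_saturation",
    "mode": "enforce",
    "min_seconds": 60,
    "max_window_seconds": 300,
    "moe_threshold": 1.5,
}

# Compact separators keep the JSON free of spaces.
constraint_json = json.dumps(config, separators=(",", ":"))

# shlex.quote wraps the value so a POSIX shell passes it as one argument.
print("--constraint", shlex.quote(constraint_json))
```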
Configuration Options
The following parameters can be configured when enabling over-saturation detection:
- mode ("enforce" | "monitor", default: "enforce"): Controls behavior when over-saturation is detected. "enforce" stops the benchmark; "monitor" only monitors and reports.
- min_seconds (float, default: 30.0): Minimum seconds before checking for over-saturation. This prevents false positives during the initial warm-up phase.
- max_window_seconds (float, default: 120.0): Maximum time window in seconds for data retention. Older data points are automatically pruned to maintain bounded memory usage.
- moe_threshold (float, default: 2.0): Margin of error threshold for slope detection. Lower values make detection more sensitive to degradation.
- minimum_ttft (float, default: 2.5): Minimum TTFT threshold in seconds for violation counting. Only TTFT values above this threshold are counted as violations.
- maximum_window_ratio (float, default: 0.75): Maximum window size as a ratio of total requests. Limits memory usage by capping the number of tracked requests.
- minimum_window_size (int, default: 5): Minimum data points required for slope estimation. Ensures statistical reliability before making detection decisions.
- confidence (float, default: 0.95): Statistical confidence level for t-distribution calculations (0-1). Higher values require stronger evidence before detecting over-saturation.
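For scripted runs, the options and defaults listed above can be mirrored in a small helper that validates values and renders the key=value constraint string. `OverSaturationOptions` is a hypothetical class written for this example, not part of GuideLLM, and its validation checks are assumptions based on the descriptions above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OverSaturationOptions:
    # Defaults copied from the option list above.
    mode: str = "enforce"
    min_seconds: float = 30.0
    max_window_seconds: float = 120.0
    moe_threshold: float = 2.0
    minimum_ttft: float = 2.5
    maximum_window_ratio: float = 0.75
    minimum_window_size: int = 5
    confidence: float = 0.95

    def __post_init__(self) -> None:
        # Assumed sanity checks; GuideLLM may validate differently.
        if self.mode not in ("enforce", "monitor"):
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not 0 < self.confidence < 1:
            raise ValueError("confidence must be between 0 and 1")

    def to_constraint_arg(self) -> str:
        """Render a key=value string, listing only non-default options."""
        parts = ["kind=over_saturation"]
        for f in fields(self):
            value = getattr(self, f.name)
            if value != f.default:
                parts.append(f"{f.name}={value}")
        return ",".join(parts)


print(OverSaturationOptions(mode="monitor", min_seconds=60).to_constraint_arg())
```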
Use Cases
Over-saturation detection is particularly useful in the following scenarios:
Stress Testing and Capacity Planning
When testing how your system handles increasing load, over-saturation detection automatically stops benchmarks once the system can no longer keep up, preventing wasted compute time on invalid results.
guidellm run \
--backend kind=openai_http,target=http://localhost:8000 \
--profile kind=sweep,sweep_size=5 \
--constraint kind=over_saturation
Cost-Effective Benchmarking
When running large-scale benchmark matrices across multiple models, GPUs, and configurations, over-saturation detection can significantly reduce costs by stopping invalid runs early.
Finding Safe Operating Ranges
Use over-saturation detection to identify the maximum sustainable throughput for your deployment, helping you set appropriate rate limits and capacity planning targets.
Interpreting Results
When over-saturation detection is configured (in either "enforce" or "monitor" mode), the benchmark output includes metadata about the detection state. This metadata is available in the scheduler action metadata and includes:
- is_over_saturated (bool): Whether over-saturation was detected at the time of evaluation
- concurrent_slope (float): The calculated slope for concurrent requests
- concurrent_slope_moe (float): The margin of error for the concurrent requests slope
- concurrent_n (int): The number of data points used for concurrent requests slope calculation
- ttft_slope (float): The calculated slope for TTFT
- ttft_slope_moe (float): The margin of error for the TTFT slope
- ttft_n (int): The number of data points used for TTFT slope calculation
- ttft_violations (int): The count of TTFT values exceeding the minimum threshold
These metrics can help you understand why over-saturation was detected and fine-tune the detection parameters if needed.
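One way to read these fields is to treat each slope and its margin of error as a confidence interval and check whether that interval excludes zero. The sketch below assumes the metadata has already been loaded into a plain dict with the field names listed above; how to extract it from the benchmark output is not shown, the sample values are invented for illustration, and `describe_trend` and `summarize` are hypothetical helpers. GuideLLM's own decision rule may differ from this interval check.

```python
def describe_trend(slope: float, moe: float) -> str:
    """Classify a slope by whether slope +/- moe excludes zero."""
    if slope - moe > 0:
        return "rising"
    if slope + moe < 0:
        return "falling"
    return "inconclusive"


def summarize(meta: dict) -> list[str]:
    """Build a short, human-readable summary of the detection metadata."""
    lines = [f"over-saturated: {meta['is_over_saturated']}"]
    for name in ("concurrent", "ttft"):
        slope = meta[f"{name}_slope"]
        moe = meta[f"{name}_slope_moe"]
        n = meta[f"{name}_n"]
        lines.append(
            f"{name}: slope {slope:.3f} +/- {moe:.3f} "
            f"over {n} points ({describe_trend(slope, moe)})"
        )
    lines.append(f"ttft violations: {meta['ttft_violations']}")
    return lines


# Illustrative values only, not real benchmark output.
sample = {
    "is_over_saturated": True,
    "concurrent_slope": 0.8,
    "concurrent_slope_moe": 0.3,
    "concurrent_n": 40,
    "ttft_slope": 0.05,
    "ttft_slope_moe": 0.02,
    "ttft_n": 40,
    "ttft_violations": 12,
}

for line in summarize(sample):
    print(line)
```

If a trend reads as "inconclusive" in runs that look saturated on inspection, that is a hint to revisit parameters such as max_window_seconds or confidence.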
Example: Complete Benchmark with Over-Saturation Detection
guidellm run \
--backend kind=openai_http,target=http://localhost:8000 \
--profile kind=concurrent,streams=16 \
--data kind=synthetic_text,prompt_tokens=256,output_tokens=128 \
--constraint kind=max_duration,seconds=300 \
--constraint kind=over_saturation,mode=enforce,min_seconds=30,max_window_seconds=120 \
--output kind=json,path=benchmark.json \
--output kind=html,path=benchmark.html
This example:
- Runs a concurrent benchmark with 16 simultaneous requests
- Uses synthetic data with 256 prompt tokens and 128 output tokens
- Enables over-saturation detection with custom timing parameters
- Sets a maximum duration of 300 seconds (as a fallback)
- Outputs results in both JSON and HTML formats
Additional Resources
For more in-depth information about over-saturation detection, including the algorithm development, evaluation metrics, and implementation details, see the following Red Hat Developer blog posts:
- Reduce LLM benchmarking costs with oversaturation detection - An introduction to the problem of over-saturation and why it matters for LLM benchmarking
- Defining success: Evaluation metrics and data augmentation for oversaturation detection - How to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques
- Building an oversaturation detector with iterative error analysis - A detailed walkthrough of how the OSD algorithm was built