Skip to content

Datasets

GuideLLM supports various dataset configurations to enable benchmarking and evaluation of large language models (LLMs). This document provides a comprehensive guide to configuring datasets for different use cases, along with detailed examples and rationale for choosing specific pathways.

Data Arguments Overview

Dataset and processing options use the "kind" pattern to select a registered implementation and configure parameters:

guidellm run --data kind=<TYPE>,key=value,... [other options...]

Repeat --data to combine multiple sources. Use comma-separated key=value pairs for flat settings, or serialized JSON when any value is nested.

Dataset Source and Type

The following arguments configure datasets and their processing:

  • Dataset source and type. Accepted types:
  • synthetic_text — generates synthetic prompts on the fly. Required field: prompt_tokens. Optional: output_tokens, turns, prefix_tokens, prefix_count, prefix_buckets, and distribution controls (prompt_tokens_stdev, output_tokens_stdev, etc.).
  • huggingface (alias hf) — loads from HuggingFace Hub or a local directory/file. Required field: source (dataset ID or path). Pass dataset loading arguments (for example split, name) via load_kwargs.
  • json_file, csv_file, text_file, parquet_file, arrow_file, hdf5_file, db_file, tar_file — loads from a local file. Required field: path.
  • trace_synthetic — loads a JSONL trace file for replay benchmarking. Required field: path. Optional: timestamp_column (default: timestamp), prompt_tokens_column (default: input_length), output_tokens_column (default: output_length).

In addition, you can specify additional arguments to the dataset loading with the data argument loader_kwargs:

  • loader_kwargs: Additional arguments to the dataset loading. For example, dataset splits can be specified with --data '{"kind":"huggingface","source":"my/dataset","loader_kwargs":{"split":"test"}}'.

Data Loader

You can specify the data loader with --data-loader and a configuration string. The only current type is pytorch:

  • --data-loader kind=pytorch,samples=1000,num_workers=4,shuffle=true

Tokenizer

You can specify the tokenizer with --tokenizer and a configuration string. The only current type is huggingface_auto:

  • --tokenizer '{"kind":"huggingface_auto","model":"path/to/processor","load_kwargs":{"use_fast":false}}'

If a HuggingFace model requires custom code to load, pass trust_remote_code in load_kwargs. See Troubleshooting: trust_remote_code.

If your dataset uses non-standard column names, you can use --data-column-mapper to map your columns to GuideLLM's expected column names. This is particularly useful when:

  1. Your dataset uses different column names (e.g., question instead of prompt, instruction instead of text_column)
  2. You have multiple datasets and need to specify which dataset's columns to use
  3. Your dataset has system prompts or prefixes in a separate column

Column Mapping

The --data-column-mapper accepts a string mapping column types to column names:

--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"prompt"}}'

Preprocessors

With --data-preprocessor you can control the GuideLLM preprocessors to apply: preprocessors include encode_media, tool_calling_message_extractor, turn_pivot.

Data Finalizer

With --data-finalizer you can control how dataset rows are converted to requests (default: generative).

Random Seed

Use --seed with a configuration string to control how sequences (in scheduling and synthetic data generation) are randomized: For example, --seed kind=static,value=42.

Example Usage

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=throughput \
  --constraint kind=max_requests,count=1000 \
  --data '{"kind":"huggingface","source":"my/dataset","load_kwargs":{"split":"test"}}' \
  --data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"prompt"}}' \
  --tokenizer '{"kind":"huggingface_auto","model":"path/to/processor","load_kwargs":{"use_fast":false}}' \
  --data-loader kind=pytorch,shuffle=true

Dataset Types

GuideLLM supports several types of datasets, each with its own advantages and use cases. Below are the main dataset types supported by GuideLLM, including synthetic data, Hugging Face datasets, file-based datasets, and in-memory datasets.

Synthetic Data

Synthetic datasets allow you to generate data on the fly with customizable parameters. This is useful for controlled experiments, stress testing, and simulating specific scenarios. For example, you might want to evaluate how a model handles long prompts or generates outputs with specific characteristics.

Example Commands

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=throughput \
  --constraint kind=max_requests,count=1000 \
  --data kind=synthetic_text,prompt_tokens=256,output_tokens=128

Or using a JSON config string:

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=throughput \
  --constraint kind=max_requests,count=1000 \
  --data '{"kind":"synthetic_text","prompt_tokens":256,"output_tokens":128}'

Configuration Options

  • prompt_tokens: Average number of tokens in prompts. If nothing else is specified, all requests will have this number of tokens.
  • prompt_tokens_stdev: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
  • prompt_tokens_min: Minimum number of tokens in prompts. If unset and prompt_tokens_stdev is set, the minimum is 1.
  • prompt_tokens_max: Maximum number of tokens in prompts. If unset and prompt_tokens_stdev is set, the maximum is 5 times the standard deviation.
  • output_tokens: Average number of tokens in outputs. Optional; omit when benchmarking endpoints that do not produce output tokens (for example embeddings).
  • output_tokens_stdev: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
  • output_tokens_min: Minimum number of tokens in outputs. If unset and output_tokens_stdev is set, the minimum is 1.
  • output_tokens_max: Maximum number of tokens in outputs. If unset and output_tokens_stdev is set, the maximum is 5 times the standard deviation.
  • samples: Number of samples to generate (default: 1000). More samples will increase the time taken to generate the dataset before benchmarking, but will also decrease the likelihood of caching requests.
  • source: Source text for generation (default: data:prideandprejudice.txt.gz). This can be any text file, URL containing a text file, or a compressed text file. The text is used to sample from at a word and punctuation granularity and then combined into a single string of the desired lengths.

Notes

  • A tokenizer is required. By default, the model from the backend is used. If unavailable, use --tokenizer kind=huggingface_auto,model=<path-or-id> to specify a directory or Hugging Face model ID containing the tokenizer files.

Hugging Face Datasets

GuideLLM supports datasets from the Hugging Face Hub or local directories that follow the datasets library format. This allows you to easily leverage a wide range of datasets for benchmarking and evaluation with real-world data.

Example Commands

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=throughput \
  --constraint kind=max_requests,count=1000 \
  --data kind=huggingface,source=garage-bAInd/Open-Platypus

Or using a local dataset directory:

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=throughput \
  --constraint kind=max_requests,count=1000 \
  --data kind=huggingface,source=path/to/dataset

Pass HuggingFace load_dataset options through load_kwargs:

--data '{"kind":"huggingface","source":"my/dataset","load_kwargs":{"split":"train"}}'

Notes

  • Hugging Face datasets can be specified by ID, a local directory, or a path to a local Python file.
  • A supported Hugging Face datasets format is defined as one that can be loaded using the datasets library with the load_dataset function and therefore it is representable as a Dataset, DatasetDict, IterableDataset, or IterableDatasetDict. More information on the supported data types and additional args for the underlying use of load_dataset can be found in the Hugging Face datasets documentation.
  • A tokenizer is only required if GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local" or GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local" is set in the environment. In this case, the tokenizer must be specified using --tokenizer kind=huggingface_auto,model=.... If not set, the tokenizer will be set to the model passed in or retrieved from the backend.

File-Based Datasets

GuideLLM supports various file formats for datasets, including text, CSV, JSON, and more. These datasets can be used for benchmarking and evaluation, allowing you to work with structured data in a familiar format that matches your use case.

Supported Formats with Examples

  • Text files (.txt, .text): Where each line is a separate prompt to use.
Hello, how are you?
What is your name?
  • CSV files (.csv): Where each row is a separate dataset entry and the first row contains the column names. The columns should include prompt or other common names for the prompt which will be used as the prompt column. Additional columns can be included based on the previously mentioned aliases for the --data-column-mapper argument.
prompt,output_tokens_count,additional_column,additional_column2
Hello, how are you?,5,foo,bar
What is your name?,3,baz,qux
  • JSON Lines files (.jsonl): Where each line is a separate JSON object. The objects should include prompt or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the --data-column-mapper argument.
{"prompt": "Hello, how are you?", "output_tokens_count": 5, "additional_column": "foo", "additional_column2": "bar"}
{"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
  • Trace files (.jsonl with trace_synthetic type): Specialized JSONL files for replay benchmarking with timestamp, input_length, and output_length fields. Used with --profile kind=replay to replay trace events using each row's timestamp and token lengths. Timestamps must be numbers expressed in seconds on a shared timeline with any consistent zero point; GuideLLM sorts them and converts them to offsets from the first event before scheduling. Date strings are not parsed yet, so provide timestamps as numbers. See Trace Replay Benchmarking.
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}

In this example, the second request is scheduled 0.5 seconds after the first request. Trace rows are ordered by timestamp before GuideLLM schedules requests and generates synthetic payloads. This keeps each scheduled event aligned with the prompt and output token lengths from the same row.

Use trace_synthetic to enable trace loading:

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=replay,time_scale=1.0 \
  --data kind=trace_synthetic,path=path/to/trace.jsonl

If your trace uses different column names, include timestamp_column, prompt_tokens_column, and output_tokens_column in the data config:

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=replay,time_scale=1.0 \
  --data kind=trace_synthetic,path=replay.jsonl,timestamp_column=timestamp,prompt_tokens_column=input_length,output_tokens_column=output_length

For replay, time_scale on the profile is a time scale for the intervals between trace events rather than requests per second. Use --data-loader kind=pytorch,samples=1000 to limit how many trace rows are loaded and replayed. Use --constraint kind=max_requests,count=<n> only as a runtime completion constraint; it does not limit the trace rows loaded from the file.

Very small input_length values (roughly under 15 tokens, depending on the tokenizer) may not leave enough room for the full per-row unique prefix in the synthetic prompt. This can make prompts more similar across rows and weaken cache resistance. See Trace Replay Benchmarking for details.

  • JSON files (.json): Where the entire dataset is represented as a JSON array of objects nested under a specific key. To surface the correct key to use, a --data-column-mapper argument must be passed in of "field": "NAME" for where the array exists. The objects should include prompt or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the --data-column-mapper argument.
{
  "version": "1.0",
  "data": [
    {"prompt": "Hello, how are you?", "output_tokens_count": 5, "additional_column": "foo", "additional_column2": "bar"},
    {"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
  ]
}
  • Parquet files (.parquet) Example: A binary columnar storage format for efficient data processing. For more information on the supported formats, see the Hugging Face dataset documentation linked in the Notes section.

  • Arrow files (.arrow) Example: A cross-language development platform for in-memory data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the Notes section.

  • HDF5 files (.hdf5) Example: A hierarchical data format for storing large amounts of data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the Notes section.

Example Commands

guidellm run \
  --backend kind=openai_http,target=http://localhost:8000 \
  --profile kind=throughput \
  --constraint kind=max_requests,count=1000 \
  --data kind=json_file,path=path/to/dataset.json \
  --data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"prompt"}}'

Replace json_file with csv_file, text_file, parquet_file, arrow_file, or hdf5_file for other formats.

Notes

  • Ensure the file format matches the expected structure for the dataset and is listed as a supported format.
  • The --data-column-mapper argument can be used to specify additional parameters for parsing the dataset, such as the prompt column name or the split to use.
  • More information on the supported formats and additional args for the underlying use of load_dataset can be found in the Hugging Face datasets documentation.

In-Memory Datasets

In-memory datasets allow you to directly pass data as Python objects, making them ideal for quick prototyping and testing without the need to save data to disk.

Supported Formats with Examples

  • Dictionary of columns and values: Where each key is a column name and the values are lists of data points. The keys should include prompt or other common names for the prompt which will be used as the prompt column. Additional columns can be included based on the previously mentioned aliases for the --data-column-mapper argument.
    {
        "column1": ["value1", "value2"],
        "column2": ["value3", "value4"]
    }
    
  • List of dictionaries: Where each dictionary represents a single data point with key-value pairs. The dictionaries should include prompt or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the --data-column-mapper argument.
    [
        {"column1": "value1", "column2": "value3"},
        {"column1": "value2", "column2": "value4"}
    ]
    
  • List of items: Where each item is a single data point. The items should include prompt or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the --data-column-mapper argument.
    [
        "value1",
        "value2",
        "value3"
    ]
    

Example Usage

from guidellm.benchmark import benchmark_generative_text

data = [
    {"prompt": "Hello", "output": "Hi"},
    {"prompt": "How are you?", "output": "I'm fine."}
]

benchmark_generative_text(data=data, ...)

Notes

  • Ensure that the data format is consistent and adheres to one of the supported structures.
  • For dictionaries, all columns must have the same number of samples.
  • For lists of dictionaries, all items must have the same keys.
  • For lists of items, all elements must be of the same type.
  • A tokenizer is only required if GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local" or GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local" is set in the environment. In this case, the tokenizer must be specified using --tokenizer kind=huggingface_auto,model=.... If not set, the tokenizer will be set to the model passed in or retrieved from the server.

Preprocessing Datasets

GuideLLM provides a preprocessing command that allows you to process datasets to have specific prompt and output token sizes. This is particularly useful when you need to standardize your dataset for benchmarking or when your dataset has prompts that don't match your target token requirements.

The preprocessing command can:

  • Resize prompts to target token lengths
  • Handle prompts that are shorter or longer than the target length using various strategies
  • Map columns from your dataset to GuideLLM's expected column names
  • Generate output token counts based on your configuration
  • Save the processed dataset in various formats

Basic Usage

guidellm preprocess dataset \
    kind=<DATA_TYPE>,key=value,... \
    <OUTPUT_PATH> \
    --processor <PROCESSOR> \
    --config <CONFIG>

Required Arguments

Argument Description
DATA Identify the dataset to process. Supports all dataset formats documented in the Dataset Configurations.
OUTPUT_PATH Path to save the processed dataset, including file suffix (e.g., processed_dataset.jsonl, output.csv).
--processor Required. Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path.
--config Required. Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config).

Example

guidellm preprocess dataset \
    kind=json_file,path=path/to/input_dataset.jsonl \
    "path/to/processed_dataset.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100"

Configuration and Processor Options

The --config parameter accepts a PreprocessDatasetConfig as a JSON string, key=value pairs, or a configuration file path (.json, .yaml, .yml, .config). This configuration is similar to the synthetic data configuration but includes additional fields specific to preprocessing.

PreprocessDatasetConfig Options:

  • prompt_tokens: Average number of tokens in prompts. If nothing else is specified, all prompts will be resized to this number of tokens.
  • prompt_tokens_stdev: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
  • prompt_tokens_min: Minimum number of tokens in prompts. If unset and prompt_tokens_stdev is set, the minimum is 1.
  • prompt_tokens_max: Maximum number of tokens in prompts. If unset and prompt_tokens_stdev is set, the maximum is 5 times the standard deviation.
  • output_tokens: Average number of tokens in outputs. If nothing else is specified, all outputs will have this number of tokens.
  • output_tokens_stdev: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
  • output_tokens_min: Minimum number of tokens in outputs. If unset and output_tokens_stdev is set, the minimum is 1.
  • output_tokens_max: Maximum number of tokens in outputs. If unset and output_tokens_stdev is set, the maximum is 5 times the standard deviation.
  • prefix_tokens_max: Maximum number of prefix tokens to keep. If set, prefixes will be trimmed to this maximum length. If not set, prefixes are kept as-is (unless --include-prefix-in-token-count is used, which disables prefix trimming).

Example configurations:

# Using key=value pairs
--config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100"

# Using JSON string
--config '{"prompt_tokens": 512, "output_tokens": 256, "prefix_tokens_max": 100}'

# Using a configuration file
--config "path/to/config.json"

The --processor argument specifies the tokenizer to use for calculating token counts. This is required because the preprocessing command needs to tokenize prompts to ensure they match the target token sizes. For information about using tokenizers in benchmarks, see the Data Arguments Overview section.

Column Mapping

When your dataset uses non-standard column names, you can use --data-column-mapper to map your columns to GuideLLM's expected column names. This is particularly useful when:

  1. Your dataset uses different column names (e.g., question instead of prompt, instruction instead of text_column)
  2. You have multiple datasets and need to specify which dataset's columns to use
  3. Your dataset has system prompts or prefixes in a separate column

Column mapping format: The --data-column-mapper accepts a JSON string mapping column types to column names:

--data-column-mapper '{"kind":"generative_column_mapper","column_mappings": {
  "text_column": "question",
  "prefix_column": "system_prompt",
  "prompt_tokens_count_column": "input_tokens",
  "output_tokens_count_column": "completion_tokens"
}}'

Supported column types:

  • text_column: The main prompt text (defaults: prompt, instruction, question, input, context, content, text)
  • prefix_column: System prompt or prefix (defaults: system_prompt, system, prefix)
  • prompt_tokens_count_column: Column containing prompt token counts (defaults: prompt_tokens_count, input_tokens_count)
  • output_tokens_count_column: Column containing output token counts (defaults: output_tokens_count, completion_tokens_count)
  • image_column: Image data column
  • video_column: Video data column
  • audio_column: Audio data column

Example: Mapping custom column names

If your dataset has a CSV file with columns user_query and system_message:

user_query,system_message
"What is AI?","You are a helpful assistant."
"How does ML work?","You are a technical expert."

You would use:

guidellm preprocess dataset \
    "kind=csv_file,path=dataset.csv" \
    "processed.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,output_tokens=256" \
    --data-column-mapper '{"kind":"generative_column_mapper","column_mappings": {"text_column": "user_query", "prefix_column": "system_message"}}'

Example: Multiple datasets

If you're working with multiple datasets and need to specify which dataset's columns to use, you can use the format <dataset_index>.<column_name> or <dataset_name>.<column_name>:

--data-column-mapper '{"kind":"generative_column_mapper","column_mappings": {"text_column": "0.prompt", "prefix_column": "1.system"}}'

Handling Short Prompts

When prompts are shorter than the target token length, you can specify how to handle them using --short-prompt-strategy:

Strategy Description
ignore Skip prompts that are shorter than the target length (default)
concatenate Concatenate multiple short prompts together until the target length is reached
pad Pad short prompts with a specified character to reach the target length
error Raise an error if a prompt is shorter than the target length

Example: Concatenating short prompts

guidellm preprocess dataset \
    "kind=json_file,path=dataset.jsonl" \
    "processed.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,output_tokens=256" \
    --short-prompt-strategy "concatenate" \
    --concat-delimiter "\n\n"

Example: Padding short prompts

guidellm preprocess dataset \
    "kind=json_file,path=dataset.jsonl" \
    "processed.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,output_tokens=256" \
    --short-prompt-strategy "pad" \
    --pad-char " "

Additional Options

Option Description
--data-args <JSON> JSON string of arguments to pass to dataset loading. See Data Arguments Overview for details.
--include-prefix-in-token-count Include prefix tokens in prompt token count calculation (flag). When enabled, prefix trimming is disabled and the prefix is kept as-is.
--random-seed <NUMBER> Random seed for reproducible token sampling (default: 42).
--push-to-hub Push the processed dataset to Hugging Face Hub (flag).
--hub-dataset-id <ID> Hugging Face Hub dataset ID for upload (required if --push-to-hub is set).

Complete Examples

Example 1: Basic preprocessing with custom column names

guidellm preprocess dataset \
    kind=csv_file,path=my_dataset.csv \
    "processed_dataset.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,output_tokens=256" \
    --data-column-mapper '{"kind":"generative_column_mapper","column_mappings": {"text_column": "user_question", "prefix_column": "system_instruction"}}'

Example 2: Preprocessing with distribution and short prompt handling

guidellm preprocess dataset \
    "kind=json_file,path=dataset.jsonl" \
    "processed.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,prompt_tokens_stdev=50,output_tokens=256,output_tokens_stdev=25" \
    --short-prompt-strategy "concatenate" \
    --concat-delimiter "\n\n" \
    --random-seed 123

Example 3: Preprocessing with processor arguments and prefix token limits

guidellm preprocess dataset \
    "kind=json_file,path=dataset.jsonl" \
    "processed.jsonl" \
    --processor "gpt2" \
    --processor-args '{"use_fast": false}' \
    --config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100" \
    --include-prefix-in-token-count

Example 4: Preprocessing and uploading to Hugging Face Hub

guidellm preprocess dataset \
    "kind=json_file,path=my_dataset.jsonl" \
    "processed.jsonl" \
    --processor "gpt2" \
    --config "prompt_tokens=512,output_tokens=256" \
    --push-to-hub \
    --hub-dataset-id "username/processed-dataset"

Notes

  • The --config parameter accepts a PreprocessDatasetConfig which includes all token count fields (prompt_tokens, output_tokens, etc.) plus prefix_tokens_max for controlling prefix length. See the Configuration and Processor Options section above for all available parameters.
  • The processor/tokenizer is required because the preprocessing command needs to tokenize prompts to ensure they match target token sizes. See the Data Arguments Overview for tokenizer usage in benchmarks.
  • Column mappings are only needed when your dataset uses non-standard column names. GuideLLM will automatically try common column names if no mapping is provided.
  • When using --short-prompt-strategy concatenate, ensure your dataset has enough samples to concatenate, or some prompts may be skipped.
  • The output format is determined by the file extension of OUTPUT_PATH (e.g., .jsonl, .csv, .parquet).
  • The prefix handling only trims prefixes. It doesn't expand them. Use prefix_tokens_max in the config to set a maximum prefix length, which will trim prefixes that exceed this limit.