guidellm.data
DataArgs
Bases: PydanticClassRegistryMixin['DataArgs'], ABC
Base class for data loading and processing argument models.
Subclasses define the arguments for data loading and processing. The class inherits from PydanticClassRegistryMixin, which registers subclasses automatically so that data handling configurations stay flexible and extensible.
Attributes:
| Name | Type | Description |
|---|---|---|
| schema_discriminator | str | Field name for polymorphic deserialization |
Source code in src/guidellm/data/schemas/entrypoints.py
__pydantic_schema_base_type__() classmethod
Return base type for polymorphic validation hierarchy.
Returns:
| Type | Description |
|---|---|
| type[DataArgs] | Base DataArgs class for schema validation |
Source code in src/guidellm/data/schemas/entrypoints.py
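The registration and discriminator mechanism described above can be illustrated with a minimal, standard-library-only sketch. It is not guidellm's implementation: `ArgsBase`, `register`, `from_dict`, the `kind` discriminator value, and both subclasses are hypothetical names chosen for illustration, and plain dataclasses stand in for Pydantic models.

```python
from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass
class ArgsBase:
    """Hypothetical stand-in for a registry-backed argument base class."""

    # Name of the payload field that selects the concrete subclass (assumed).
    schema_discriminator: ClassVar[str] = "kind"
    # Maps discriminator values to registered subclasses.
    registry: ClassVar[dict[str, type["ArgsBase"]]] = {}

    @classmethod
    def register(cls, name: str):
        """Return a class decorator that records a subclass under `name`."""

        def decorator(subclass: type["ArgsBase"]) -> type["ArgsBase"]:
            cls.registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "ArgsBase":
        """Pick the subclass named by the discriminator and instantiate it."""
        fields = dict(payload)  # copy so the caller's dict is not mutated
        name = fields.pop(cls.schema_discriminator)
        return cls.registry[name](**fields)


@ArgsBase.register("file")
@dataclass
class FileArgs(ArgsBase):
    path: str = ""


@ArgsBase.register("synthetic")
@dataclass
class SyntheticArgs(ArgsBase):
    samples: int = 0


# The "kind" field alone decides which subclass is built.
loaded = ArgsBase.from_dict({"kind": "synthetic", "samples": 8})
```

The caller only ever names the base class; the discriminator field in the payload determines the concrete type, which is the behaviour polymorphic deserialization is meant to provide.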
DataFinalizerArgs
Bases: PydanticClassRegistryMixin['DataFinalizerArgs'], ABC
Base class for data finalizer argument models.
Subclasses define the arguments for data finalization. The class inherits from PydanticClassRegistryMixin, which registers subclasses automatically so that data finalization configurations stay flexible and extensible.
Attributes:
| Name | Type | Description |
|---|---|---|
| schema_discriminator | str | Field name for polymorphic deserialization |
Source code in src/guidellm/data/schemas/entrypoints.py
__pydantic_schema_base_type__() classmethod
Return base type for polymorphic validation hierarchy.
Returns:
| Type | Description |
|---|---|
| type[DataFinalizerArgs] | Base DataFinalizerArgs class for schema validation |
Source code in src/guidellm/data/schemas/entrypoints.py
DataLoaderArgs
Bases: PydanticClassRegistryMixin['DataLoaderArgs'], ABC
Base class for data loader argument models.
Subclasses define the arguments for data loader configuration. The class inherits from PydanticClassRegistryMixin, which registers subclasses automatically so that data loading configurations stay flexible and extensible.
Attributes:
| Name | Type | Description |
|---|---|---|
| schema_discriminator | str | Field name for polymorphic deserialization |
Source code in src/guidellm/data/schemas/entrypoints.py
__pydantic_schema_base_type__() classmethod
Return base type for polymorphic validation hierarchy.
Returns:
| Type | Description |
|---|---|
| type[DataLoaderArgs] | Base DataLoaderArgs class for schema validation |
Source code in src/guidellm/data/schemas/entrypoints.py
DataLoaderRegistry
Bases: Generic[DataT_co], RegistryMixin[type[DataLoader]]
Source code in src/guidellm/data/loaders/loader.py
create(config, datasets, preprocessors, finalizer, random_seed, **kwargs) classmethod
Factory method to create a DataLoader instance based on provided configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | DataLoaderArgs | A DataLoaderArgs object containing the configuration. | required |
Source code in src/guidellm/data/loaders/loader.py
DataNotSupportedError
DataPreprocessorArgs
Bases: PydanticClassRegistryMixin['DataPreprocessorArgs'], ABC
Base class for data preprocessor argument models.
Subclasses define the arguments for data preprocessing. The class inherits from PydanticClassRegistryMixin, which registers subclasses automatically so that data preprocessing configurations stay flexible and extensible.
Attributes:
| Name | Type | Description |
|---|---|---|
| schema_discriminator | str | Field name for polymorphic deserialization |
Source code in src/guidellm/data/schemas/entrypoints.py
__pydantic_schema_base_type__() classmethod
Return base type for polymorphic validation hierarchy.
Returns:
| Type | Description |
|---|---|
| type[DataPreprocessorArgs] | Base DataPreprocessorArgs class for schema validation |
Source code in src/guidellm/data/schemas/entrypoints.py
DataTokenizerArgs
Bases: PydanticClassRegistryMixin['DataTokenizerArgs'], ABC
Base class for data tokenizer argument models.
Subclasses define the arguments for data tokenization. The class inherits from PydanticClassRegistryMixin, which registers subclasses automatically so that data tokenization configurations stay flexible and extensible.
Attributes:
| Name | Type | Description |
|---|---|---|
| schema_discriminator | str | Field name for polymorphic deserialization |
Source code in src/guidellm/data/schemas/entrypoints.py
__pydantic_schema_base_type__() classmethod
Return base type for polymorphic validation hierarchy.
Returns:
| Type | Description |
|---|---|
| type[DataTokenizerArgs] | Base DataTokenizerArgs class for schema validation |
Source code in src/guidellm/data/schemas/entrypoints.py
DatasetFinalizer
Bases: Protocol[DataT_co]
Protocol for finalizing dataset rows into a desired data type.
Source code in src/guidellm/data/finalizers/finalizer.py
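As a rough illustration of what a generic, covariant protocol of this kind looks like, the sketch below uses only `typing`. The call signature (a list of row dictionaries in, one value out), `RowFinalizer`, `JoinPrompts`, and `finalize` are assumptions made for the example; the source does not show the actual method signature of DatasetFinalizer.

```python
from typing import Any, Protocol, TypeVar

# Covariant type variable: a finalizer producing a subtype is still acceptable.
DataT_co = TypeVar("DataT_co", covariant=True)


class RowFinalizer(Protocol[DataT_co]):
    """Hypothetical protocol: turn a list of row dicts into one output value."""

    def __call__(self, items: list[dict[str, Any]]) -> DataT_co: ...


class JoinPrompts:
    """Satisfies RowFinalizer[str] structurally; no inheritance is needed."""

    def __call__(self, items: list[dict[str, Any]]) -> str:
        return "\n".join(str(item["prompt"]) for item in items)


def finalize(finalizer: RowFinalizer[str], items: list[dict[str, Any]]) -> str:
    """Accept any object whose __call__ matches the protocol."""
    return finalizer(items)


result = finalize(JoinPrompts(), [{"prompt": "Hello"}, {"prompt": "World"}])
```

Because the match is structural, any callable object with a compatible `__call__` can be passed where the protocol is expected, without subclassing it.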
FinalizerRegistry
Bases: RegistryMixin[type[DatasetFinalizer]]
Source code in src/guidellm/data/finalizers/finalizer.py
create(config) classmethod
Factory method to create a DatasetFinalizer instance based on configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | DataFinalizerArgs | A DataFinalizerArgs object containing the configuration. | required |
Source code in src/guidellm/data/finalizers/finalizer.py
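A registry with a `create(config)` factory can be sketched as follows. This is a simplified stand-in, not the RegistryMixin implementation: `SketchRegistry`, `FinalizerConfig`, its `kind` and `suffix` fields, and `AppendSuffix` are all hypothetical, and raising `ValueError` for an unknown name is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, ClassVar


@dataclass
class FinalizerConfig:
    """Hypothetical config: `kind` names the finalizer, `suffix` is passed on."""

    kind: str
    suffix: str = ""


class SketchRegistry:
    """Hypothetical registry mapping names to finalizer classes."""

    _classes: ClassVar[dict[str, type]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(target: type) -> type:
            cls._classes[name] = target
            return target

        return decorator

    @classmethod
    def create(cls, config: FinalizerConfig):
        """Look up the class named by the config and instantiate it."""
        try:
            target = cls._classes[config.kind]
        except KeyError:
            # Assumed behaviour: fail loudly on an unregistered name.
            raise ValueError(f"Unknown finalizer kind: {config.kind!r}") from None
        return target(suffix=config.suffix)


@SketchRegistry.register("append")
class AppendSuffix:
    def __init__(self, suffix: str) -> None:
        self.suffix = suffix

    def __call__(self, text: str) -> str:
        return text + self.suffix


finalizer = SketchRegistry.create(FinalizerConfig(kind="append", suffix="!"))
```

The design keeps construction logic in one place: new finalizers only need to register themselves, and callers select one through configuration rather than importing the class directly.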
GenerativeRequestFinalizer
Bases: DatasetFinalizer[Iterable[GenerationRequest]]
Finalizer that converts dataset rows into GenerationRequest objects, aggregating usage metrics from the provided columns.
Source code in src/guidellm/data/finalizers/generative.py
GenerativeRequestFinalizerArgs
Bases: DataFinalizerArgs
Model for generative request finalizer arguments.
Source code in src/guidellm/data/finalizers/generative.py
PreprocessorRegistry
Bases: RegistryMixin[type[DatasetPreprocessor] | type[DataDependentPreprocessor]]
Source code in src/guidellm/data/preprocessors/preprocessor.py
create(config) classmethod
Factory method to create a DatasetPreprocessor instance based on configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | DataPreprocessorArgs | A DataPreprocessorArgs object containing the configuration. | required |
Source code in src/guidellm/data/preprocessors/preprocessor.py
TorchDataLoaderArgs
Bases: DataLoaderArgs
Model for PyTorch data loader arguments.
Source code in src/guidellm/data/loaders/torch.py
create_data_loader(loader_config, data_config, tokenizer_config, column_mapper_config, preprocessors_config, finalizer_config, random_seed=42, console=None) async
Factory function to create a DataLoader instance based on provided configurations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| loader_config | DataLoaderArgs | Configuration for the data loader. | required |
| data_config | list[DataArgs] | List of configurations for dataset deserialization. | required |
| tokenizer_config | DataTokenizerArgs | Configuration for the tokenizer factory. | required |
| column_mapper_config | DataPreprocessorArgs | Configuration for the column mapping preprocessor. | required |
| preprocessors_config | list[DataPreprocessorArgs] | List of configurations for additional preprocessors. | required |
| finalizer_config | DataFinalizerArgs | Configuration for the dataset finalizer. | required |
| random_seed | int | Seed for random operations to ensure reproducibility. | 42 |
| console | Console \| None | Optional Console instance for logging and progress display. | None |
Returns:
| Type | Description |
|---|---|
| DataLoader | An instance of DataLoader configured according to the provided arguments. |
Source code in src/guidellm/data/entrypoints.py
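The parameter list suggests a pipeline of a column mapper, further preprocessors, and a finalizer, with a seed controlling randomness. The sketch below shows one plausible way such pieces compose. It is an assumption-laden illustration and does not call guidellm: `build_rows`, the step functions, the row format, and the ordering of the stages are all hypothetical.

```python
import random
from typing import Any, Callable, Iterable

Row = dict[str, Any]
Step = Callable[[Row], Row]


def build_rows(
    rows: Iterable[Row],
    column_mapper: Step,
    preprocessors: list[Step],
    finalizer: Callable[[Row], Any],
    random_seed: int = 42,
) -> list[Any]:
    """Shuffle rows reproducibly, then map columns, preprocess, and finalize."""
    rng = random.Random(random_seed)  # local RNG: same seed, same order
    ordered = list(rows)
    rng.shuffle(ordered)
    results = []
    for row in ordered:
        row = column_mapper(row)
        for step in preprocessors:
            row = step(row)
        results.append(finalizer(row))
    return results


def rename_question(row: Row) -> Row:
    # Hypothetical column mapping: "question" becomes "prompt".
    return {"prompt": row["question"]}


def add_length(row: Row) -> Row:
    # Hypothetical extra preprocessor: record the prompt length in characters.
    return {**row, "chars": len(row["prompt"])}


def to_pair(row: Row) -> tuple[str, int]:
    # Hypothetical finalizer: reduce the row to the values a caller needs.
    return (row["prompt"], row["chars"])


data = [{"question": "a"}, {"question": "bb"}, {"question": "ccc"}]
first = build_rows(data, rename_question, [add_length], to_pair, random_seed=42)
second = build_rows(data, rename_question, [add_length], to_pair, random_seed=42)
```

Using a dedicated `random.Random(random_seed)` instance rather than the module-level generator means two runs with the same seed produce the same order regardless of other random calls in the program.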
process_dataset(data, output_path, processor, config, processor_args=None, data_args=None, data_column_mapper=None, short_prompt_strategy=ShortPromptStrategy.IGNORE, pad_char=None, concat_delimiter=None, include_prefix_in_token_count=False, push_to_hub=False, hub_dataset_id=None, random_seed=42)
Main method to process and save a dataset with sampled prompt/output token counts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict | Path or identifier for dataset input. | required |
| output_path | str \| Path | File path to save the processed dataset. | required |
| processor | str \| Path \| PreTrainedTokenizerBase | Tokenizer object or its config. | required |
| config | str \| Path | PreprocessDatasetConfig string or file path. | required |
| processor_args | dict[str, Any] \| None | Optional processor arguments. | None |
| data_args | dict[str, Any] \| None | Optional data loading arguments. | None |
| data_column_mapper | dict[str, str] \| None | Optional column mapping dictionary. | None |
| short_prompt_strategy | ShortPromptStrategy | Strategy for handling short prompts. | IGNORE |
| pad_char | str \| None | Character used when padding short prompts. | None |
| concat_delimiter | str \| None | Delimiter for concatenation strategy. | None |
| include_prefix_in_token_count | bool | Whether to count the prefix toward the prompt token count. When True, prefix trimming is disabled, the prefix is kept as-is, and its token count is subtracted from the prompt token budget. | False |
| push_to_hub | bool | Whether to push to Hugging Face Hub. | False |
| hub_dataset_id | str \| None | Dataset ID on Hugging Face Hub. | None |
| random_seed | int | Seed for random sampling. | 42 |
Raises:
| Type | Description |
|---|---|
| ValueError | If the output path is invalid or the conditions for pushing to the Hub are not met. |
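The effect of `include_prefix_in_token_count` on the prompt budget can be shown with a small arithmetic sketch. `remaining_prompt_budget` is a hypothetical helper, not part of guidellm, and clamping the result at zero is an assumption; the source only states that the prefix token count is subtracted from the prompt token budget when the flag is True.

```python
def remaining_prompt_budget(
    prompt_tokens: int,
    prefix_tokens: int,
    include_prefix_in_token_count: bool,
) -> int:
    """Return how many tokens remain for the prompt body (hypothetical helper)."""
    if include_prefix_in_token_count:
        # The prefix counts toward the target, so it consumes part of the budget.
        # Clamping at zero is an assumption for prefixes longer than the budget.
        return max(prompt_tokens - prefix_tokens, 0)
    # Otherwise the prefix is accounted for separately and the budget is unchanged.
    return prompt_tokens


with_prefix = remaining_prompt_budget(100, 30, include_prefix_in_token_count=True)
without_prefix = remaining_prompt_budget(100, 30, include_prefix_in_token_count=False)
```

With a target of 100 prompt tokens and a 30-token prefix, the first call leaves 70 tokens for the prompt body and the second leaves the full 100.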