`guidellm.data.deserializers.trace_mooncake`

Trace deserializer for Mooncake formatted files that generates synthetic prompts per row.

Reads a trace file (timestamp, input_length, output_length, hash_ids) and yields one row per line with a synthetic prompt matching the requested input_length for replay benchmarks.

`TraceMooncakeDataArgs`

Bases: TraceSyntheticDataArgs

Model for Mooncake trace dataset deserializer arguments.

Source code in src/guidellm/data/deserializers/trace_mooncake.py

@DataArgs.register("mooncake")
class TraceMooncakeDataArgs(TraceSyntheticDataArgs):
    """Model for Mooncake trace dataset deserializer arguments."""

    kind: Literal["mooncake"] = Field(  # type: ignore[assignment]
        default="mooncake",
        description="Type identifier for the trace Mooncake dataset deserializer.",
    )
    hash_ids_column: str = Field(
        default="hash_ids",
        description="Column name for lists of hash IDs in the trace file.",
    )
    hash_id_block_size: int = Field(
        gt=0,
        # Default used in Mooncake's paper https://arxiv.org/pdf/2407.00079
        default=512,
        description="Amount of tokens represented by one hash ID.",
    )

`TraceMooncakeDatasetDeserializer`

Bases: DatasetDeserializer

Mooncake trace format deserializer

The Mooncake trace format requires a column for timestamps, prompt token counts, ouput token counts and lists of hash IDs.

Hash IDs are globally unique identifiers based on the current and previous token blocks in a prompt. The relationships of IDs forms a tree, where every first ID in a prompt has a parent node of None. Parent nodes can have an unbounded number of children. Two hash IDs can represent identical blocks of tokens so long as they do not share the same parent (previous ID). For more details, see section 4 of https://arxiv.org/pdf/2407.00079.

Generated prompts match the prompt token count of the row.

Source code in src/guidellm/data/deserializers/trace_mooncake.py

@DatasetDeserializerFactory.register("mooncake")
class TraceMooncakeDatasetDeserializer(DatasetDeserializer):
    """Mooncake trace format deserializer

    The Mooncake trace format requires a column for timestamps, prompt token counts,
    ouput token counts and lists of hash IDs.

    Hash IDs are globally unique identifiers based on the current and previous token
    blocks in a prompt. The relationships of IDs forms a tree, where every first ID
    in a prompt has a parent node of `None`. Parent nodes can have an unbounded
    number of children. Two hash IDs can represent identical blocks of tokens so long
    as they do not share the same parent (previous ID). For more details, see section 4
    of https://arxiv.org/pdf/2407.00079.

    Generated prompts match the prompt token count of the row."""

    def __call__(
        self,
        config: TraceMooncakeDataArgs,
        processor_factory: Callable[[], PreTrainedTokenizerBase],
        random_seed: int,
    ) -> IterableDataset:
        if not config.path.is_file():
            raise DataNotSupportedError(
                f"{type(self).__name__} expects a path to a trace file, "
                f"got {config.path}"
            )
        return _TraceMooncakeDataset(config, processor_factory(), random_seed)