ractogateway.finetune.dataset

Training dataset primitives for multimodal LLM fine-tuning.

Classes

RactoTrainingMessage

One turn in a training conversation (role + text + optional file attachments).

RactoTrainingExample

A complete multi-turn conversation used as a single training record.

RactoDataset

Ordered collection of examples with validation, splitting, and JSONL export.

class ractogateway.finetune.dataset.RactoTrainingMessage(role, content, attachments=<factory>)

Bases: object

One conversational turn inside a training example.

Parameters:
  • role (Literal['system', 'user', 'assistant']) – Speaker role.

  • content (str) – Text content of the message.

  • attachments (list[RactoFile]) – Optional images or PDFs for multimodal training examples. Create them with RactoFile.from_path() or RactoFile.from_bytes().

role: Literal['system', 'user', 'assistant']
content: str
attachments: list[RactoFile]
to_openai()

Return an OpenAI-compatible message dict.

Text-only messages produce {"role": ..., "content": str}. Messages with attachments produce a content-block list: {"role": ..., "content": [image_url_block, ..., text_block]}.

Return type:

dict[str, Any]
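The two output shapes described for to_openai() can be sketched with plain dicts. This is a minimal illustration, not the library's implementation: sketch_openai_message and its image_urls parameter are hypothetical names, and the block layout assumes the OpenAI chat content-block convention ("image_url" and "text" block types).

```python
from typing import Any, Optional


def sketch_openai_message(
    role: str, text: str, image_urls: Optional[list] = None
) -> dict:
    """Hypothetical stand-in that mirrors the two documented shapes."""
    if not image_urls:
        # Text-only: content stays a plain string.
        return {"role": role, "content": text}
    # With attachments: image blocks first, then the text block last.
    blocks: list = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    blocks.append({"type": "text", "text": text})
    return {"role": role, "content": blocks}


text_only = sketch_openai_message("user", "What is 2 + 2?")
with_image = sketch_openai_message(
    "user", "Describe this chart.", ["data:image/png;base64,AAAA"]
)
```

The placeholder data URL stands in for a real encoded image; how RactoFile encodes attachments is not specified here.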

to_anthropic()

Return an Anthropic-compatible message dict.

Anthropic takes the system prompt as a top-level system field rather than as a message in the list. RactoTrainingExample.to_anthropic_dict() lifts system messages there automatically.

Return type:

dict[str, Any]

to_gemini_parts()

Return a list of Gemini content parts (text + inline_data).

Return type:

list[dict[str, Any]]

class ractogateway.finetune.dataset.RactoTrainingExample(messages)

Bases: object

A complete conversation used as one training record.

Parameters:

messages (list[RactoTrainingMessage]) –

Ordered turns. Typical shapes:

  • Single-turn : [user, assistant]

  • With system : [system, user, assistant]

  • Multi-turn : [system, user, assistant, user, assistant, …]

Examples

>>> ex = RactoTrainingExample.from_pair(
...     user="What is 2 + 2?",
...     assistant="4",
...     system="You are a maths tutor.",
... )
>>> # Multimodal example (image + question)
>>> ex = RactoTrainingExample.from_pair(
...     user="Describe this chart.",
...     assistant="The chart shows monthly revenue for Q4 2024.",
...     user_attachments=[RactoFile.from_path("chart.png")],
... )

classmethod from_pair(user, assistant, *, system='', user_attachments=None)

Create a single-turn (prompt → completion) training example.

Parameters:
  • user (str) – The user prompt.

  • assistant (str) – The desired model response.

  • system (str) – Optional system prompt prepended to the conversation.

  • user_attachments (list[RactoFile] | None) – Images or other files attached to the user turn.

Return type:

RactoTrainingExample
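As a rough picture of the turn order a prompt/completion pair expands to, the sketch below builds (role, content) tuples. sketch_pair_turns is a hypothetical helper, and treating an empty system string as "no system turn" is an assumption, not documented behavior.

```python
def sketch_pair_turns(user: str, assistant: str, system: str = "") -> list:
    """Hypothetical helper: expand a pair into ordered (role, content) turns."""
    turns = []
    if system:
        # Assumption: an empty system prompt produces no system turn.
        turns.append(("system", system))
    turns.append(("user", user))
    turns.append(("assistant", assistant))
    return turns


plain = sketch_pair_turns("What is 2 + 2?", "4")
tutored = sketch_pair_turns("What is 2 + 2?", "4", system="You are a maths tutor.")
```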

classmethod from_conversation(turns)

Build from a list of (role, content) tuples.

Parameters:

turns (list[tuple[Literal['system', 'user', 'assistant'], str]]) – E.g. [("system", "…"), ("user", "…"), ("assistant", "…")]

Return type:

RactoTrainingExample

to_openai_dict()

Serialize to an OpenAI fine-tuning JSONL record.

Output format:

{"messages": [{"role": "system", "content": "…"}, …]}

Return type:

dict[str, Any]

to_anthropic_dict()

Serialize to an Anthropic fine-tuning JSONL record.

Output format:

{"system": "…", "messages": [{"role": "user", …}, …]}

The system key is only present when a system message exists.

Return type:

dict[str, Any]
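The system-lifting behavior can be illustrated on plain (role, content) tuples. This is only a sketch under stated assumptions: sketch_anthropic_record is a hypothetical name, and joining several system turns with a newline is a guess rather than documented behavior.

```python
def sketch_anthropic_record(turns: list) -> dict:
    """Hypothetical stand-in: move system turns to a top-level "system" key."""
    system_parts = [content for role, content in turns if role == "system"]
    messages = [
        {"role": role, "content": content}
        for role, content in turns
        if role != "system"
    ]
    record: dict = {}
    if system_parts:
        # Assumption: multiple system turns are joined with a newline.
        record["system"] = "\n".join(system_parts)
    record["messages"] = messages
    return record


with_system = sketch_anthropic_record(
    [("system", "You are a maths tutor."), ("user", "2 + 2?"), ("assistant", "4")]
)
without_system = sketch_anthropic_record([("user", "2 + 2?"), ("assistant", "4")])
```

In the second call the "system" key is absent entirely, matching the note that the key appears only when a system message exists.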

to_gemini_dict()

Serialize to a Gemini tuning record.

For text-only single-turn examples (most common) the output is:

{"text_input": "…", "output": "…"}

For multimodal or multi-turn examples the Vertex AI contents format is used:

{"contents": [{"role": "user", "parts": […]}, …]}

Return type:

dict[str, Any]
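The choice between the two Gemini shapes might look like the following sketch. It covers only user and assistant turns; how system turns and attachments are encoded is not described here, so they are left out. sketch_gemini_record and has_attachments are hypothetical names, and mapping the assistant role to "model" follows the Gemini API's role naming, which is an assumption about this library's output.

```python
def sketch_gemini_record(turns: list, has_attachments: bool = False) -> dict:
    """Hypothetical stand-in: pick the flat or the contents-based shape."""
    roles = [role for role, _ in turns]
    if not has_attachments and roles == ["user", "assistant"]:
        # Text-only single turn: flat prompt/response record.
        return {"text_input": turns[0][1], "output": turns[1][1]}
    # Assumption: Gemini names the assistant role "model".
    role_map = {"user": "user", "assistant": "model"}
    return {
        "contents": [
            {"role": role_map[role], "parts": [{"text": content}]}
            for role, content in turns
        ]
    }


single = sketch_gemini_record([("user", "Hi"), ("assistant", "Hello")])
multi = sketch_gemini_record(
    [("user", "Hi"), ("assistant", "Hello"), ("user", "Bye"), ("assistant", "Goodbye")]
)
```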

class ractogateway.finetune.dataset.RactoDataset(examples=None)

Bases: object

An ordered collection of RactoTrainingExample objects.

This is the central data container for building, validating, splitting, and exporting fine-tuning datasets for any supported LLM provider.

Parameters:

examples (list[RactoTrainingExample] | None) – Initial examples. An empty dataset is created when omitted.

Examples

Build from (user, assistant) pairs:

ds = RactoDataset.from_pairs(
    [
        ("What is Python?", "Python is a high-level programming language."),
        ("What is a list?", "A list is a mutable ordered sequence."),
    ],
    system="You are a Python tutor.",
)

Add multimodal examples manually:

ds.add(
    RactoTrainingExample.from_pair(
        user="Describe this image.",
        assistant="The image shows a flowchart with three decision nodes.",
        user_attachments=[RactoFile.from_path("diagram.png")],
    )
)

Export to JSONL for fine-tuning:

train_ds, val_ds = ds.split(0.8, seed=42)
train_ds.export_jsonl("train.jsonl", provider="openai")
val_ds.export_jsonl("val.jsonl", provider="openai")

add(example)

Append a single training example.

Return type:

None

extend(examples)

Append multiple training examples at once.

Return type:

None

classmethod from_pairs(pairs, *, system='')

Build a text-only dataset from (user, assistant) pairs.

Parameters:
  • pairs (list[tuple[str, str]]) – Each tuple is (user_message, expected_assistant_response).

  • system (str) – Optional system prompt applied uniformly to every example.

Return type:

RactoDataset

classmethod from_jsonl(path, provider='openai')

Load a JSONL dataset previously exported for provider.

Supports text-only OpenAI, Anthropic, and Gemini formats.

Parameters:
  • path (str | Path) – Path to the .jsonl file.

  • provider (str) – One of "openai", "anthropic", "gemini".

Return type:

RactoDataset
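For orientation, reading a text-only OpenAI-format JSONL payload back into (role, content) turns could be sketched as below. sketch_parse_openai_jsonl is a hypothetical helper operating on a string; it only illustrates the record layout and says nothing about how the library itself parses files.

```python
import json


def sketch_parse_openai_jsonl(text: str) -> list:
    """Hypothetical helper: one JSON object per line, each with "messages"."""
    examples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        examples.append([(m["role"], m["content"]) for m in record["messages"]])
    return examples


payload = (
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello"}]}\n'
    '{"messages": [{"role": "user", "content": "Bye"}, '
    '{"role": "assistant", "content": "Goodbye"}]}\n'
)
parsed = sketch_parse_openai_jsonl(payload)
```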

shuffle(seed=None)

Return a new dataset with examples in random order.

Parameters:

seed (int | None) – Optional random seed for reproducibility.

Return type:

RactoDataset

split(train_ratio=0.8, *, seed=None)

Split into train and validation datasets.

Parameters:
  • train_ratio (float) – Fraction of examples for the training split. Must be between 0 and 1 (exclusive).

  • seed (int | None) – Optional random seed for reproducible shuffling.

Return type:

tuple[RactoDataset, RactoDataset]

Returns:

tuple[RactoDataset, RactoDataset] – (train_dataset, validation_dataset)
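A seeded shuffle followed by a cut is one common way to get a reproducible train/validation split. The sketch below shows that pattern on a plain list; sketch_split is a hypothetical name, and truncating the cut index with int() is an assumption about rounding, not the library's documented rule.

```python
import random


def sketch_split(items, train_ratio: float = 0.8, seed=None):
    """Hypothetical stand-in: shuffle a copy, then cut at train_ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(items)
    # A private Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)  # assumption: round down
    return shuffled[:cut], shuffled[cut:]


train, val = sketch_split(range(10), 0.8, seed=42)
train_again, val_again = sketch_split(range(10), 0.8, seed=42)
```

With the same seed the two calls return identical splits, which is the property the seed parameter is meant to provide.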

validate(provider='openai')

Check examples for common formatting errors.

Parameters:

provider (str) – Provider to validate against ("openai", "anthropic", or "gemini").

Return type:

list[str]

Returns:

list[str] – A list of human-readable error strings. An empty list means the dataset is ready to use.
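The collect-errors-then-return pattern behind a list[str] result can be sketched as follows. The specific checks here (non-empty example, final assistant turn, non-blank content) are invented for illustration; the checks the library actually performs per provider are not listed in this reference, and sketch_validate is a hypothetical name.

```python
def sketch_validate(examples: list) -> list:
    """Hypothetical checks: gather every problem instead of raising on the first."""
    errors = []
    for index, turns in enumerate(examples):
        if not turns:
            errors.append(f"example {index}: no messages")
            continue
        if turns[-1][0] != "assistant":
            errors.append(f"example {index}: last turn is not from the assistant")
        for role, content in turns:
            if not content.strip():
                errors.append(f"example {index}: empty {role} message")
    return errors


good = [[("user", "Hi"), ("assistant", "Hello")]]
bad = [[("user", "Hi")], []]
no_problems = sketch_validate(good)
problems = sketch_validate(bad)
```

Returning all errors at once lets a caller print the full list and skip the export when the list is non-empty.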

to_jsonl_string(provider='openai')

Serialize all examples to a JSONL string for provider.

Parameters:

provider (str) – One of "openai" (or "generic"), "anthropic", "gemini".

Return type:

str
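JSONL is one JSON object per line. A minimal sketch of serializing already-built records that way is shown below; sketch_to_jsonl is a hypothetical helper, and both ensure_ascii=False and the absence of a trailing newline are assumptions rather than documented choices.

```python
import json


def sketch_to_jsonl(records: list) -> str:
    """Hypothetical helper: dump each record on its own line."""
    # json.dumps never emits raw newlines, so each record stays on one line.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello"}]},
    {"messages": [{"role": "user", "content": "Line one\nline two"},
                  {"role": "assistant", "content": "Noted"}]},
]
jsonl_text = sketch_to_jsonl(records)
round_trip = [json.loads(line) for line in jsonl_text.splitlines()]
```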

export_jsonl(path, provider='openai', *, overwrite=False)

Write the dataset to a .jsonl file on disk.

Parameters:
  • path (str | Path) – Destination file path.

  • provider (str) – One of "openai", "anthropic", "gemini".

  • overwrite (bool) – When False (default), raise FileExistsError if the file already exists.

Return type:

Path

Returns:

Path – The resolved path of the written file.
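The overwrite guard described above can be sketched with pathlib. sketch_export is a hypothetical helper that writes an already-serialized string; UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path


def sketch_export(text: str, path, overwrite: bool = False) -> Path:
    """Hypothetical stand-in: refuse to clobber unless overwrite=True."""
    target = Path(path)
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} already exists; pass overwrite=True")
    target.write_text(text, encoding="utf-8")
    return target.resolve()


with tempfile.TemporaryDirectory() as tmp:
    out = sketch_export('{"a": 1}\n', Path(tmp) / "train.jsonl")
    try:
        sketch_export("ignored\n", out)
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True
    replaced = sketch_export('{"a": 2}\n', out, overwrite=True).read_text(
        encoding="utf-8"
    )
```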

summary()

Return brief statistics about the dataset.

Return type:

dict[str, Any]

Returns:

dict – Keys: examples, total_messages, avg_turns_per_example, multimodal_examples.
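To make the four keys concrete, the sketch below computes them over plain data, where each example is a list of message dicts with an optional "attachments" list. sketch_summary is a hypothetical helper; defining avg_turns_per_example as total messages divided by examples is an assumption about how the library counts turns.

```python
def sketch_summary(examples: list) -> dict:
    """Hypothetical stand-in producing the four documented keys."""
    total = sum(len(example) for example in examples)
    multimodal = sum(
        1
        for example in examples
        if any(message.get("attachments") for message in example)
    )
    return {
        "examples": len(examples),
        "total_messages": total,
        # Assumption: a "turn" is one message; empty datasets average 0.0.
        "avg_turns_per_example": total / len(examples) if examples else 0.0,
        "multimodal_examples": multimodal,
    }


stats = sketch_summary(
    [
        [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}],
        [
            {"role": "system", "content": "Be brief."},
            {"role": "user", "content": "Describe this.", "attachments": ["chart.png"]},
            {"role": "assistant", "content": "A revenue chart."},
            {"role": "user", "content": "Thanks."},
        ],
    ]
)
```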