v1.3.0

@qgallouedec

Features

Qwen 3.6 integration

TRL v1.3 adds training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but ships a slightly different chat template (it adds a preserve_thinking flag and tweaks tool-argument stringification), so code that matches templates by exact string needed updates across the stack.

What landed:

Chat templates: qwen3_6.jinja (verbatim from upstream) and qwen3_6_training.jinja (prefix-preserving + {% generation %} markers for assistant_only_loss=True)
Response schema: routes to the existing qwen3_5_schema for tool-call parsing; the output format is unchanged
Tiny test models for VLM training: tiny-Qwen3_5MoeForConditionalGeneration-3.6 (with MoE-specific shrinking)
Test matrix updated across SFT/DPO/GRPO/RLOO test_(train|training)_vlm cases

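from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed example dataset (not part of the original snippet); substitute your own conversational dataset.
dataset = load_dataset("trl-lib/Capybara", split="train")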

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),  # works out of the box
    train_dataset=dataset,
)
trainer.train()

Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:

from trl import GRPOConfig, GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b

# my_reward_fn and dataset are placeholders for your own reward function and dataset.
trainer = GRPOTrainer(
    model="Qwen/Qwen3.6-27B",
    reward_funcs=my_reward_fn,
    args=GRPOConfig(...),
    train_dataset=dataset,
    tools=[multiply],
)
trainer.train()
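
Tools are passed as plain Python functions, so the type hints and the docstring are the information a schema generator has to work with. The stdlib-only sketch below shows that information being read. `describe_tool` is a hypothetical helper written for illustration; it is not how TRL or transformers builds the actual tool schema.

```python
import inspect
from typing import get_type_hints


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


def describe_tool(func):
    """Hypothetical helper: collect the name, summary line, and annotated types of a tool."""
    hints = get_type_hints(func)
    return_type = hints.pop("return", None)
    doc = inspect.getdoc(func) or ""  # getdoc strips indentation and leading blank lines
    return {
        "name": func.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "parameters": {name: tp.__name__ for name, tp in hints.items()},
        "returns": return_type.__name__ if return_type is not None else None,
    }


tool_info = describe_tool(multiply)
print(tool_info)
```

A function without type hints or a docstring would give such a generator almost nothing to describe, which is why the example tool above is fully annotated.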

by @qgallouedec in #5642

New experimental TPO trainer

A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside the chosen and rejected ones. The paper reports gains of 7 to 19 points over DPO and SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, while using less data.

from datasets import load_dataset
from trl.experimental.tpo import TPOConfig, TPOTrainer

trainer = TPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
    train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()
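
As a rough intuition for what a gold completion adds, one reading of the idea is a preference term on chosen versus rejected plus a likelihood term on the gold completion. The sketch below works on per-sequence log-probabilities under that assumption. The function name, `beta`, and `alpha` are hypothetical, and the exact objective used by TPOTrainer may differ, so check the trainer documentation before relying on this form.

```python
import math


def tpo_style_loss(logp_chosen, logp_rejected, logp_gold, beta=0.1, alpha=1.0):
    """Assumed form: -log(sigmoid(beta * (chosen - rejected))) - alpha * logp_gold."""
    margin = beta * (logp_chosen - logp_rejected)
    preference_term = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    gold_term = -alpha * logp_gold  # negative log-likelihood of the gold completion
    return preference_term + gold_term


# The loss is lower when the chosen completion is more likely than the rejected one.
better = tpo_style_loss(logp_chosen=-1.0, logp_rejected=-3.0, logp_gold=-2.0)
worse = tpo_style_loss(logp_chosen=-3.0, logp_rejected=-1.0, logp_gold=-2.0)
print(better < worse)
```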

by @kashif in #5506

Speculative decoding in `trl vllm-serve`

A new --speculative_config flag, which takes a JSON string, exposes vLLM's speculative decoding directly through trl vllm-serve. It works with native MTP heads (Qwen3 Next), Eagle3 drafts, and similar methods, without forking the serve script.

# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
    --speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'

# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
    --speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'
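
When the command is generated from a script, the JSON value can be built from a dictionary instead of being quoted by hand. The sketch below uses only the standard library; `build_serve_command` is a hypothetical helper, and the flag names are the ones shown in the commands above.

```python
import json
import shlex


def build_serve_command(model, speculative_config):
    """Hypothetical helper: return a shell-safe `trl vllm-serve` command line."""
    spec_json = json.dumps(speculative_config)
    args = ["trl", "vllm-serve", "--model", model, "--speculative_config", spec_json]
    # shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
    return " ".join(shlex.quote(arg) for arg in args)


cmd = build_serve_command(
    "Qwen/Qwen3-32B",
    {
        "model": "RedHatAI/Qwen3-32B-speculator.eagle3",
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
)
print(cmd)
```

Serializing with json.dumps avoids mismatched quotes, which are an easy mistake when the config is typed inline.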

by @Ofir408 in #5605

KTO ↔ DPO alignment: nearing the finish line

Twelve more alignment PRs landed this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, collapsing the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset. The goal of promoting KTO out of experimental and into stable is now within reach for an upcoming release.

PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635

More `{% generation %}` training chat templates

Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box:

Gemma / Gemma 2 by @ps-abhi in #5523
Phi-3 by @RudrenduPaul in #5526
GLM-4-MoE by @casinca in #5519
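
To illustrate what the markers enable: the generation spans in a template yield a per-token assistant mask, and with assistant_only_loss=True the loss ignores every token outside it. The sketch below is a minimal pure-Python illustration, assuming a hypothetical helper `mask_labels` and the conventional ignore index of -100. It is not TRL's internal code.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy losses


def mask_labels(input_ids, assistant_mask):
    """Keep labels only where assistant_mask is 1; mark everything else as ignored."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy sequence: three prompt tokens followed by two assistant tokens.
input_ids = [11, 12, 13, 21, 22]
assistant_mask = [0, 0, 0, 1, 1]
labels = mask_labels(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 21, 22]
```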

Other

Support processor in maybe_apply_chat_template by @albertvillanova in #5567
Support VLM processors in is_chat_template_prefix_preserving by @qgallouedec in #5558
Check prefix preservation at the token level (not string level) by @qgallouedec in #5559
Drop vLLM 0.11 support by @qgallouedec in #5549
Remove forward_masked_logits by @qgallouedec in #5626
Remove dead token attributes from experimental trainers by @albertvillanova in #5565
Set _tokenizer as trainer attribute by @albertvillanova in #5489
Use PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in #5629
Renaming of internal variables: async_reward_X to async_X by @qgallouedec in #5616
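
The move from string-level to token-level prefix checks in the list above matters because a tokenizer can merge characters across the boundary where a shorter conversation ends. The sketch below shows the effect with a toy longest-match tokenizer. The tokenizer, the vocabulary, and the helper names are assumptions made for illustration, not TRL's is_chat_template_prefix_preserving implementation.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer standing in for a real one."""
    tokens, i = [], 0
    while i < len(text):
        match = max((v for v in vocab if text.startswith(v, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
        tokens.append(match)
        i += len(match)
    return tokens


def is_token_prefix(short_tokens, long_tokens):
    """True if the shorter token sequence is an exact prefix of the longer one."""
    return long_tokens[: len(short_tokens)] == short_tokens


vocab = {"<a>", "hi", "\n", "\n\n", "ok"}
short_text = "<a>hi\n"      # rendering of the shorter conversation
long_text = "<a>hi\n\nok"   # rendering after one more turn

string_level = long_text.startswith(short_text)
token_level = is_token_prefix(greedy_tokenize(short_text, vocab), greedy_tokenize(long_text, vocab))
print(string_level, token_level)  # True False
```

Here the two newlines merge into a single token in the longer rendering, so the strings share a prefix while the token sequences do not.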

Fixes

Fix entropy calculation in SFT, which had three bugs at once: it was misaligned by one position (next-token shift), averaged over the wrong tokens (attention_mask instead of labels != -100), and aggregated incorrectly across ranks (unweighted mean instead of sum/count). The reported entropy under completion_only_loss=True and sequence parallelism is now correct. The same fix is applied to DPO entropy logging. By @qgallouedec in #5620
Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in #5538
Fix generate_tiny_models for gpt-oss by @albertvillanova in #5622
Fix docstring style in vllm-serve script by @albertvillanova in #5628
Replace wrong comment about chat template with EOS by @albertvillanova in #5607
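
The three parts of the SFT entropy fix listed above can be sketched in plain Python. This is a simplified illustration under assumed names (`local_entropy_stats`, `global_mean_entropy`), with per-rank results passed in as a list instead of being gathered by a distributed backend. It is not the trainer's actual code.

```python
import math

IGNORE_INDEX = -100


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def local_entropy_stats(step_probs, labels):
    """Return (entropy_sum, token_count) for one rank.

    step_probs[t] is the predicted distribution for the token at position t + 1,
    so it is paired with labels[t + 1] (the next-token shift).
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        if labels[t + 1] != IGNORE_INDEX:  # mask on labels, not on attention_mask
            total += token_entropy(step_probs[t])
            count += 1
    return total, count


def global_mean_entropy(per_rank_stats):
    """Aggregate as sum / count across ranks instead of averaging per-rank means."""
    total = sum(s for s, _ in per_rank_stats)
    count = sum(c for _, c in per_rank_stats)
    return total / count if count else 0.0


# Rank 0 has two supervised tokens, rank 1 has one.
rank0 = local_entropy_stats([[0.5, 0.5], [0.5, 0.5], [1.0], [0.5, 0.5]], [-100, -100, 5, 6])
rank1 = local_entropy_stats([[0.25] * 4, [0.5, 0.5]], [-100, 7])

weighted = global_mean_entropy([rank0, rank1])
unweighted = (rank0[0] / rank0[1] + rank1[0] / rank1[1]) / 2
print(weighted, unweighted)
```

With an uneven number of supervised tokens per rank, the unweighted mean of per-rank means differs from the sum/count value, which is the aggregation error the fix addresses.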

Documentation and Examples

Add chat templates page to web docs by @sergiopaniego in #5581
Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in #5580
Update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in #5618

CI

Add doc-builder style check to pre-commit and CI by @albertvillanova in #5630
Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in #5631
Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in #5634
Fix CI with dev dependencies for Llava models by @albertvillanova in #5499
Add additional model parameters to TestSupportsToolCalling for improved coverage by @qgallouedec in #5537
Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in #5546

New Contributors

@Ofir408 made their first contribution in #5605
@ps-abhi made their first contribution in #5523

What's Changed

⬆️ Bump dev version by @qgallouedec in #5577
Support processor in maybe_apply_chat_template by @albertvillanova in #5567
Remove dead token attributes from experimental trainers by @albertvillanova in #5565
Support VLM processors in is_chat_template_prefix_preserving by @qgallouedec in #5558
Align KTO with DPO: Align add_model_tags by @albertvillanova in #5582
Align KTO with DPO: Align processing_class initialization by @albertvillanova in #5578
Align KTO with DPO: Align _prepare_dataset by @albertvillanova in #5579
Align KTO with DPO: Align ref_model preparation for distributed training by @albertvillanova in #5583
Align KTO with DPO: Make conditional prompt extraction and unpairing in _prepare_dataset by @albertvillanova in #5587
Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in #5580
[docs] Add chat templates page to web docs by @sergiopaniego in #5581
Add additional model parameters to TestSupportsToolCalling for improved coverage by @qgallouedec in #5537
Fix CI with dev dependencies for Llava models by @albertvillanova in #5499
Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in #5546
Set _tokenizer as trainer attribute by @albertvillanova in #5489
Align KTO with DPO: Support dict eval_dataset by @albertvillanova in #5599
Align KTO with DPO: Align tokenization by @albertvillanova in #5601
Check prefix preservation at the token level by @qgallouedec in #5559
Replace wrong comment about chat template with EOS by @albertvillanova in #5607
Align KTO with DPO: Support IterableDataset by @albertvillanova in #5600
Drop vLLM 0.11 support by @qgallouedec in #5549
Align KTO with DPO: Remove maybe_apply_chat_template by @albertvillanova in #5606
[TPO] experimental TPO trainer by @kashif in #5506
fix: Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in #5538
docs: update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in #5618
Fix generate_tiny_models for gpt-oss by @albertvillanova in #5622
Added speculative_config to vllm-serve by @Ofir408 in #5605
feat(glm-4-moe): Add {% generation %} markers for training chat template by @casinca in #5519
Fix docstring style in vllm-serve script by @albertvillanova in #5628
feat: add Gemma/Gemma2 training chat templates with generation markers by @ps-abhi in #5523
Align KTO with DPO: Inline tokenization, new output format, DataCollatorForKTO by @albertvillanova in #5612
feat: add Phi-3 training chat template with generation markers by @RudrenduPaul in #5526
Remove forward_masked_logits by @qgallouedec in #5626
Use PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in #5629
Add doc-builder style check to pre-commit and CI by @albertvillanova in #5630
Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in #5631
Align KTO with DPO: Move completion assembly from _prepare_dataset to data collator by @albertvillanova in #5632
Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in #5634
Fix entropy calculation in SFT by @qgallouedec in #5620
Renaming of internal variables: async_reward_X to async_X by @qgallouedec in #5616
Align KTO with DPO: Remove BOS/EOS handling by @albertvillanova in #5635
Qwen3.6 integration by @qgallouedec in #5642
Release: v1.3 by @qgallouedec in #5647

Full Changelog: v1.2.0...v1.3.0
