v1.3.0
Features
Qwen 3.6 integration
TRL v1.3 ships training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but ships a slightly different chat template (adds a preserve_thinking flag, tweaks tool-arg stringification), so exact-string template matching needed updates across the stack.
What landed:
- Chat templates: qwen3_6.jinja (verbatim from upstream) and qwen3_6_training.jinja (prefix-preserving + {% generation %} markers for assistant_only_loss=True)
- Response schema: routes to the existing qwen3_5_schema for tool-call parsing; output format unchanged
- Tiny test models for VLM training: tiny-Qwen3_5MoeForConditionalGeneration-3.6 (with MoE-specific shrinking)
- Test matrix updated across SFT/DPO/GRPO/RLOO test_(train|training)_vlm cases
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),  # works out of the box
    train_dataset=dataset,
)
trainer.train()

Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:
from trl import GRPOConfig, GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b

trainer = GRPOTrainer(
    model="Qwen/Qwen3.6-27B",
    reward_funcs=my_reward_fn,
    args=GRPOConfig(...),
    train_dataset=dataset,
    tools=[multiply],
)
trainer.train()

by @qgallouedec in #5642
New experimental TPO trainer
A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside chosen/rejected. The paper reports +7-19 points over DPO/SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, with less data.
from datasets import load_dataset
from trl.experimental.tpo import TPOConfig, TPOTrainer

trainer = TPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
    train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()

Speculative decoding in trl vllm-serve
A new --speculative_config JSON flag exposes vLLM's speculative decoding directly through trl vllm-serve — works with native MTP heads (Qwen3 Next), Eagle3 drafts, etc. — without forking the serve script.
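Because the flag takes a JSON string, shell quoting is the easy part to get wrong. One way around that, sketched here, is to build the value in Python and let shlex quote it. The helper name build_serve_command is hypothetical; only the trl vllm-serve command, its --model and --speculative_config flags, and the JSON keys are taken from this section.

```python
import json
import shlex


def build_serve_command(model: str, speculative_config: dict) -> str:
    """Return a shell-safe `trl vllm-serve` command line (hypothetical helper)."""
    args = [
        "trl",
        "vllm-serve",
        "--model",
        model,
        # The flag expects a JSON string, so serialize the dict first.
        "--speculative_config",
        json.dumps(speculative_config),
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


command = build_serve_command(
    "Qwen/Qwen3-32B",
    {
        "model": "RedHatAI/Qwen3-32B-speculator.eagle3",
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
)
print(command)
```

The printed string can be pasted into a shell or passed to a process launcher after shlex.split.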
# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
--speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'
# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
--speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'

KTO ↔ DPO alignment: nearing the finish line
Twelve more alignment PRs this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, inlining the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset. The goal — promoting KTO out of experimental and into stable — is now within reach for an upcoming release.
PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635
More {% generation %} training chat templates
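Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box:
- GLM-4-MoE by @casinca in #5519
- Gemma/Gemma2 by @ps-abhi in #5523
- Phi-3 by @RudrenduPaul in #5526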
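As a rough sketch of what the {% generation %} markers enable: once the markers yield an assistant mask (1 for tokens inside a generation block, 0 elsewhere), restricting the loss to assistant tokens is a simple masking step. The function name and toy token IDs below are illustrative assumptions, not TRL internals; -100 is the usual ignore index for cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips


def mask_non_assistant_tokens(input_ids: list[int], assistant_mask: list[int]) -> list[int]:
    """Keep labels only where the assistant mask is 1 (illustrative helper)."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy example: three prompt tokens followed by two assistant tokens.
input_ids = [11, 12, 13, 21, 22]
assistant_mask = [0, 0, 0, 1, 1]
labels = mask_non_assistant_tokens(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 21, 22]
```

Without the markers there is no mask to build, which is why a template lacking them cannot support assistant_only_loss=True.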
Other
- Support processor in maybe_apply_chat_template by @albertvillanova in #5567
- Support VLM processors in is_chat_template_prefix_preserving by @qgallouedec in #5558
- Check prefix preservation at the token level (not string level) by @qgallouedec in #5559
- Drop vLLM 0.11 support by @qgallouedec in #5549
- Remove forward_masked_logits by @qgallouedec in #5626
- Remove dead token attributes from experimental trainers by @albertvillanova in #5565
- Set _tokenizer as trainer attribute by @albertvillanova in #5489
- Use PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in #5629
- Renaming of internal variables: async_reward_X to async_X by @qgallouedec in #5616
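On the token-level prefix check listed above: a string prefix does not always tokenize to a token prefix, because a tokenizer may merge characters across the boundary. The toy below illustrates that gap with a made-up greedy tokenizer and vocabulary; it is not how is_chat_template_prefix_preserving is implemented.

```python
def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer over a tiny vocabulary (toy stand-in)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first so merges like "ab" win over "a".
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


def is_token_prefix(prefix: list[str], full: list[str]) -> bool:
    """True when `full` starts with exactly the tokens in `prefix`."""
    return full[: len(prefix)] == prefix


vocab = {"a", "b", "ab"}
prefix_text, full_text = "a", "ab"

string_level = full_text.startswith(prefix_text)
token_level = is_token_prefix(toy_tokenize(prefix_text, vocab), toy_tokenize(full_text, vocab))
print(string_level, token_level)  # True False
```

Training code slices token sequences, so the token-level answer is the one that matters.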
Fixes
- Fix entropy calculation in SFT. Three bugs were fixed at once: entropy was misaligned by one position (next-token shift), averaged over the wrong tokens (used attention_mask instead of label != -100), and aggregated incorrectly across ranks (unweighted mean instead of sum/count). The reported entropy under completion_only_loss=True and sequence parallelism is now correct. The same fix is applied to DPO entropy logging. By @qgallouedec in #5620
- Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in #5538
- Fix generate_tiny_models for gpt-oss by @albertvillanova in #5622
- Fix docstring style in vllm-serve script by @albertvillanova in #5628
- Replace wrong comment about chat template with EOS by @albertvillanova in #5607
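The three parts of the entropy fix listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the TRL code (which works on tensors across processes); all function names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def local_entropy_stats(logits: list[list[float]], labels: list[int]) -> tuple[float, int]:
    """Return (entropy sum, token count) for one sequence on one rank."""
    total, count = 0.0, 0
    # Logits at position t predict the token at t + 1: pair logits[:-1] with labels[1:].
    for step_logits, next_label in zip(logits[:-1], labels[1:]):
        if next_label != IGNORE_INDEX:  # count only tokens that contribute to the loss
            total += token_entropy(step_logits)
            count += 1
    return total, count


def global_mean_entropy(per_rank: list[tuple[float, int]]) -> float:
    """Weighted aggregation: sum of sums divided by sum of counts."""
    total = sum(s for s, _ in per_rank)
    count = sum(c for _, c in per_rank)
    return total / count if count else 0.0


uniform = [0.0, 0.0]  # two equally likely tokens: entropy ln 2
peaked = [50.0, 0.0]  # nearly deterministic: entropy close to 0

rank0 = local_entropy_stats([uniform, uniform, peaked], [IGNORE_INDEX, 5, 6])  # 2 tokens counted
rank1 = local_entropy_stats([peaked, uniform], [IGNORE_INDEX, 7])  # 1 token counted

weighted = global_mean_entropy([rank0, rank1])
naive = (rank0[0] / rank0[1] + rank1[0] / rank1[1]) / 2  # unweighted mean of per-rank means
print(weighted, naive)
```

With unequal token counts per rank, the two numbers differ; the sum/count form is the one that matches a single-process average over all counted tokens.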
Documentation and Examples
- Add chat templates page to web docs by @sergiopaniego in #5581
- Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in #5580
- Update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in #5618
CI
- Add doc-builder style check to pre-commit and CI by @albertvillanova in #5630
- Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in #5631
- Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in #5634
- Fix CI with dev dependencies for Llava models by @albertvillanova in #5499
- Add additional model parameters to TestSupportsToolCalling for improved coverage by @qgallouedec in #5537
- Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in #5546
New Contributors
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #5577
- Support processor in maybe_apply_chat_template by @albertvillanova in #5567
- Remove dead token attributes from experimental trainers by @albertvillanova in #5565
- Support VLM processors in is_chat_template_prefix_preserving by @qgallouedec in #5558
- Align KTO with DPO: Align add_model_tags by @albertvillanova in #5582
- Align KTO with DPO: Align processing_class initialization by @albertvillanova in #5578
- Align KTO with DPO: Align _prepare_dataset by @albertvillanova in #5579
- Align KTO with DPO: Align ref_model preparation for distributed training by @albertvillanova in #5583
- Align KTO with DPO: Make conditional prompt extraction and unpairing in _prepare_dataset by @albertvillanova in #5587
- Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in #5580
- [docs] Add chat templates page to web docs by @sergiopaniego in #5581
- Add additional model parameters to TestSupportsToolCalling for improved coverage by @qgallouedec in #5537
- Fix CI with dev dependencies for Llava models by @albertvillanova in #5499
- Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in #5546
- Set _tokenizer as trainer attribute by @albertvillanova in #5489
- Align KTO with DPO: Support dict eval_dataset by @albertvillanova in #5599
- Align KTO with DPO: Align tokenization by @albertvillanova in #5601
- Check prefix preservation at the token level by @qgallouedec in #5559
- Replace wrong comment about chat template with EOS by @albertvillanova in #5607
- Align KTO with DPO: Support IterableDataset by @albertvillanova in #5600
- Drop vLLM 0.11 support by @qgallouedec in #5549
- Align KTO with DPO: Remove maybe_apply_chat_template by @albertvillanova in #5606
- [TPO] experimental TPO trainer by @kashif in #5506
- fix: Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in #5538
- docs: update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in #5618
- Fix generate_tiny_models for gpt-oss by @albertvillanova in #5622
- Added speculative_config to vllm-serve by @Ofir408 in #5605
- feat(glm-4-moe): Add {% generation %} markers for training chat template by @casinca in #5519
- Fix docstring style in vllm-serve script by @albertvillanova in #5628
- feat: add Gemma/Gemma2 training chat templates with generation markers by @ps-abhi in #5523
- Align KTO with DPO: Inline tokenization, new output format, DataCollatorForKTO by @albertvillanova in #5612
- feat: add Phi-3 training chat template with generation markers by @RudrenduPaul in #5526
- Remove forward_masked_logits by @qgallouedec in #5626
- Use PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in #5629
- Add doc-builder style check to pre-commit and CI by @albertvillanova in #5630
- Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in #5631
- Align KTO with DPO: Move completion assembly from _prepare_dataset to data collator by @albertvillanova in #5632
- Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in #5634
- Fix entropy calculation in SFT by @qgallouedec in #5620
- Renaming of internal variables: async_reward_X to async_X by @qgallouedec in #5616
- Align KTO with DPO: Remove BOS/EOS handling by @albertvillanova in #5635
- Qwen3.6 integration by @qgallouedec in #5642
- Release: v1.3 by @qgallouedec in #5647
Full Changelog: v1.2.0...v1.3.0