[2/2] refactor: decoupled self distillation trainers; cleanup by LeonEricsson · Pull Request #5883 · huggingface/trl

LeonEricsson · 2026-05-29T12:43:19Z

What does this PR do?

Follow-up to #5862.
One SDPO loss change plus a set of refactors that do not change behavior.

Changes

SDPO loss → convex combination. Replaces the additive policy + λ·distillation with the paper's (1 - w)·policy + w·distillation (Section 4.5). sdpo_policy_loss_mode is removed; distillation_weight is now the convex weight w ∈ [0, 1] (1.0 = pure distillation = prior default, 0.0 = pure policy gradient).
Config: docstrings completed for SDFTConfig/SDPOConfig; unused diagnostics removed from SDFT.
Metric logging: a few helpers move logging out of the core functions to make them more readable.
Method reordering: moved some methods around in SDFT/SDPO.
FSDP fix: wrap transformers generation in summon_full_params.
Loss utils: docstrings improved, dead code cleaned up.
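To make the first item concrete, here is a minimal sketch contrasting the old additive form with the new convex form. The function names additive_loss and convex_loss are hypothetical and use plain floats; the actual trainer works on tensors and its internals are not shown in this PR description.

```python
def additive_loss(policy_loss: float, distill_loss: float, lam: float) -> float:
    """Old form described above: policy + lambda * distillation."""
    return policy_loss + lam * distill_loss


def convex_loss(policy_loss: float, distill_loss: float, w: float) -> float:
    """New form described above: (1 - w) * policy + w * distillation."""
    return (1.0 - w) * policy_loss + w * distill_loss


# Illustrative values only (not taken from any training run).
policy, distill = 2.0, 4.0
pure_distillation = convex_loss(policy, distill, w=1.0)  # policy term vanishes
pure_policy = convex_loss(policy, distill, w=0.0)        # distillation term vanishes
even_mix = convex_loss(policy, distill, w=0.5)           # equal weighting
```

With the convex form the two coefficients always sum to 1, so changing w shifts the balance between the terms without changing the overall scale of the loss the way an additive λ does.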

Notes

Item 1 is a behavioral change; the rest is docs, structure, and fixes.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Note

Medium Risk
SDPO training loss semantics change for anyone using hybrid mode or non-1.0 distillation weights; FSDP generation fix affects distributed runs.

Overview
SDPO objective now uses a convex blend (1 - w)·policy + w·distillation with distillation_weight ∈ [0, 1], replacing sdpo_policy_loss_mode (hybrid / distillation_only) and the old additive policy + λ·distillation. Default w=1.0 keeps pure distillation; w=0.0 is GRPO-style policy only. Docs and CLI examples switch to --distillation_weight (e.g. 0.5 for a 50/50 mix). Liger requires distillation_weight=1.0.

SDFT/SDPO trainers rename distillation masking from response_mask to loss_mask (SDFT drops per-sample self_distillation_mask; SDPO still gates on self_distillation_mask). Shared loss utils use selective_log_softmax, rename tail handling to add_tail_bucket, and inline top-k renormalization. FSDP: generate() runs under summon_full_params. SDFT drops diagnostics config fields, defaults disable_dropout=True, expands config docstrings, and refactors metrics (_record_completion_metrics) and method order without changing SDFT’s distillation-only loss path.
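The overview mentions that the shared loss utils use selective_log_softmax. As a rough illustration of what that operation computes (the log-probability of one chosen token per position), here is a dependency-free reference sketch. The name selective_log_softmax_ref and the list-based inputs are assumptions made for illustration; the real helper operates on tensors and may differ in details.

```python
import math


def selective_log_softmax_ref(logits, index):
    """Return log_softmax(row)[idx] for each (row, idx) pair.

    logits: list of rows, each a list of floats (one row per token position).
    index:  list of ints, the selected vocabulary id for each row.
    """
    selected = []
    for row, idx in zip(logits, index):
        # Subtract the row maximum before exponentiating for numerical stability.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        selected.append(row[idx] - log_sum_exp)
    return selected


# Two positions with uniform logits over 2 and 3 vocabulary entries.
result = selective_log_softmax_ref([[0.0, 0.0], [1.0, 1.0, 1.0]], [0, 2])
```

Gathering only the selected entry avoids keeping a full log-softmax over the vocabulary when only the sampled token's log-probability is needed.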

Reviewed by Cursor Bugbot for commit f4a4393. Bugbot is set up for automated code reviews on this repo.

BaseSelfDistillationTrainer was populating _metrics in _log_self_distillation_metric but had no log() override, so those metrics were never forwarded to the Trainer's logging system. The fix merges _metrics into the log dict, prefixes eval keys, and clears after each logging step.
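The fix described above (merge buffered metrics into the log dict, prefix eval keys, clear after each logging step) can be sketched with a small stand-in class. MetricBuffer, record, and the mode argument are hypothetical names for illustration; the actual trainer overrides the Trainer's log() method and its signature is not shown here.

```python
from collections import defaultdict


class MetricBuffer:
    """Toy stand-in for a trainer that buffers metrics between logging steps."""

    def __init__(self):
        self._metrics = {"train": defaultdict(list), "eval": defaultdict(list)}

    def record(self, mode, name, value):
        # Accumulate one observation; several may arrive before the next log call.
        self._metrics[mode][name].append(float(value))

    def log(self, logs, mode="train"):
        # Average everything buffered since the previous logging step.
        averaged = {k: sum(v) / len(v) for k, v in self._metrics[mode].items() if v}
        if mode == "eval":
            # Prefix eval keys so they do not collide with training keys.
            averaged = {f"eval_{k}": v for k, v in averaged.items()}
        merged = {**logs, **averaged}
        # Clear the buffer so stale values are not logged again.
        self._metrics[mode].clear()
        return merged


buf = MetricBuffer()
buf.record("train", "distill_loss", 1.0)
buf.record("train", "distill_loss", 3.0)
train_logs = buf.log({"loss": 0.5})
```

Without the merge step, values collected by record would sit in the buffer and never reach the logging backend, which is the symptom the commit message describes.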

…l-self-distillation # Conflicts: # trl/experimental/sdft/sdft_trainer.py # trl/experimental/sdpo/sdpo_trainer.py # trl/experimental/self_distillation/base_self_distillation_trainer.py # trl/experimental/self_distillation/online_rollout_mixin.py # trl/experimental/self_distillation/teacher_context.py

…o refactor/sdft-sdpo-cleanup # Conflicts: # docs/source/sdpo_trainer.md # tests/experimental/test_self_distillation_trainer_behavior.py # trl/experimental/sdft/loss_utils.py # trl/experimental/sdft/sdft_trainer.py # trl/experimental/sdpo/loss_utils.py # trl/experimental/sdpo/sdpo_trainer.py

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cb2c378.

kashif · 2026-06-04T14:34:28Z

Went through this and ran it on 2 GPUs. The convex loss matches the paper (section 4.5) and what we discussed earlier, fsdp2 trains fine at both distillation_weight=0.5 and 1.0, and the experimental tests pass (34). One small thing I noticed: distillation_weight is documented as [0, 1] but nothing enforced it, so a value like 1.5 silently makes the policy term negative. Pushed a tiny guard in __post_init__ for it. Everything else looks good to me, and the fsdp summon matches what grpo/rloo already do. LGTM 👍
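A range guard of the kind described in this comment might look like the following sketch. ToySDPOConfig is a hypothetical stand-in dataclass, not the real SDPOConfig, and the error message wording is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToySDPOConfig:
    """Hypothetical stand-in config holding only the convex weight."""

    distillation_weight: float = 1.0

    def __post_init__(self):
        # Outside [0, 1] one of the two coefficients (1 - w) or w becomes negative.
        if not 0.0 <= self.distillation_weight <= 1.0:
            raise ValueError(
                f"distillation_weight must be in [0, 1], got {self.distillation_weight}"
            )


ok = ToySDPOConfig(distillation_weight=0.5)

# With w = 1.5 the policy coefficient (1 - w) would be -0.5, so it is rejected.
try:
    ToySDPOConfig(distillation_weight=1.5)
    rejected = False
except ValueError:
    rejected = True
```

Validating in __post_init__ surfaces the mistake at configuration time instead of letting a sign-flipped policy term silently affect training.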

LeonEricsson and others added 30 commits April 20, 2026 22:04

v0.1 transition sdft into unified base

06f02a8

sdft transition v1 complete, starting on sdpo

be1bcbc

sdpo transitioned, needs testing

0628701

remove legacy trainers

55111ff

sdft and sdpo transitioned and tested with new base

81def8a

restructure training batch builder

bad6b62

nits

ef43c95

wip removing mixin

efe0eda

remove mixin, refactoring and cleanup

fa1a8f3

always set teacher_model

6a7d5a8

align generation tokenization with grpotrainer

56b2fd1

fix: generation_kwargs bug

4a9d527

fix: incorrect import source

196feee

fixes: cleanup, standardized tokenization, distill loss=0 fix, sdpo c…

3c87400

…onfig parameters moved to sdpoconfig, + other nits

tests: ported old tests + new tests for base class

d2a78e2

couple more tests and test cleanup

8807088

test: nit fix

0612699

move loss aggregation to loss_util + a few docstrings

3d0cd72

fix: minor cursor issues + config docstrings

a432c20

fix: rename full logit distillation+topk into explicit flags

e30ca04

fix(self-distillation): warn on preloaded peft students

3a9ecb2

docs: cleanup

03718eb

fix: distillation mode default hparams

1ac2f3c

merge peft validation, tokenizer fixes, etc from upstream/main

e4bfd50

remove base class -> seperate independent SDFT/SDPO

5b35e18

shifted boundary between shared/common helpers for self distill trainers

4452dcd

refactoring and cleaning test suite

05da798

removed unsatisfactory tests

6238458

qgallouedec and others added 16 commits June 1, 2026 11:11

Merge branch 'main' into refactor/self-contained-self-distillation-tr…

7f8539d

…ainers

fix: build self-distillation teacher from path for ZeRO-3 compatibility

4d971e8

feat: liger fused JSD loss for SDFT

4c57ca0

feat: liger fused JSD loss for SDPO

0c0a40a

test: add self-distillation liger equivalence coverage

a4145bd

Merge pull request #4 from kashif/fix/self-distillation-zero3-teacher

cdf75ba

fix: build self-distillation teacher from path for ZeRO-3 compatibility

remove aggregate_loss for distillation losses, hardcode grpo-style no…

d68be79

…rmalization using per sequence length

liger: normalize distillation loss per sequence to match non-liger path

820bb21

liger: drop teacher.eval() that flipped the shared student to eval

c77491d

guard against ema teacher with non-pure-LoRA PEFT

a643e81

Merge branch 'main' into refactor/self-contained-self-distillation-tr…

ce5e205

…ainers

docs: document use_liger_kernel and ema PEFT limitation for sdft/sdpo

efe6cd1

Merge branch 'main' into refactor/sdft-sdpo-cleanup

e7c6380

# Conflicts: # trl/experimental/sdft/sdft_config.py # trl/experimental/sdpo/sdpo.py # trl/experimental/sdpo/sdpo_config.py

Revert "Merge branch 'main' into refactor/sdft-sdpo-cleanup"

77814aa

This reverts commit e7c6380, reversing changes made to 0718dbc.

sync main

bc9be6a

cursor Bot reviewed Jun 2, 2026

Comment thread trl/experimental/sdpo/sdpo.py

restore branch state following failed merge

cb2c378

cursor Bot reviewed Jun 2, 2026

Comment thread trl/experimental/sdft/sdft_trainer.py

LeonEricsson added 2 commits June 3, 2026 09:33

refactor: sdpo response_mask to loss_mask

6f826d5

Merge branch 'main' into refactor/sdft-sdpo-cleanup

405c39d

LeonEricsson requested review from kashif and qgallouedec June 3, 2026 07:38

refactor: re-order sdft/sdpo methods, refactor metric logging

b1703de

kashif self-assigned this Jun 4, 2026

validate distillation_weight is in [0, 1]

66d1060

kashif approved these changes Jun 4, 2026

Merge branch 'refactor/sdft-sdpo-cleanup' of github.com:LeonEricsson/…

f4a4393

…trl into refactor/sdft-sdpo-cleanup

kashif mentioned this pull request Jun 4, 2026

SDFT/SDPO: live teacher logprobs from the vLLM server LeonEricsson/trl#5

Open

LeonEricsson wants to merge 82 commits into huggingface:main from LeonEricsson:refactor/sdft-sdpo-cleanup

4 participants
