Skip to content

[2/2] refactor: decoupled self distillation trainers; cleanup#5883

Open
LeonEricsson wants to merge 82 commits into
huggingface:mainfrom
LeonEricsson:refactor/sdft-sdpo-cleanup
Open

[2/2] refactor: decoupled self distillation trainers; cleanup#5883
LeonEricsson wants to merge 82 commits into
huggingface:mainfrom
LeonEricsson:refactor/sdft-sdpo-cleanup

Conversation

@LeonEricsson
Copy link
Copy Markdown
Collaborator

@LeonEricsson LeonEricsson commented May 29, 2026

What does this PR do?

Follow up #5862.
A SDPO loss change and a bunch of non-behavior changing refactors

Changes

  1. SDPO loss → convex combination. Replaces the additive policy + λ·distillation with the paper's (1 - w)·policy + w·distillation (Section 4.5). sdpo_policy_loss_mode is removed; distillation_weight is now the convex weight w ∈ [0, 1] (1.0 = pure distillation = prior default, 0.0 = pure policy gradient)
  2. Config docstrings completed for SDFTConfig/SDPOConfig, removed unused diagnostics from SDFT.
  3. Metric logging few helpers that move logging out of core functions to make things more readable
  4. Method reordering moved around some methods in SDFT/SDPO.
  5. FSDP fix: wrap transformers generation in summon_full_params.
  6. loss utils: docstrings improved, cleanup dead code

notes

item 1 is behavioral change; the rest is docs/structure/fixes.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

Medium Risk
SDPO training loss semantics change for anyone using hybrid mode or non-1.0 distillation weights; FSDP generation fix affects distributed runs.

Overview
SDPO objective now uses a convex blend (1 - w)·policy + w·distillation with distillation_weight ∈ [0, 1], replacing sdpo_policy_loss_mode (hybrid / distillation_only) and the old additive policy + λ·distillation. Default w=1.0 keeps pure distillation; w=0.0 is GRPO-style policy only. Docs and CLI examples switch to --distillation_weight (e.g. 0.5 for a 50/50 mix). Liger requires distillation_weight=1.0.

SDFT/SDPO trainers rename distillation masking from response_mask to loss_mask (SDFT drops per-sample self_distillation_mask; SDPO still gates on self_distillation_mask). Shared loss utils use selective_log_softmax, rename tail handling to add_tail_bucket, and inline top-k renormalization. FSDP: generate() runs under summon_full_params. SDFT drops diagnostics config fields, defaults disable_dropout=True, expands config docstrings, and refactors metrics (_record_completion_metrics) and method order without changing SDFT’s distillation-only loss path.

Reviewed by Cursor Bugbot for commit f4a4393. Bugbot is set up for automated code reviews on this repo. Configure here.

LeonEricsson and others added 30 commits April 20, 2026 22:04
…onfig parameters moved to sdpoconfig, + other nits
BaseSelfDistillationTrainer was populating _metrics in
_log_self_distillation_metric but had no log() override, so those
metrics were never forwarded to the Trainer's logging system. The fix
merges _metrics into the log dict, prefixes eval keys, and clears after
each logging step.
…l-self-distillation

# Conflicts:
#	trl/experimental/sdft/sdft_trainer.py
#	trl/experimental/sdpo/sdpo_trainer.py
#	trl/experimental/self_distillation/base_self_distillation_trainer.py
#	trl/experimental/self_distillation/online_rollout_mixin.py
#	trl/experimental/self_distillation/teacher_context.py
qgallouedec and others added 16 commits June 1, 2026 11:11
fix: build self-distillation teacher from path for ZeRO-3 compatibility
…o refactor/sdft-sdpo-cleanup

# Conflicts:
#	docs/source/sdpo_trainer.md
#	tests/experimental/test_self_distillation_trainer_behavior.py
#	trl/experimental/sdft/loss_utils.py
#	trl/experimental/sdft/sdft_trainer.py
#	trl/experimental/sdpo/loss_utils.py
#	trl/experimental/sdpo/sdpo_trainer.py
# Conflicts:
#	trl/experimental/sdft/sdft_config.py
#	trl/experimental/sdpo/sdpo.py
#	trl/experimental/sdpo/sdpo_config.py
This reverts commit e7c6380, reversing
changes made to 0718dbc.
Comment thread trl/experimental/sdpo/sdpo.py
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit cb2c378. Configure here.

Comment thread trl/experimental/sdft/sdft_trainer.py
@kashif kashif self-assigned this Jun 4, 2026
@kashif
Copy link
Copy Markdown
Collaborator

kashif commented Jun 4, 2026

Went through this and ran it on 2 GPUs. The convex loss matches the paper (section 4.5) and what we discussed earlier, fsdp2 trains fine at both distillation_weight=0.5 and 1.0, and the experimental tests pass (34). One small thing I noticed: distillation_weight is documented as [0, 1] but nothing enforced it, so a value like 1.5 silently makes the policy term negative. Pushed a tiny guard in __post_init__ for it. Everything else looks good to me, and the fsdp summon matches what grpo/rloo already do. LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants