Fix partial reward margins in collator by he-yufeng · Pull Request #5924 · huggingface/trl

he-yufeng · 2026-06-02T21:30:58Z

What does this PR do?

DataCollatorForPreference currently decides whether to emit a margin tensor by checking only examples[0]. That makes the result depend on batch order:

if the first example has no margin, later margins are silently dropped
if the first example has margin but a later example does not, collation raises KeyError

This PR checks the whole batch instead. If any example provides margin, the collator emits a margin tensor and defaults missing margins to 0.0, matching the optional-field behavior documented for this collator.

I added regression tests for both mixed-order cases.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Tests

python -m pytest tests/test_reward_trainer.py::TestDataCollatorForPreference -q
python -m ruff check trl\trainer\reward_trainer.py tests\test_reward_trainer.py
python -m py_compile trl\trainer\reward_trainer.py tests\test_reward_trainer.py
git diff --check

Note

Medium Risk
Changes reward-training batch collation and can alter loss when batches mix examples with and without margins; scope is small and covered by new tests.

Overview
Fixes DataCollatorForPreference so optional margin values are handled correctly when only some examples in a batch define them.

Previously, collation keyed off examples[0] only: margins on later rows could be dropped, or a missing key on a later row could raise KeyError. The collator now emits a margin tensor if any example has margin, and fills missing entries with 0.0.

Two unit tests cover mixed batches (margin missing on the first vs. on a later example).

^{Reviewed by Cursor Bugbot for commit 5988fbb. Bugbot is set up for automated code reviews on this repo. Configure here.}

qgallouedec · 2026-06-03T09:47:40Z

thanks, can you answer this before: #5539 (comment)

he-yufeng · 2026-06-04T03:47:04Z

Answered on the issue here: #5539 (comment)

Short version: I don't have a canonical paper example for intentionally mixed-margin rows. The practical issue is that margin is optional per example, but the current collator makes the batch decision from only examples[0], so behavior depends on shuffle/order. I also noted that I can switch the PR to a clear ValueError for mixed batches if you prefer stricter semantics.

fix: keep partial reward margins in collator

5988fbb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix partial reward margins in collator#5924

Fix partial reward margins in collator#5924
he-yufeng wants to merge 1 commit into
huggingface:mainfrom
he-yufeng:fix/reward-collator-partial-margin

he-yufeng commented Jun 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

qgallouedec commented Jun 3, 2026

Uh oh!

he-yufeng commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

he-yufeng commented Jun 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

AI writing disclosure

Who can review?

Tests

Uh oh!

qgallouedec commented Jun 3, 2026

Uh oh!

he-yufeng commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

he-yufeng commented Jun 2, 2026 •

edited by cursor Bot

Loading