Skip to content

Fix partial reward margins in collator#5924

Open
he-yufeng wants to merge 1 commit into
huggingface:mainfrom
he-yufeng:fix/reward-collator-partial-margin
Open

Fix partial reward margins in collator#5924
he-yufeng wants to merge 1 commit into
huggingface:mainfrom
he-yufeng:fix/reward-collator-partial-margin

Conversation

@he-yufeng
Copy link
Copy Markdown

@he-yufeng he-yufeng commented Jun 2, 2026

What does this PR do?

Fixes #5539.

DataCollatorForPreference currently decides whether to emit a margin tensor by checking only examples[0]. That makes the result depend on batch order:

  • if the first example has no margin, later margins are silently dropped
  • if the first example has margin but a later example does not, collation raises KeyError

This PR checks the whole batch instead. If any example provides margin, the collator emits a margin tensor and defaults missing margins to 0.0, matching the optional-field behavior documented for this collator.

I added regression tests for both mixed-order cases.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Tests

python -m pytest tests/test_reward_trainer.py::TestDataCollatorForPreference -q
python -m ruff check trl\trainer\reward_trainer.py tests\test_reward_trainer.py
python -m py_compile trl\trainer\reward_trainer.py tests\test_reward_trainer.py
git diff --check

Note

Medium Risk
Changes reward-training batch collation and can alter loss when batches mix examples with and without margins; scope is small and covered by new tests.

Overview
Fixes DataCollatorForPreference so optional margin values are handled correctly when only some examples in a batch define them.

Previously, collation keyed off examples[0] only: margins on later rows could be dropped, or a missing key on a later row could raise KeyError. The collator now emits a margin tensor if any example has margin, and fills missing entries with 0.0.

Two unit tests cover mixed batches (margin missing on the first vs. on a later example).

Reviewed by Cursor Bugbot for commit 5988fbb. Bugbot is set up for automated code reviews on this repo. Configure here.

@qgallouedec
Copy link
Copy Markdown
Member

thanks, can you answer this before: #5539 (comment)

@he-yufeng
Copy link
Copy Markdown
Author

Answered on the issue here: #5539 (comment)

Short version: I don't have a canonical paper example for intentionally mixed-margin rows. The practical issue is that margin is optional per example, but the current collator makes the batch decision from only examples[0], so behavior depends on shuffle/order. I also noted that I can switch the PR to a clear ValueError for mixed batches if you prefer stricter semantics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: DataCollatorForRewardModelingDataset checks margin only on examples[0], causing silent data loss after shuffle

2 participants