Skip to content

Set _tokenizer as trainer attribute#5489

Merged
albertvillanova merged 26 commits into
huggingface:mainfrom
albertvillanova:set-tokenizer-attribute
Apr 20, 2026
Merged

Set _tokenizer as trainer attribute#5489
albertvillanova merged 26 commits into
huggingface:mainfrom
albertvillanova:set-tokenizer-attribute

Conversation

@albertvillanova
Copy link
Copy Markdown
Member

@albertvillanova albertvillanova commented Apr 9, 2026

Set _tokenizer attribute.

This is a draft PR. If validated, I will implement the same pattern in other trainers in this same PR.

CC: @qgallouedec

This PR refactors the GRPOTrainer to consistently use the self._tokenizer attribute throughout the class, replacing previous direct references to tokenizer, processing_class, or their attributes. This change improves code clarity, reduces ambiguity, and ensures that all tokenizer-related operations are performed on the correct instance, especially in complex scenarios involving processors and vision-language models (VLMs).

Changes

Tokenizer Attribute Refactoring:

  • All references to tokenizer, processing_class.tokenizer, or processing_class for tokenizer operations are now replaced with self._tokenizer. This includes setting pad/eos tokens, accessing token IDs, and calling tokenizer methods.

Chat Template and Response Schema Handling:

  • Logic for checking and adding response schemas, as well as selecting chat templates, now consistently uses self._tokenizer instead of processing_class or its variants. This ensures correct behavior when tools are enabled and for multi-turn training.

Vision-Language Model (VLM) Support:

  • Token ID lookups for vision and video tokens in VLMs now use self._tokenizer, ensuring compatibility and correctness across different model families.

These changes make the codebase more robust and maintainable by centralizing tokenizer logic and reducing potential bugs from inconsistent attribute usage.


Note

Medium Risk
Mostly a mechanical refactor, but it touches token ID/padding/EOS detection and tool-call parsing paths used during generation and masking, which could subtly change training behavior across several trainers.

Overview
Refactors several trainers (GRPOTrainer, RLOOTrainer, DPOTrainer, and experimental DPPO/GFPO/GRPOWithReplayBuffer) to store the tokenizer as self._tokenizer and consistently use it for pad_token_id/eos_token_id, padding, truncation detection, and parse_response (tool-call/structured output decoding).

Updates initialization logic to ensure pad_token is set on the underlying tokenizer and adjusts tests to mock _tokenizer instead of per-trainer pad_token_id/eos_token_id attributes.

Reviewed by Cursor Bugbot for commit 59b217e. Bugbot is set up for automated code reviews on this repo. Configure here.

@albertvillanova albertvillanova changed the title Set _tokenizer attribute Set _tokenizer as trainer attribute Apr 9, 2026
Comment thread trl/trainer/grpo_trainer.py Outdated
and isinstance(tokenizer, PreTrainedTokenizerBase)
and hasattr(tokenizer, "response_schema") # attribute not set by default for now
and tokenizer.response_schema is not None # only works if the tokenizer has a schema
and isinstance(self._tokenizer, PreTrainedTokenizerBase)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this condition is redundant now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread trl/trainer/grpo_trainer.py
Comment on lines +1797 to +1798
and hasattr(self._tokenizer, "response_schema") # attribute not set by default for now
and self._tokenizer.response_schema is not None # only works if the tokenizer has a schema
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these 2 conditions can be combined into 1:

Suggested change
and hasattr(self._tokenizer, "response_schema") # attribute not set by default for now
and self._tokenizer.response_schema is not None # only works if the tokenizer has a schema
and getattr(self._tokenizer, "response_schema", None) # only works if the tokenizer has a schema: attribute not set by default for now

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, although I feel it's less readable like this

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current can be read smoothly: "the tokenizer has an attribute "response_schema" and this attribute response_schema is not None"

Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread tests/test_grpo_trainer.py Outdated
Copy link
Copy Markdown
Member

@qgallouedec qgallouedec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good! there are a few comments to address, mostly due to recent changes in grpo

Comment thread trl/trainer/grpo_trainer.py Outdated
post_tool_completions = [
parse_response(self.processing_class, ids) if ids else {} for ids in post_tool_ids
]
post_tool_completions = [parse_response(self._tokenizer, ids) if ids else {} for ids in post_tool_ids]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_response now handles both processor and tokenizer, this should be reverted

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_response only needs a tokenizer instance but it had to handle both because we did not have a simple way to pass only tokenizer. Once we implement self._tokenizer in all trainers, parse_response could be simplified to accept only tokenizer instances.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note there are already other functions that only accept tokenizer instances: add_response_schema, is_chat_template_prefix_preserving, get_training_chat_template.

More broadly, the underlying goal of this PR is to centralize the processor/tokenizer disambiguation within processing_class in a single place, so that the rest of the code can rely on a well-defined and consistent interface, with a clear expected class instance.

In that sense, the current change in calling parse_response is an intermediate step toward that simplification, rather than a deviation from it.

Copy link
Copy Markdown
Member

@qgallouedec qgallouedec Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's revert the VLM support in parse_response then.

For the record, a few things to keep in mind going forward. Accessing the inner tokenizer isn't always safe: for apply_chat_template, the processor does extra work (image token expansion), so processor.apply_chat_template is not just an alias for processor.tokenizer.apply_chat_template.
Applying that to each helper:

  • add_response_schema: tokenizer-only is fine. The processor doesn't expose a schema, and I'd expect schema handling to stay tokenizer-level even if upstream adds it — no chat-template-style side effects.
    EDIT: because we check the chat template in add_response_schema, we should actually pass the processor, even if we only set processor.tokenizer.response_schema

  • parse_response: same story, parse_response only exists on the tokenizer today, so we have no choice. And even if upstream eventually moves it to the processor, I'm not too worried: parsing is not the same kind of operation as applying a chat template, so the image-token caveat doesn't really apply here. Revert VLM support in parse_response #5561

  • is_chat_template_prefix_preserving: this one is actually wrong as-is. It should support VLMs (for the reason above), and the call site change in this PR should and will be reverted:

    - if self.tools and not is_chat_template_prefix_preserving(self._tokenizer):
    + if self.tools and not is_chat_template_prefix_preserving(processing_class):

    We'll extend the function in a follow-up. Support VLM processors in is_chat_template_prefix_preserving #5558

  • get_training_chat_template: contrary to what the type hint suggests (needs updating), it already supports both tokenizer and processor, and should be called on the processor. My rule of thumb: with a processor, never manipulate the inner chat template directly. Accept processor in get_training_chat_template #5560

Comment thread trl/trainer/grpo_trainer.py Outdated
and isinstance(tokenizer, PreTrainedTokenizerBase)
and hasattr(tokenizer, "response_schema") # attribute not set by default for now
and tokenizer.response_schema is not None # only works if the tokenizer has a schema
and isinstance(self._tokenizer, PreTrainedTokenizerBase)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

and self._tokenizer.response_schema is not None # only works if the tokenizer has a schema
):
completions = [[parse_response(parsing_class, ids)] for ids in completion_ids]
completions = [[parse_response(self._tokenizer, ids)] for ids in completion_ids]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread trl/trainer/grpo_trainer.py
Comment thread trl/trainer/grpo_trainer.py Outdated
has_response_schema = getattr(self._tokenizer, "response_schema", None)
if self.tools and not has_response_schema:
processing_class = add_response_schema(processing_class)
self._tokenizer = add_response_schema(self._tokenizer)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we add the response schema by checking the chat template in add_response_schema. But processor.chat_template and processor.tokenizer.chat_template are independent attributes, and processor.apply_chat_template reads only the former, so they can silently diverge if either is mutated. In practice they're populated from the same source files on load and match today for all tested VLMs, but technically, patching one doesn't patch the other, which is why I think we should keep

Suggested change
self._tokenizer = add_response_schema(self._tokenizer)
processing_class = add_response_schema(processing_class)

and extend add_response_schema to processor: #5520

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I understood you said it was OK to pass just tokenizer: #5489 (comment)

  • add_response_schema: tokenizer-only is fine. The processor doesn't expose a schema, and I'd expect schema handling to stay tokenizer-level even if upstream adds it — no chat-template-style side effects.

Copy link
Copy Markdown
Member Author

@albertvillanova albertvillanova Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read your PR and I'm not sure of fully understanding it: isn't it overlapping with this PR?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah well actually I was not correct: add_response_schema should actually take the processor, not because of the schema, but because of the template: there is no guaranty that processor.chat_template == processor.tokenizer.chat_template. In other words

# this would be correct in the vast majority of cases
def add_response_schema(tokenizer):
    tokenizer.response_schema = SCHEMAS[tokenizer.chat_template]

# but this is better
def add_response_schema(processor):
    processor.tokenizer.response_schema = SCHEMAS[processor.chat_template]

Copy link
Copy Markdown
Member Author

@albertvillanova albertvillanova Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having thought longer about this, I'm not sure about the logic in transformers' implementation of this...

  • Shouldn't they enforce that both (processor and tokenizer) chat_template's are the same?
  • Does it make sense that a tokenizer.response_schema is aligned with its processor.chat_template, but not with its own tokenizer.chat_template?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5cb4c39. Configure here.

Comment thread trl/trainer/grpo_trainer.py Outdated
Copy link
Copy Markdown
Member

@qgallouedec qgallouedec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@albertvillanova albertvillanova merged commit 6e1705a into huggingface:main Apr 20, 2026
18 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants