Skip to content

Align KTO with DPO: Align processing_class initialization#5578

Merged
albertvillanova merged 3 commits into
huggingface:mainfrom
albertvillanova:align-kto-dpo-processing_class
Apr 17, 2026
Merged

Align KTO with DPO: Align processing_class initialization#5578
albertvillanova merged 3 commits into
huggingface:mainfrom
albertvillanova:align-kto-dpo-processing_class

Conversation

@albertvillanova
Copy link
Copy Markdown
Member

@albertvillanova albertvillanova commented Apr 17, 2026

Align KTO with DPO: Align processing_class initialization.

This PR updates the KTOTrainer class to streamline and clarify how processing class is handled for data processing. The main changes include narrowing the accepted types for processing_class, updating its initialization logic, and ensuring consistent usage throughout the code. Documentation and type annotations have also been improved for clarity.

Part of:

Changes

Processing class handling improvements:

  • The accepted types for processing_class are now limited to PreTrainedTokenizerBase or ProcessorMixin, removing support for BaseImageProcessor and FeatureExtractionMixin.
  • If processing_class is not provided, it is now automatically loaded using AutoProcessor.from_pretrained, and appropriate error handling is added if the class is not of the expected type.
  • The code now consistently uses the tokenizer derived from the processing_class for padding and tokenization, and updates all relevant references accordingly.

Documentation and error handling:

  • The docstring for KTOTrainer and its __init__ method are updated to reflect the new requirements and initialization behavior for processing_class.
  • Removed redundant error checks for processing_class being None since it is now always set during initialization.
  • Minor cleanup, such as removing the unused assignment of self.processing_class.

Note

Medium Risk
Moderate risk because it changes KTOTrainer initialization defaults (auto-loading a processor and auto-setting pad_token), which can alter tokenization/padding behavior or break callers relying on previously accepted processor types.

Overview
Aligns KTOTrainer’s processing_class handling with DPOTrainer: it now only accepts a PreTrainedTokenizerBase or ProcessorMixin, auto-loads one via AutoProcessor when omitted, and normalizes usage through a derived tokenizer (including defaulting pad_token to eos_token).

Removes the prior requirement/error path for explicitly passing processing_class, updates the default collator and dataset tokenization steps to use the derived tokenizer, and updates docstrings/type hints accordingly.

Reviewed by Cursor Bugbot for commit 1585842. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit e303285. Configure here.

Comment thread trl/experimental/kto/kto_trainer.py Outdated
Comment thread trl/experimental/kto/kto_trainer.py
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +360 to +361
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question that came while reviewing the pr. Do we actually need this? Because technically we tokenize per sample, and we don't delegate the padding to the tokenizer (padding is done in the collator)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We instantiate the collator with pad_token_id=tokenizer.pad_token_id

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but we could instantiate it with tokenizer.pad_token or tokenizer.eos_token

@qgallouedec
Copy link
Copy Markdown
Member

The code now consistently uses the tokenizer derived from the processing_class for padding and tokenization, and updates all relevant references accordingly.

Why not the processing class?

@albertvillanova
Copy link
Copy Markdown
Member Author

Why not the processing class?

Because _tokenize and _process_tokens use atokenizer instance.

@albertvillanova albertvillanova merged commit 8ff0069 into huggingface:main Apr 17, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants