Align KTO with DPO: Align processing_class initialization #5578
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit e303285.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
Just a question that came up while reviewing the PR: do we actually need this? Technically we tokenize per sample and don't delegate the padding to the tokenizer (padding is done in the collator).
We instantiate the collator with pad_token_id=tokenizer.pad_token_id
yes, but we could instantiate it with tokenizer.pad_token or tokenizer.eos_token
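For readers following this exchange, here is a minimal sketch of one reading of the suggestion: the collator only needs a padding id, so the fallback to EOS could be resolved at collator construction time without mutating the tokenizer. ToyTokenizer, resolve_pad_token_id, and pad_batch are hypothetical stand-ins written for illustration; they are not TRL or transformers APIs, and the actual collator in this PR may behave differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; only the fields used here."""

    eos_token: str = "</s>"
    eos_token_id: int = 2
    pad_token: Optional[str] = None
    pad_token_id: Optional[int] = None


def resolve_pad_token_id(tokenizer: ToyTokenizer) -> int:
    """Pick the id to pad with, falling back to EOS; the tokenizer is not modified."""
    if tokenizer.pad_token_id is not None:
        return tokenizer.pad_token_id
    return tokenizer.eos_token_id


def pad_batch(sequences: list[list[int]], pad_token_id: int) -> list[list[int]]:
    """Right-pad per-sample token ids to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]


tokenizer = ToyTokenizer()  # assumption: a tokenizer with no pad token defined
pad_id = resolve_pad_token_id(tokenizer)
batch = pad_batch([[5, 6, 7], [8]], pad_token_id=pad_id)
```

The trade-off is that the snippet under review sets tokenizer.pad_token once so every later consumer sees it, whereas this variant leaves the tokenizer untouched and resolves the id only where padding happens.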
Why not the processing class?
Because

Align KTO with DPO: Align processing_class initialization.
This PR updates the KTOTrainer class to streamline and clarify how the processing class is handled for data processing. The main changes include narrowing the accepted types for processing_class, updating its initialization logic, and ensuring consistent usage throughout the code. Documentation and type annotations have also been improved for clarity.

Part of:
Changes

Processing class handling improvements:
- The accepted types for processing_class are now limited to PreTrainedTokenizerBase or ProcessorMixin, removing support for BaseImageProcessor and FeatureExtractionMixin.
- If processing_class is not provided, it is now automatically loaded using AutoProcessor.from_pretrained, and appropriate error handling is added if the class is not of the expected type.
- The trainer now uses a tokenizer derived from the processing_class for padding and tokenization, and all relevant references are updated accordingly.

Documentation and error handling:
- The docstrings for KTOTrainer and its __init__ method are updated to reflect the new requirements and initialization behavior for processing_class.
- The error for processing_class being None is removed, since it is now always set during initialization.
- … self.processing_class.

Note
Medium Risk
Moderate risk because it changes KTOTrainer initialization defaults (auto-loading a processor and auto-setting pad_token), which can alter tokenization/padding behavior or break callers relying on previously accepted processor types.

Overview
Aligns KTOTrainer’s processing_class handling with DPOTrainer: it now only accepts a PreTrainedTokenizerBase or ProcessorMixin, auto-loads one via AutoProcessor when omitted, and normalizes usage through a derived tokenizer (including defaulting pad_token to eos_token).

Removes the prior requirement/error path for explicitly passing processing_class, updates the default collator and dataset tokenization steps to use the derived tokenizer, and updates docstrings/type hints accordingly.

Reviewed by Cursor Bugbot for commit 1585842.
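
The initialization flow summarized above (load a default when processing_class is omitted, accept only a tokenizer or a processor, derive a tokenizer, default pad_token to eos_token) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: FakeTokenizer and FakeProcessor are hypothetical stand-ins for PreTrainedTokenizerBase and ProcessorMixin, and load_default is a hypothetical callable standing in for the AutoProcessor.from_pretrained call.

```python
class FakeTokenizer:
    """Hypothetical stand-in for PreTrainedTokenizerBase (illustration only)."""

    def __init__(self, eos_token="</s>", pad_token=None):
        self.eos_token = eos_token
        self.pad_token = pad_token


class FakeProcessor:
    """Hypothetical stand-in for ProcessorMixin; it wraps a tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


def resolve_processing_class(processing_class, load_default):
    """Return (processing_class, tokenizer) following the described flow.

    load_default is an assumed zero-argument callable that plays the role of
    loading a processor for the model when none was passed.
    """
    if processing_class is None:
        processing_class = load_default()

    if isinstance(processing_class, FakeProcessor):
        # A processor carries its own tokenizer; use that for text handling.
        tokenizer = processing_class.tokenizer
    elif isinstance(processing_class, FakeTokenizer):
        tokenizer = processing_class
    else:
        raise TypeError("processing_class must be a tokenizer or a processor")

    # Default the padding token to EOS when the tokenizer defines none.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return processing_class, tokenizer


processor = FakeProcessor(FakeTokenizer())
resolved, tok = resolve_processing_class(None, load_default=lambda: processor)
```

Routing every later step through the derived tokenizer means the dataset tokenization and the default collator see the same pad_token regardless of whether the caller passed a tokenizer, a processor, or nothing.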