Agentic RL: Token-In, Token-Out Done Right
• 11
The AI community building the future.
Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation
Qualixar OS: A Universal Operating System for AI Agent Orchestration
hf skills add train-sentence-transformers --claude and ask Claude Code (or Codex / Cursor / Gemini CLI) to fine-tune a SentenceTransformer, CrossEncoder, or SparseEncoder model on your data.hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!hf-mem now splits MoE memory into base model weights, routed experts, and KV cachehf skills add train-sentence-transformers, then just describe what you want, e.g. "finetune a reranker on my (question, answer) pairs, mine hard negatives, and push it to the Hub".[batch × seq × vocab] logits tensor before computing cross-entropy, which dominates peak memory at long context lengths. The new loss_type="chunked_nll" path drops ignored-label tokens before the lm_head matmul and computes cross-entropy in checkpointed chunks of 256.nll, and unlocks sequence lengths that don't fit at all under the standard path.SFTConfig(loss_type="chunked_nll")trl.experimental.openreward adapter plugs any environment speaking the [Open Reward Standard](https://openrewardstandard.io) protocol into any TRL trainer that takes an environment_factory. One string — a catalog name or a URL — wires the dataset, factory, and reward_func slots; tools are bound dynamically from JSON Schema, no per-env wrapper code:from trl import GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec
spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)
trainer = GRPOTrainer(
...,
train_dataset=spec.train_dataset,
environment_factory=spec.environment_factory,
reward_funcs=spec.reward_funcs,
)Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) reuses the Qwen3.5-MoE architecture but ships a slightly different chat template, so we updated the stack end-to-end: new training template with {% generation %} markers, tool-call response schema routing, tiny test models for the VLM matrix.from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
model="Qwen/Qwen3.6-27B",
args=SFTConfig(assistant_only_loss=True),
train_dataset=dataset,
)
trainer.train()tools=[...] to GRPOTrainer.trl vllm-serve (Qwen3 MTP / Eagle3 drafts), 12 more KTO ↔ DPO alignment PRs (KTO promotion to stable is now in reach), three more {% generation %} chat templates (Gemma/Gemma 2, Phi-3, GLM-4-MoE), and a chunky SFT entropy bug fix.from trl.experimental.ssd import SSDConfig, SSDTrainer
trainer = SSDTrainer(
model="Qwen/Qwen3-4B-Instruct",
args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
train_dataset=dataset,
)
trainer.train()use_transformers_paged, and key fixes for VLM response parsing.