activation-steering

Here are 37 public repositories matching this topic...

IBM / activation-steering

[ICLR 2025] General-purpose activation steering library

alignment steering refusal representation-engineering activation-steering llm-steering

Updated Sep 18, 2025
Python

MaxBelitsky / cache-steering

KV Cache Steering for Inducing Reasoning in Small Language Models

reasoning kv-cache large-language-models llm representation-engineering activation-steering reasoning-language-models cache-steering

Updated Jul 24, 2025
Python

knoveleng / steering

[ACL 2026] Official repo for the paper "Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection"

ai-safety ai-alignment llm activation-steering

Updated May 22, 2026
Jupyter Notebook

a9lim / saklas

Activation steering and trait monitoring for HuggingFace transformers

python ai transformers interpretability huggingface llm representation-engineering activation-steering

Updated Jun 6, 2026
Python

dmis-lab / ASGuard

[ICLR 2026] ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

guard safety jailbreaking iclr interpretability activation-steering iclr2026

Updated Sep 30, 2025
Python

ant-research / Awesome-Refusal-Suppression

A bilingual awesome list for refusal suppression research: benchmarks, papers, tools, models, and ecosystem updates.

awesome-list activation-steering over-refusal refusal-suppression safety-neurons safety-degradation evading-safety-alignment

Updated May 13, 2026

GinoShun / Accent-Activation-Steering

Official code for "Activation Steering for Accent Adaptation in Speech Foundation Models" (Interspeech 2026). Parameter-free accent adaptation via mean-shift steering vectors — no weight updates, consistent WER reductions across 8 accents.

speech-recognition whisper asr interspeech accent-adaptation representation-engineering activation-steering qwen2-audio

Updated Mar 17, 2026
Python

jeewoo1025 / Awesome-Activation-Steering

The paper list related to activation steering

interpretability activation-steering llm-alignment

Updated May 10, 2026

levashi / reprobe

Phase-aware LLM activation steering and linear probing. A memory-efficient, practical implementation of Representation Engineering (RepE) for safety research.

transformers pytorch ai-safety mechanistic-interpretability llm-safety representation-engineering activation-steering linear-probes

Updated Apr 1, 2026
Python

sharanya-dasgupta001 / ARREST

Accepted at the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026

ai-safety adversarial-learning distribution-shift llm hallucination-mitigation activation-steering

Updated Jan 18, 2026
Python

aeon0199 / observer

Token-time interpretability instrument for language models — measures how generation trajectories move, branch, and respond to perturbation. Deterministic intervention comparisons via SeedCache branchpoints, hysteresis protocol, and SAE-feature steering.

transformers pytorch language-models ai-safety sae interpretability mechanistic-interpretability nnsight activation-steering perturbation-experiments

Updated Apr 30, 2026
Python

mehuly25 / GptOss

Contrastive activation steering in GPT-OSS-20B to study gender, race, religion, and refusal bias in bilingual data, extending earlier Llama 2 experiments.

nlp-machine-learning bias-detection fairness-ai responsible-ai llm activation-steering gpt-oss-20b

Updated May 8, 2026
Jupyter Notebook

mc9625 / activation-steering-experiments

Activation steering toolkit for Llama 3.2 3B — inject sensory-constructed vectors into model activations to alter processing dispositions. Web UI + API. Runs locally on consumer hardware.

language-models interpretability ai-art llm activation-steering

Updated May 19, 2026
TeX

Zishan-Shao / decodeshare

🏆[ICML 2026 Spotlight] Official implementation of "DecodeShare: Tracing the Shared Subspace of LLM Decode-Time Decisions"

protocol large large-language-models mechanistic-interpretability activation-steering prefill-decode

Updated May 28, 2026
Python

mehuly25 / llama2

Contrastive activation steering for bias detection in Llama 2 7B (English and Italian).

nlp bias-detection fairness-ai llm activation-steering

Updated May 8, 2026
Jupyter Notebook

DUC123co / panopticon-lattice

🌐 Model adversarial economics and AI alignment using the Panopticon Lattice, a multi-agent simulation exploring hidden collusion and system dynamics.

pytorch steganography evolutionary-algorithms game-theory multi-agent-systems ai-safety adversarial-learning ai-alignment mechanistic-interpretability activation-steering

Updated Jun 6, 2026
Python

hinanohart / dose

Pharmakon-Eval: dual-use LLM evaluation via Pharmakon Separability Index (PSI) and Dose-Response Curve (DRC)

python interpretability llm-evaluation activation-steering abliteration dual-use pharmakon

Updated Jun 6, 2026
Python

dimagoodlookingagent / paper1-emotion-steering

Code, vectors, and figures for the paper 'Emotion and authorization steering both move cheat; trained-probe suppression doesn't undo it: a mechanistic study in Gemma-2-2B'

gemma ai-safety ai-alignment llm mechanistic-interpretability representation-engineering activation-steering

Updated May 26, 2026
HTML

Seqev / retrieval-arc-delivery

Three matched-control gates localize long-context retrieval failure in sparse attention to a single locus: delivery. Attention redirection acts as transport, while stored-state correction is not reusable. Delivery works reliably, is content-addressable (no positional decay), and achieves a 27-byte attribute→code binding. Mechanism: RAG micro-hints.

reproducible-research llama sparse-attention long-context mechanistic-interpretability prompt-compression activation-steering retrieval-augmentation matched-controls

Updated Jun 3, 2026
Python

mufxio / emotion-vector-bench

Anthropic-style emotion-vector geometry, on any open-weight LLM, in one command. Frozen corpus + unified pipeline + statistical rigor + 5-model reference results.

benchmark emotion probe transformer llama mps mistral interpretability sparse-autoencoder apple-silicon llm mechanistic-interpretability anthropic qwen representation-engineering activation-steering open-weight-models

Updated May 8, 2026
Python
