<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Mechanistic Interpretability Hub</title>
    <link>https://izkula.github.io/cc</link>
    <description>Latest research in mechanistic interpretability - understanding how neural networks work internally</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://izkula.github.io/cc/feed.xml" rel="self" type="application/rss+xml"/>

    <item>
      <title>LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis</title>
      <link>https://arxiv.org/abs/2604.01725v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, et al.</p><p>General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection...</p><p><strong>Tags:</strong> safety</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01725v1</guid>
    </item>
    <item>
      <title>Identifying and Estimating Causal Direct Effects Under Unmeasured Confounding</title>
      <link>https://arxiv.org/abs/2604.01501v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Philippe Boileau, Nima S. Hejazi, Ivana Malenica, Peter B. Gilbert, Sandrine Dudoit, et al.</p><p>Causal mediation analysis provides techniques for defining and estimating effects that may be endowed with mechanistic interpretations. With many scientific investigations seeking to address mechanistic questions, causal direct and indirect effects have garnered much attention. The natural direct and indirect effects, the most widely used among such causal mediation estimands, are limited in their practical utility due to stringent identification requirements. Accordingly, considerable effort...</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01501v1</guid>
    </item>
    <item>
      <title>Automatic Image-Level Morphological Trait Annotation for Organismal Images</title>
      <link>https://arxiv.org/abs/2604.01619v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su</p><p>Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic...</p><p><strong>Tags:</strong> SAE, features, vision, biology</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01619v1</guid>
    </item>
    <item>
      <title>The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level</title>
      <link>https://arxiv.org/abs/2604.02178v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Jeremy Herbst, Jae Hee Lee, Stefan Wermter</p><p>Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic...</p><p><strong>Tags:</strong> features, probing</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.02178v1</guid>
    </item>
    <item>
      <title>Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations</title>
      <link>https://arxiv.org/abs/2604.01639v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Shou-Tzu Han, Rodrigue Rizk, KC Santosh</p><p>Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently...</p><p><strong>Tags:</strong> reasoning</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01639v1</guid>
    </item>
    <item>
      <title>ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline</title>
      <link>https://arxiv.org/abs/2604.02182v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara</p><p>Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer...</p><p><strong>Tags:</strong> vision</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.02182v1</guid>
    </item>
    <item>
      <title>When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals</title>
      <link>https://arxiv.org/abs/2604.01476v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Rui Wu, Ruixiang Tang</p><p>Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting as a controlled testbed, where models can rewrite evaluator code to trivially pass tests without solving the task. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail...</p>]]></description>
      <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01476v1</guid>
    </item>
    <item>
      <title>SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits</title>
      <link>https://arxiv.org/abs/2604.01473v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu</p><p>Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from the randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using token-level logits...</p><p><strong>Tags:</strong> features</p>]]></description>
      <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01473v1</guid>
    </item>
    <item>
      <title>Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics</title>
      <link>https://arxiv.org/abs/2604.00443v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Iyad Ait Hou, Rebecca Hwa</p><p>If the same neuron activates for both &quot;lender&quot; and &quot;riverside,&quot; standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as &quot;bank&quot;) rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition...</p><p><strong>Tags:</strong> superposition, features</p>]]></description>
      <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.00443v1</guid>
    </item>
    <item>
      <title>Tracking Equivalent Mechanistic Interpretations Across Neural Networks</title>
      <link>https://arxiv.org/abs/2603.30002v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Alan Sun, Mariya Toneva</p><p>Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model&#x27;s decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by...</p>]]></description>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.30002v1</guid>
    </item>
    <item>
      <title>Tucker Attention: A generalization of approximate attention mechanisms</title>
      <link>https://arxiv.org/abs/2603.30033v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer</p><p>The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self-attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret...</p><p><strong>Tags:</strong> attention</p>]]></description>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.30033v1</guid>
    </item>
    <item>
      <title>Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics</title>
      <link>https://arxiv.org/abs/2604.00277v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Simone Betteti, Luca Laurenti</p><p>Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant...</p>]]></description>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.00277v1</guid>
    </item>
    <item>
      <title>Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs</title>
      <link>https://arxiv.org/abs/2603.27518v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Utsav Maskey, Mark Dras, Usman Naseem</p><p>Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand...</p><p><strong>Tags:</strong> safety, steering</p>]]></description>
      <pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.27518v1</guid>
    </item>
    <item>
      <title>ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding</title>
      <link>https://arxiv.org/abs/2603.27064v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, et al.</p><p>Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample...</p><p><strong>Tags:</strong> vision, reasoning</p>]]></description>
      <pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.27064v1</guid>
    </item>
    <item>
      <title>From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs</title>
      <link>https://arxiv.org/abs/2603.26323v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, et al.</p><p>As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models&#x27; (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning...</p><p><strong>Tags:</strong> probing, reasoning</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26323v1</guid>
    </item>
    <item>
      <title>A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs</title>
      <link>https://arxiv.org/abs/2603.26236v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Uri Z. Kialy, Avi Shtarkberg, Ayal Klein</p><p>While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic registers...</p><p><strong>Tags:</strong> SAE, probing</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26236v1</guid>
    </item>
    <item>
      <title>Identifying Connectivity Distributions from Neural Dynamics Using Flows</title>
      <link>https://arxiv.org/abs/2603.26506v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Timothy Doyeon Kim, Ulises Pereira-Obilinovic, Yiliu Wang, Eric Shea-Brown, Uygar Sümbül</p><p>Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses low-rank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying...</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26506v1</guid>
    </item>
    <item>
      <title>Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models</title>
      <link>https://arxiv.org/abs/2603.26494v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Nathan Roll</p><p>Quantum language models have shown competitive performance on sequential tasks, yet whether trained quantum circuits exploit genuinely quantum resources -- or merely embed classical computation in quantum hardware -- remains unknown. Prior work has evaluated these models through endpoint metrics alone, without examining the memory strategies they actually learn internally. We introduce the first mechanistic interpretability study of quantum language models, combining causal gate ablation, entanglement...</p><p><strong>Tags:</strong> circuits</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26494v1</guid>
    </item>
    <item>
      <title>Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals</title>
      <link>https://arxiv.org/abs/2603.26829v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Nathaniel Oh, Paul Attie</p><p>Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&amp;R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, ...)...</p><p><strong>Tags:</strong> circuits, safety</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26829v1</guid>
    </item>
    <item>
      <title>Closing the Confidence-Faithfulness Gap in Large Language Models</title>
      <link>https://arxiv.org/abs/2603.25052v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Miranda Muqing Miao, Lyle Ungar</p><p>Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent...</p><p><strong>Tags:</strong> probing, steering</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25052v1</guid>
    </item>
    <item>
      <title>Mechanistically Interpreting Compression in Vision-Language Models</title>
      <link>https://arxiv.org/abs/2603.25035v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, et al.</p><p>Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization fundamentally change the internals across representative VLMs. We observe that pruning generally keeps...</p><p><strong>Tags:</strong> circuits, features, safety, vision</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25035v1</guid>
    </item>
    <item>
      <title>How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models</title>
      <link>https://arxiv.org/abs/2603.25325v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó</p><p>Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0--60%), we investigate five...</p><p><strong>Tags:</strong> SAE, features, probing</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25325v1</guid>
    </item>
    <item>
      <title>Sparse Visual Thought Circuits in Vision-Language Models</title>
      <link>https://arxiv.org/abs/2603.25075v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Yunpeng Zhou</p><p>Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning -- an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and...</p><p><strong>Tags:</strong> SAE, circuits, features, steering, vision, reasoning</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25075v1</guid>
    </item>
    <item>
      <title>Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers</title>
      <link>https://arxiv.org/abs/2603.25074v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, et al.</p><p>Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to...</p><p><strong>Tags:</strong> safety, vision</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25074v1</guid>
    </item>
    <item>
      <title>A Neuro-Symbolic System for Interpretable Multimodal Physiological Signals Integration in Human Fatigue Detection</title>
      <link>https://arxiv.org/abs/2603.24358v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Mohammadreza Jamalifard, Yaxiong Lei, Parasto Azizinezhad, Javier Fumanal-Idocin, Javier Andreu-Perez</p><p>We propose a neuro-symbolic architecture that learns four interpretable physiological concepts (oculomotor dynamics, gaze stability, prefrontal hemodynamics, and a multimodal concept) from eye-tracking and neural-hemodynamic (functional near-infrared spectroscopy, fNIRS) windows using attention-based encoders, and combines them with differentiable approximate reasoning rules using learned weights and soft thresholds, to address both rigid hand-crafted rules and the lack of subject-level alignment...</p><p><strong>Tags:</strong> safety, vision, reasoning</p>]]></description>
      <pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.24358v1</guid>
    </item>
    <item>
      <title>From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition</title>
      <link>https://arxiv.org/abs/2603.24653v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Francesco Gentile, Nicola Dall&#x27;Asen, Francesco Tonini, Massimiliano Mancini, Lorenzo Vaquero, et al.</p><p>As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP&#x27;s vision transformer in weight space. For each attention head, we...</p><p><strong>Tags:</strong> attention, vision</p>]]></description>
      <pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.24653v1</guid>
    </item>
    <item>
      <title>Steering LLMs for Culturally Localized Generation</title>
      <link>https://arxiv.org/abs/2603.23301v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, et al.</p><p>LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features...</p><p><strong>Tags:</strong> SAE, safety, steering</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.23301v1</guid>
    </item>
    <item>
      <title>SafeSeek: Universal Attribution of Safety Circuits in Language Models</title>
      <link>https://arxiv.org/abs/2603.23268v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, et al.</p><p>Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits...</p><p><strong>Tags:</strong> circuits, safety</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.23268v1</guid>
    </item>
    <item>
      <title>From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery</title>
      <link>https://arxiv.org/abs/2603.22768v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Bijay Shakya, Catherine Hoier, Khandaker Mamun Ahmed</p><p>Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive...</p><p><strong>Tags:</strong> vision</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.22768v1</guid>
    </item>
    <item>
      <title>Sparse Autoencoders for Interpretable Medical Image Representation Learning</title>
      <link>https://arxiv.org/abs/2603.23794v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis</p><p>Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices...</p><p><strong>Tags:</strong> SAE, features, vision</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.23794v1</guid>
    </item>
  </channel>
</rss>
