<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Mechanistic Interpretability Hub</title>
    <link>https://izkula.github.io/cc</link>
    <description>Latest research in mechanistic interpretability - understanding how neural networks work internally</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://izkula.github.io/cc/feed.xml" rel="self" type="application/rss+xml"/>

    <item>
      <title>LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis</title>
      <link>https://arxiv.org/abs/2604.01725v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, et al.</p><p>General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection...</p><p><strong>Tags:</strong> safety</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01725v1</guid>
    </item>
    <item>
      <title>Identifying and Estimating Causal Direct Effects Under Unmeasured Confounding</title>
      <link>https://arxiv.org/abs/2604.01501v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Philippe Boileau, Nima S. Hejazi, Ivana Malenica, Peter B. Gilbert, Sandrine Dudoit, et al.</p><p>Causal mediation analysis provides techniques for defining and estimating effects that may be endowed with mechanistic interpretations. With many scientific investigations seeking to address mechanistic questions, causal direct and indirect effects have garnered much attention. The natural direct and indirect effects, the most widely used among such causal mediation estimands, are limited in their practical utility due to stringent identification requirements. Accordingly, considerable effort...</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01501v1</guid>
    </item>
    <item>
      <title>Automatic Image-Level Morphological Trait Annotation for Organismal Images</title>
      <link>https://arxiv.org/abs/2604.01619v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su</p><p>Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic...</p><p><strong>Tags:</strong> SAE, features, vision, biology</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01619v1</guid>
    </item>
    <item>
      <title>The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level</title>
      <link>https://arxiv.org/abs/2604.02178v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Jeremy Herbst, Jae Hee Lee, Stefan Wermter</p><p>Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic...</p><p><strong>Tags:</strong> features, probing</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.02178v1</guid>
    </item>
    <item>
      <title>Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations</title>
      <link>https://arxiv.org/abs/2604.01639v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Shou-Tzu Han, Rodrigue Rizk, KC Santosh</p><p>Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently...</p><p><strong>Tags:</strong> reasoning</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01639v1</guid>
    </item>
    <item>
      <title>ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline</title>
      <link>https://arxiv.org/abs/2604.02182v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara</p><p>Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer...</p><p><strong>Tags:</strong> vision</p>]]></description>
      <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.02182v1</guid>
    </item>
    <item>
      <title>When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals</title>
      <link>https://arxiv.org/abs/2604.01476v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Rui Wu, Ruixiang Tang</p><p>Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting as a controlled testbed, where models can rewrite evaluator code to trivially pass tests without solving the task. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail...</p>]]></description>
      <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01476v1</guid>
    </item>
    <item>
      <title>SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits</title>
      <link>https://arxiv.org/abs/2604.01473v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu</p><p>Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from the randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using token-level logits...</p><p><strong>Tags:</strong> features</p>]]></description>
      <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.01473v1</guid>
    </item>
    <item>
      <title>Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics</title>
      <link>https://arxiv.org/abs/2604.00443v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Iyad Ait Hou, Rebecca Hwa</p><p>If the same neuron activates for both &quot;lender&quot; and &quot;riverside,&quot; standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as &quot;bank&quot;) rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition...</p><p><strong>Tags:</strong> superposition, features</p>]]></description>
      <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.00443v1</guid>
    </item>
    <item>
      <title>Tracking Equivalent Mechanistic Interpretations Across Neural Networks</title>
      <link>https://arxiv.org/abs/2603.30002v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Alan Sun, Mariya Toneva</p><p>Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model&#x27;s decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by...</p>]]></description>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.30002v1</guid>
    </item>
    <item>
      <title>Tucker Attention: A generalization of approximate attention mechanisms</title>
      <link>https://arxiv.org/abs/2603.30033v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer</p><p>The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self-attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret...</p><p><strong>Tags:</strong> attention</p>]]></description>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.30033v1</guid>
    </item>
    <item>
      <title>Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics</title>
      <link>https://arxiv.org/abs/2604.00277v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Simone Betteti, Luca Laurenti</p><p>Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant...</p>]]></description>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2604.00277v1</guid>
    </item>
    <item>
      <title>Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs</title>
      <link>https://arxiv.org/abs/2603.27518v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Utsav Maskey, Mark Dras, Usman Naseem</p><p>Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand...</p><p><strong>Tags:</strong> safety, steering</p>]]></description>
      <pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.27518v1</guid>
    </item>
    <item>
      <title>ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding</title>
      <link>https://arxiv.org/abs/2603.27064v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, et al.</p><p>Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample...</p><p><strong>Tags:</strong> vision, reasoning</p>]]></description>
      <pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.27064v1</guid>
    </item>
    <item>
      <title>From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs</title>
      <link>https://arxiv.org/abs/2603.26323v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, et al.</p><p>As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models&#x27; (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning...</p><p><strong>Tags:</strong> probing, reasoning</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26323v1</guid>
    </item>
    <item>
      <title>A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs</title>
      <link>https://arxiv.org/abs/2603.26236v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Uri Z. Kialy, Avi Shtarkberg, Ayal Klein</p><p>While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic registers...</p><p><strong>Tags:</strong> SAE, probing</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26236v1</guid>
    </item>
    <item>
      <title>Identifying Connectivity Distributions from Neural Dynamics Using Flows</title>
      <link>https://arxiv.org/abs/2603.26506v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Timothy Doyeon Kim, Ulises Pereira-Obilinovic, Yiliu Wang, Eric Shea-Brown, Uygar Sümbül</p><p>Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses low-rank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying...</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26506v1</guid>
    </item>
    <item>
      <title>Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models</title>
      <link>https://arxiv.org/abs/2603.26494v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Nathan Roll</p><p>Quantum language models have shown competitive performance on sequential tasks, yet whether trained quantum circuits exploit genuinely quantum resources -- or merely embed classical computation in quantum hardware -- remains unknown. Prior work has evaluated these models through endpoint metrics alone, without examining the memory strategies they actually learn internally. We introduce the first mechanistic interpretability study of quantum language models, combining causal gate ablation, entanglement...</p><p><strong>Tags:</strong> circuits</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26494v1</guid>
    </item>
    <item>
      <title>Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals</title>
      <link>https://arxiv.org/abs/2603.26829v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Nathaniel Oh, Paul Attie</p><p>Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&amp;R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, ...)...</p><p><strong>Tags:</strong> circuits, safety</p>]]></description>
      <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.26829v1</guid>
    </item>
    <item>
      <title>Closing the Confidence-Faithfulness Gap in Large Language Models</title>
      <link>https://arxiv.org/abs/2603.25052v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Miranda Muqing Miao, Lyle Ungar</p><p>Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent...</p><p><strong>Tags:</strong> probing, steering</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25052v1</guid>
    </item>
    <item>
      <title>Mechanistically Interpreting Compression in Vision-Language Models</title>
      <link>https://arxiv.org/abs/2603.25035v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, et al.</p><p>Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization fundamentally change the internals across representative VLMs. We observe that pruning generally keeps...</p><p><strong>Tags:</strong> circuits, features, safety, vision</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25035v1</guid>
    </item>
    <item>
      <title>How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models</title>
      <link>https://arxiv.org/abs/2603.25325v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó</p><p>Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0--60%), we investigate five...</p><p><strong>Tags:</strong> SAE, features, probing</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25325v1</guid>
    </item>
    <item>
      <title>Sparse Visual Thought Circuits in Vision-Language Models</title>
      <link>https://arxiv.org/abs/2603.25075v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Yunpeng Zhou</p><p>Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning -- an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and...</p><p><strong>Tags:</strong> SAE, circuits, features, steering, vision, reasoning</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25075v1</guid>
    </item>
    <item>
      <title>Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers</title>
      <link>https://arxiv.org/abs/2603.25074v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, et al.</p><p>Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to...</p><p><strong>Tags:</strong> safety, vision</p>]]></description>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.25074v1</guid>
    </item>
    <item>
      <title>A Neuro-Symbolic System for Interpretable Multimodal Physiological Signals Integration in Human Fatigue Detection</title>
      <link>https://arxiv.org/abs/2603.24358v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Mohammadreza Jamalifard, Yaxiong Lei, Parasto Azizinezhad, Javier Fumanal-Idocin, Javier Andreu-Perez</p><p>We propose a neuro-symbolic architecture that learns four interpretable physiological concepts (oculomotor dynamics, gaze stability, prefrontal hemodynamics, and a multimodal concept) from eye-tracking and neural-hemodynamic (functional near-infrared spectroscopy, fNIRS) windows using attention-based encoders, and combines them with differentiable approximate reasoning rules using learned weights and soft thresholds, to address both rigid hand-crafted rules and the lack of subject-level alignment...</p><p><strong>Tags:</strong> safety, vision, reasoning</p>]]></description>
      <pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.24358v1</guid>
    </item>
    <item>
      <title>From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition</title>
      <link>https://arxiv.org/abs/2603.24653v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Francesco Gentile, Nicola Dall&#x27;Asen, Francesco Tonini, Massimiliano Mancini, Lorenzo Vaquero, et al.</p><p>As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP&#x27;s vision transformer in weight space. For each attention head, we...</p><p><strong>Tags:</strong> attention, vision</p>]]></description>
      <pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.24653v1</guid>
    </item>
    <item>
      <title>Steering LLMs for Culturally Localized Generation</title>
      <link>https://arxiv.org/abs/2603.23301v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, et al.</p><p>LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features...</p><p><strong>Tags:</strong> SAE, safety, steering</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.23301v1</guid>
    </item>
    <item>
      <title>SafeSeek: Universal Attribution of Safety Circuits in Language Models</title>
      <link>https://arxiv.org/abs/2603.23268v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, et al.</p><p>Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits...</p><p><strong>Tags:</strong> circuits, safety</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.23268v1</guid>
    </item>
    <item>
      <title>From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery</title>
      <link>https://arxiv.org/abs/2603.22768v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Bijay Shakya, Catherine Hoier, Khandaker Mamun Ahmed</p><p>Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive...</p><p><strong>Tags:</strong> vision</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.22768v1</guid>
    </item>
    <item>
      <title>Sparse Autoencoders for Interpretable Medical Image Representation Learning</title>
      <link>https://arxiv.org/abs/2603.23794v1</link>
      <description><![CDATA[<p><strong>Authors:</strong> Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis</p><p>Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices...</p><p><strong>Tags:</strong> SAE, features, vision</p>]]></description>
      <pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://arxiv.org/abs/2603.23794v1</guid>
    </item>
  </channel>
</rss>
