I am a PhD candidate researching the inner workings of Large Language Models. My goal is to bridge the gap between model capability and human understanding.
Specifically, I work on Mechanistic Interpretability: tracing and editing internal representations to make LLMs more controllable and transparent. I ask questions like "Where is knowledge stored?" and "How can we causally edit model behavior?"
Language models used in retrieval-augmented settings must arbitrate between parametric knowledge stored in their weights and contextual information in the prompt. This work presents a mechanistic study of that choice by extracting an \emph{arbitration} vector from model activations on a curated dataset designed to disentangle (i) irrelevant contexts that elicit parametric recall and (ii) relevant but false contexts that elicit copying. The vector is computed as the residual-stream centroid difference between these regimes across 27 relations, and is injected as an additive intervention at selected layers and token spans to steer behavior in two directions: Copy→Recall (suppressing context use) and Recall→Copy (inducing the model to copy any token from the context). Experiments on two architectures (decoder-only and encoder-decoder) and two open-domain QA benchmarks show consistent behavior shifts under moderate scaling while monitoring accuracy and fluency. Mechanistic analyses of attention routing, MLP contributions, and layer-wise probability trajectories reveal an asymmetry: inducing copying is an easy “reactivation” process that can be triggered at different locations in the input, while restoring recall is a “suppression” process that is more fragile and strongly tied to object-token interventions.
@misc{farahani-penzkofer-johansson-2026-copy,
  title         = {To Copy or Not to Copy: Copying Is Easier to Induce Than Recall},
  author        = {Farahani, Mehrdad and Penzkofer, Franziska and Johansson, Richard},
  year          = {2026},
  url           = {https://arxiv.org/abs/2601.12075},
  eprint        = {2601.12075},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
}
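The sketch below illustrates the general idea behind the arbitration-vector intervention described in the abstract above: take the centroid difference of residual-stream states between the two context regimes and add the scaled vector back into the stream during generation. It is a minimal sketch, not the paper's code: gpt2 is a stand-in model, the prompts, layer, and scaling factor are hypothetical, and the vector is injected at every position rather than at the selected token spans used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model, not one of the architectures studied in the paper
LAYER = 6             # hypothetical intervention layer
ALPHA = 4.0           # hypothetical scaling factor

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def mean_last_token_state(prompts, layer):
    """Centroid of the residual stream at the final token, taken at the output of block `layer`."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[layer + 1] is the stream right after block `layer`
        # (index 0 is the embedding output).
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

# Toy stand-ins for the two regimes in the paper's dataset:
# irrelevant context -> parametric recall, relevant-but-false context -> copying.
recall_prompts = ["Question: Where is the Eiffel Tower? Context: Bananas are yellow. Answer:"]
copy_prompts = ["Question: Where is the Eiffel Tower? Context: The Eiffel Tower is in Rome. Answer:"]

# Arbitration vector = centroid difference between the two regimes.
# Adding it is meant to push toward recall (Copy->Recall); subtracting it, toward copying.
arbitration = mean_last_token_state(recall_prompts, LAYER) - mean_last_token_state(copy_prompts, LAYER)

def inject(module, inputs, output):
    """Additively inject the scaled vector into the residual stream leaving block LAYER."""
    hidden = output[0] + ALPHA * arbitration   # broadcasts over all positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok(copy_prompts[0], return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=8)[0]))
handle.remove()

With a single prompt per regime this only demonstrates the mechanics of the additive intervention; the paper aggregates over 27 relations and evaluates the shift while monitoring accuracy and fluency.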
Generative language models often struggle with specialized or less-discussed knowledge. A potential solution is Retrieval-Augmented Generation (RAG), in which models retrieve relevant information before generating responses. In this study, we explore how Atlas, a RAG model, decides between what it already knows (parametric) and what it retrieves (non-parametric). We use causal mediation analysis and controlled experiments to examine how internal representations influence information processing. Our findings disentangle the effects of parametric knowledge and the retrieved context, and indicate that when the model can choose between both types of information (parametric and non-parametric), it relies more on the context than on its parametric knowledge. Furthermore, the analysis investigates the computations involved in \textit{how} the model uses the information from the context. We find that multiple mechanisms are active within the model and can be detected with mediation analysis: first, the decision of \textit{whether} the context is relevant, and second, how the encoder computes output representations to support copying when relevant.
@inproceedings{farahani-johansson-2024-deciphering,
  title     = {Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models},
  author    = {Farahani, Mehrdad and Johansson, Richard},
  year      = {2024},
  month     = nov,
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  publisher = {Association for Computational Linguistics},
  address   = {Miami, Florida, USA},
  pages     = {16966--16977},
  doi       = {10.18653/v1/2024.emnlp-main.943},
  url       = {https://aclanthology.org/2024.emnlp-main.943/},
  editor    = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
}
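The sketch below illustrates the activation-patching flavor of causal mediation analysis used in this line of work: run the model on a clean and a corrupted prompt, splice the clean state of a chosen mediator into the corrupted run, and compare the probability of the correct answer. It is a minimal sketch under strong assumptions: Atlas and its retriever are not reproduced, gpt2 stands in as the model, and the prompts, layer, and token position of the mediator are hypothetical.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, POS = 6, -1   # hypothetical mediator: the state leaving block 6 at the last token

clean = "Context: The capital of France is Paris. The capital of France is"
corrupt = "Context: The capital of France is Berlin. The capital of France is"
answer_id = tok(" Paris")["input_ids"][0]   # first sub-token of the clean answer

def run(prompt, patch=None):
    """Return P(answer) at the next position and the mediator state, optionally patching the mediator."""
    ids = tok(prompt, return_tensors="pt")

    handle = None
    if patch is not None:
        def hook(module, inputs, output):
            hidden = output[0].clone()
            hidden[0, POS] = patch          # splice in the clean mediator state
            return (hidden,) + output[1:]
        handle = model.transformer.h[LAYER].register_forward_hook(hook)

    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()

    prob = out.logits[0, -1].softmax(-1)[answer_id].item()
    mediator = out.hidden_states[LAYER + 1][0, POS]   # stream right after block LAYER
    return prob, mediator

p_clean, clean_state = run(clean)
p_corrupt, _ = run(corrupt)
p_patched, _ = run(corrupt, patch=clean_state)   # indirect effect through the mediator

print(f"P(Paris) clean={p_clean:.3f}  corrupt={p_corrupt:.3f}  corrupt+clean mediator={p_patched:.3f}")

The gap between the corrupted and the patched probabilities gives a rough measure of how much the chosen representation mediates the model's reliance on the context versus its parametric memory; the paper performs this kind of comparison systematically across components of an encoder-decoder RAG model rather than at a single hand-picked site.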