Ji-An Li, Marcelo G. Mattar, Hua-Dong Xiong, Marcus K. Benna, Robert C. Wilson
Neurosciences Graduate Program, University of California San Diego.
Department of Psychology, New York University.
ArXiv. 2025 May 19:arXiv:2505.13763v1.
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition - the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognitive abilities enhance AI capabilities but raise safety concerns, as models might obscure their internal processes to evade neural-activation-based oversight mechanisms designed to detect harmful behaviors. Given society's increased reliance on these models, it is critical that we understand the limits of their metacognitive abilities, particularly their ability to monitor their internal activations. To address this, we introduce a neuroscience-inspired paradigm designed to quantify the ability of LLMs to explicitly report and control their activation patterns. By presenting models with sentence-label pairs where labels correspond to sentence-elicited internal activations along specific directions in the neural representation space, we demonstrate that LLMs can learn to report and control these activations. Performance varies with several factors: the number of example pairs provided, the semantic interpretability of the target neural direction, and the variance explained by that direction. These results reveal a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a subset of their neural mechanisms. Our findings provide empirical evidence quantifying metacognitive capabilities in LLMs, with significant implications for AI safety.
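To make the paradigm concrete, the sketch below illustrates one way to construct the sentence-label pairs the abstract describes: extract a model's hidden activations for each sentence, project them onto a chosen direction in representation space, and threshold the projection to obtain binary labels. This is a minimal illustration, not the authors' exact pipeline; the model name ("gpt2"), the layer index, the PCA-derived direction, and the median threshold are all assumptions made for the example.

```python
# Minimal sketch (assumptions noted above): build sentence-label pairs by
# projecting hidden activations onto a target direction and thresholding.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def sentence_activation(sentence: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden state of `sentence` at a given layer (illustrative choice)."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)  # shape: (hidden_dim,)

sentences = [
    "The stock market rallied after the announcement.",
    "The kitten curled up on the warm windowsill.",
    "Engineers debugged the failing integration test.",
    "Rain drummed softly against the cabin roof.",
]

# Target direction: here, simply the top principal component of the activations.
# The paper varies the direction (e.g., by interpretability and explained variance).
acts = torch.stack([sentence_activation(s) for s in sentences])   # (N, D)
centered = acts - acts.mean(dim=0, keepdim=True)
_, _, Vt = torch.linalg.svd(centered, full_matrices=False)
direction = Vt[0]                                                  # (D,)

# Label each sentence by whether its projection exceeds the median projection.
proj = centered @ direction                                        # (N,)
labels = (proj > proj.median()).long()

for s, p, y in zip(sentences, proj.tolist(), labels.tolist()):
    print(f"label={y}  proj={p:+.3f}  {s}")
```

Pairs produced this way can then be placed in an in-context prompt, so that the model's ability to predict (report) or steer (control) the label for a new sentence can be measured against its own activations.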