Alber Daniel Alexander, Yang Zihao, Alyakin Anton, Yang Eunice, Rai Sumedha, Valliani Aly A, Zhang Jeff, Rosenbaum Gabriel R, Amend-Thomas Ashley K, Kurland David B, Kremer Caroline M, Eremiev Alexander, Negash Bruck, Wiggan Daniel D, Nakatsuka Michelle A, Sangwon Karl L, Neifert Sean N, Khan Hammad A, Save Akshay Vinod, Palla Adhith, Grin Eric A, Hedman Monika, Nasir-Moin Mustafa, Liu Xujin Chris, Jiang Lavender Yao, Mankowski Michal A, Segev Dorry L, Aphinyanaphongs Yindalon, Riina Howard A, Golfinos John G, Orringer Daniel A, Kondziolka Douglas, Oermann Eric Karl
Department of Neurosurgery, NYU Langone Health, New York, NY, USA.
New York University Grossman School of Medicine, New York, NY, USA.
Nat Med. 2025 Feb;31(2):618-626. doi: 10.1038/s41591-024-03445-1. Epub 2025 Jan 8.
The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.
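To make the screening idea concrete, the sketch below illustrates, in simplified form, how generated text can be validated against hard-coded relationships in a knowledge graph, as the abstract describes. This is a minimal illustration, not the authors' implementation: the toy triple store, the pattern-based extract_triples step, and the Finding/screen_output names are all assumptions introduced here; a real pipeline would rely on a biomedical knowledge graph and a trained entity/relation extractor.

```python
# Minimal sketch of knowledge-graph screening of LLM output.
# NOTE: illustrative only -- the triple store, the extraction step and the
# flagging rule below are assumptions, not the paper's method.

from dataclasses import dataclass

# Toy "biomedical knowledge graph": hard-coded (subject, relation, object)
# triples treated as verified medical facts.
KNOWLEDGE_GRAPH = {
    ("metformin", "treats", "type 2 diabetes"),
    ("ibuprofen", "treats", "inflammation"),
    ("warfarin", "interacts_with", "aspirin"),
}

@dataclass
class Finding:
    triple: tuple[str, str, str]
    verified: bool

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Hypothetical relation extractor: maps sentences of the form
    '<drug> treats <condition>' to (subject, relation, object) triples.
    A real pipeline would use a biomedical NER/relation-extraction model."""
    triples = []
    for sentence in text.lower().split("."):
        if " treats " in sentence:
            subj, obj = sentence.split(" treats ", 1)
            triples.append((subj.strip(), "treats", obj.strip()))
    return triples

def screen_output(llm_output: str) -> list[Finding]:
    """Check every extracted claim against the knowledge graph;
    any relation absent from the graph is flagged for review."""
    return [
        Finding(t, verified=t in KNOWLEDGE_GRAPH)
        for t in extract_triples(llm_output)
    ]

if __name__ == "__main__":
    out = "Metformin treats type 2 diabetes. Metformin treats glioblastoma."
    for f in screen_output(out):
        status = "OK" if f.verified else "FLAGGED: unsupported medical claim"
        print(f.triple, "->", status)
```

The design point is the one the abstract makes: stochastic model outputs are checked against deterministic, curated relationships, so claims the graph cannot corroborate are surfaced for human review rather than passed through.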