
Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records.

Author Information

Frydman-Gani Clara, Arias Alejandro, Vallejo Maria Perez, Londoño Martínez John Daniel, Valencia-Echeverry Johanna, Castaño Mauricio, Bui Alex A T, Freimer Nelson B, Lopez-Jaramillo Carlos, Olde Loohuis Loes M

Affiliations

Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, USA.

Department of Mental Health and Human Behavior, University of Caldas, Manizales, Colombia.

Publication Information

medRxiv. 2025 Aug 12:2025.08.07.25333172. doi: 10.1101/2025.08.07.25333172.

Abstract

The accurate detection of clinical phenotypes from electronic health records (EHRs) is pivotal for advancing large-scale genetic and longitudinal studies in psychiatry. Free-text clinical notes are an essential source of symptom-level information, particularly in psychiatry. However, the automated extraction of symptoms from clinical text remains challenging. Here, we tested 11 open-source generative large language models (LLMs) for their ability to detect 109 psychiatric phenotypes from clinical text, using annotated EHR notes from a psychiatric clinic in Colombia. The LLMs were evaluated both "out-of-the-box" and after fine-tuning, and compared against a traditional natural language processing (tNLP) method developed from the same data. We show that while base LLM performance was poor to moderate (macro-F1 of 0.2-0.6 for zero-shot; 0.2-0.74 for few-shot), it improved significantly after fine-tuning (0.75-0.86 macro-F1), with several fine-tuned LLMs outperforming the tNLP method. In total, 100 phenotypes could be reliably detected (F1 > 0.8) using either a fine-tuned LLM or tNLP. To generate a fine-tuned LLM that can be shared with the scientific and medical community, we created a fully synthetic dataset free of patient information but based on the original annotations. We fine-tuned a top-performing LLM on these data, creating "Mistral-small-psych", an LLM that can detect psychiatric phenotypes in Spanish text with performance comparable to that of LLMs trained on real EHR data (macro-F1 = 0.79). Finally, the fine-tuned LLMs underwent external validation on data from a large psychiatric hospital in Colombia, the Hospital Mental de Antioquia, showing that most LLMs generalized well (0.02-0.16 point loss in macro-F1). Our study underscores the value of domain-specific adaptation of LLMs and introduces a new model for accurate psychiatric phenotyping in Spanish text, paving the way for global precision psychiatry.
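The abstract reports performance as macro-F1, the unweighted mean of per-phenotype F1 scores, which weights rare and common phenotypes equally. The sketch below illustrates that metric for multi-label phenotype detection; the label names, notes, and predictions are invented for illustration and do not come from the paper's data.

```python
# Illustrative sketch of macro-F1 for multi-label phenotype detection.
# All labels and predictions below are hypothetical examples.
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores (each phenotype counts equally)."""
    f1_scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab not in p)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy example: two clinical notes, three phenotype labels
labels = ["delusions", "insomnia", "anhedonia"]
y_true = [{"delusions", "insomnia"}, {"anhedonia"}]          # annotated phenotypes
y_pred = [{"delusions"}, {"anhedonia", "insomnia"}]          # model output
print(round(macro_f1(y_true, y_pred, labels), 3))  # → 0.667
```

Because each of the 109 phenotypes contributes equally to the mean, a model cannot reach a high macro-F1 by doing well only on frequent phenotypes, which is why the paper's reported ranges (e.g. 0.75-0.86 after fine-tuning) reflect broad coverage rather than performance on a few common symptoms.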

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa19/12363723/b0842a5bf01c/nihpp-2025.08.07.25333172v1-f0001.jpg
