Edwards Aleksandra, Pardiñas Antonio F, Kirov George, Rees Elliott, Camacho-Collados Jose
School of Computer Science and Informatics, Cardiff University, Cathays, Cardiff, CF24 4AG, United Kingdom, 1 029 2087 4812.
School of Medicine, Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom.
JMIR AI. 2025 Sep 18;4:e72256. doi: 10.2196/72256.
Free-text clinical data are unstructured and narrative in nature, providing a rich source of patient information, but extracting research-quality clinical phenotypes from these data remains a challenge. Manually reviewing and extracting clinical phenotypes from free-text patient notes is time-consuming and not suitable for large-scale datasets. On the other hand, automatically extracting clinical phenotypes can be challenging because medical researchers lack gold-standard annotated references and other purpose-built resources, including software. Recent large language models (LLMs) can understand natural language instructions, which helps them adapt to different domains and tasks without the need for task-specific training data. This makes them suitable for clinical applications, though their use in this field has so far been limited.
We aimed to develop an LLM pipeline based on the few-shot learning framework that could extract clinical information from free-text clinical summaries. We assessed the performance of this pipeline for classifying individuals with confirmed or suspected comorbid intellectual disability (ID) from clinical summaries of patients with severe mental illness and performed genetic validation of the results by testing whether individuals with LLM-defined ID carried more genetic variants known to confer risk of ID when compared with individuals without LLM-defined ID.
We developed novel approaches for performing classification, based on an intermediate information extraction (IE) step and human-in-the-loop techniques. We evaluated two models: Fine-tuned Language Net Text-To-Text Transfer Transformer (Flan-T5) and Large Language Model Meta AI (LLaMA). The dataset comprised 1144 free-text clinical summaries, of which 314 were manually annotated and used as a gold standard for evaluating automated methods. We also used published genetic data from 547 individuals to perform a genetic validation of the classification results; Firth's penalized logistic regression framework was used to test whether individuals with LLM-defined ID carry significantly more de novo variants in known developmental disorder risk genes than individuals without LLM-defined ID.
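The classification approach described above (an IE step, an optional human check, then few-shot classification) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the label strings, the `generate` callable (any function mapping a prompt to model text, for example a wrapper around Flan-T5), and the `review` hook are all hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical label strings; the study's exact labels are not given in the abstract.
LABELS = ("ID", "no ID")


def build_ie_prompt(summary: str) -> str:
    """Stage 1 prompt: ask the model to pull out only the relevant statements."""
    return (
        "Extract every statement about intellectual or learning difficulties "
        "from the clinical summary. Answer 'none' if there is none.\n\n"
        f"Summary: {summary}\nExtracted:"
    )


def build_classify_prompt(extracted: str, examples: Dict[str, str]) -> str:
    """Stage 2 prompt: one worked example per label, then the text to classify."""
    shots = "".join(
        f"Text: {text}\nLabel: {label}\n\n" for label, text in examples.items()
    )
    return f"{shots}Text: {extracted}\nLabel:"


def classify_summary(
    summary: str,
    generate: Callable[[str], str],
    examples: Dict[str, str],
    review: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Run IE, optionally let a human correct the extraction, then classify."""
    extracted = generate(build_ie_prompt(summary)).strip()
    if review is not None:
        # Human-in-the-loop hook: a reviewer may edit or replace the extraction.
        extracted = review(summary, extracted)
    answer = generate(build_classify_prompt(extracted, examples)).strip()
    # Anything outside the label set is flagged for a person rather than guessed.
    return answer if answer in LABELS else "needs review"
```

Keeping the extraction as a separate, inspectable string is what makes manual validation cheap in this sketch: a reviewer reads one short extracted snippet instead of the full summary.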
The results demonstrate that a 2-stage approach, combining IE with manual validation, can effectively identify individuals with suspected ID from free-text patient records, requiring only a single training example per classification label. The best-performing method, based on the Flan-T5 model and incorporating the IE step, achieved an F1-score of 0.867. Individuals classified as having ID by the best-performing model were significantly enriched for de novo variants in known developmental disorder risk genes (odds ratio 29.1, 95% CI 7.36-107; P=2.1×10⁻⁵).
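The enrichment estimate above comes from Firth's penalized logistic regression, which keeps odds ratios finite when one group contains no variant carriers at all. A minimal NumPy sketch of the estimator follows, under stated assumptions: the counts are invented for illustration and are not the study's data, carrier status is treated as the outcome with LLM-defined ID as the only predictor, and no step-halving or confidence intervals are included. The study's actual covariates and software are not stated here.

```python
import numpy as np


def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Fit logistic regression with Firth's bias-reducing penalty.

    Uses Fisher scoring on the modified score
    U*(beta) = X^T (y - p + h * (0.5 - p)),
    where h holds the diagonal of the weighted hat matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])          # Fisher information X^T W X
        info_inv = np.linalg.inv(info)
        # h_i = w_i * x_i^T (X^T W X)^{-1} x_i
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        score = X.T @ (y - p + h * (0.5 - p))  # Firth-modified score
        step = info_inv @ score
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Invented counts (NOT study data): 20 people with LLM-defined ID, 4 of them
# carriers; 40 people without LLM-defined ID, none of them carriers.
has_id = np.array([1] * 20 + [0] * 40)
carrier = np.array([1] * 4 + [0] * 16 + [0] * 40)

X = np.column_stack([np.ones_like(has_id), has_id])  # intercept + ID indicator
beta = firth_logistic(X, carrier)
odds_ratio = float(np.exp(beta[1]))
print(f"Firth odds ratio for carriers given ID: {odds_ratio:.2f}")
```

With a zero cell like this one, ordinary maximum likelihood would push the odds ratio toward infinity. For a single binary predictor, the Firth estimate coincides with the odds ratio obtained after adding 0.5 to each cell of the 2×2 table, which gives a convenient check on the fit.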
LLMs and in-context learning techniques combined with human-in-the-loop approaches can be highly beneficial for extraction and categorization of information from free-text clinical data. In this proof-of-concept study, we show that LLMs can be used to identify individuals with a severe mental illness who also have suspected ID, which is a biologically and clinically meaningful subgroup of patients.