Sharabiani Mansour, Mahani Alireza, Bottle Alex, Srinivasan Yadav, Issitt Richard, Stoica Serban
School of Public Health, Imperial College London, London, UK.
New York Stock Exchange, New York, United States.
Sci Rep. 2025 Jul 1;15(1):20847. doi: 10.1038/s41598-025-04651-8.
The emergence of large language models (LLMs) opens new horizons for leveraging often-unused information in clinical text. Our study aims to capitalise on this potential. Specifically, we examine the utility of text embeddings generated by LLMs in predicting postoperative acute kidney injury (AKI) in paediatric cardiopulmonary bypass (CPB) patients using electronic health record (EHR) text, and propose methods for explaining their output. AKI can be a serious complication of paediatric CPB, and its accurate prediction can significantly improve patient outcomes by enabling timely interventions. We evaluate various text-embedding algorithms, including Doc2Vec, top-performing sentence transformers on Hugging Face, and commercial LLMs from Google and OpenAI. We benchmark the cross-validated performance of these 'AI models' against a 'baseline model' as well as an established, clinically defined 'expert model'. The baseline model includes structured features, i.e., patient gender, age, height, body mass index and length of operation. The majority of AI models surpass not only the baseline model but also the expert model. An ensemble of the AI and clinical-expert models improves discriminative performance by 23% compared to the baseline model. The consistency of patient clusters formed from AI-generated embeddings with clinical-expert clusters, measured via the adjusted Rand index and adjusted mutual information metrics, illustrates the medical validity of LLM embeddings. We create a reverse mapping from the numeric embedding space to the natural-language domain via the embedding-based clusters, generating medical labels for the clusters in the process. We also use text-generating LLMs to summarise the differences between the AI and expert clusters. Such 'explainability' outputs can increase medical practitioners' trust in AI applications and help generate new hypotheses, e.g., by studying the association between cluster memberships and outcomes of interest.
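The cluster-consistency comparison described above can be illustrated with a minimal, from-scratch sketch of the adjusted Rand index over two label assignments (one from embedding-based clusters, one from expert clusters). This is an assumption-laden illustration of the metric itself, not the authors' implementation; in practice an established library routine (e.g. from scikit-learn) would be used, and the function name and toy labels below are invented for the example.

```python
# Illustrative sketch only: adjusted Rand index (ARI) between two clusterings,
# e.g. AI-embedding-derived patient clusters vs. clinical-expert clusters.
# Not the paper's code; labels and names are hypothetical.
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """ARI via the contingency-table formula; 1.0 = identical partitions,
    ~0.0 = chance-level agreement (can be negative)."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))   # contingency cells n_ij
    a_counts = Counter(labels_a)                     # row sums a_i
    b_counts = Counter(labels_b)                     # column sums b_j
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    expected = sum_a * sum_b / comb(n, 2)            # chance-expected index
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                        # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions (up to relabelling) score 1.0:
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Note that ARI is invariant to label permutation, which matters here because AI-derived and expert-defined clusters carry unrelated label names; the adjusted mutual information metric mentioned in the abstract plays an analogous, entropy-based role.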