Bazoge Adrien, Wargny Matthieu, Constant Dit Beaufils Pacôme, Morin Emmanuel, Daille Béatrice, Gourraud Pierre-Antoine, Hadjadj Samy
Nantes Université, CHU Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000, Nantes, France; Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000, Nantes, France.
Nantes Université, CHU Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000, Nantes, France; Nantes Université, CHU Nantes, Département d'Endocrinologie, Diabétologie et Nutrition, l'institut du thorax, Inserm, CNRS, Hôpital Guillaume et René Laennec, F-44000, Nantes, France.
Comput Biol Med. 2025 Sep;195:110609. doi: 10.1016/j.compbiomed.2025.110609. Epub 2025 Jun 19.
Understanding acute heart failure (AHF) remains a significant challenge, as many clinical details are recorded in unstructured text rather than as structured data in electronic health records (EHRs). In this study, we explored the use of large language models (LLMs) to automatically identify AHF hospitalizations and extract accurate AHF-related clinical information from clinical notes. Using clinical notes from Nantes University Hospital in France, we evaluated a general-purpose LLM, Qwen2-7B, against a French biomedical pretrained model, DrLongformer. We explored supervised fine-tuning and in-context learning techniques, such as few-shot and chain-of-thought prompting, and performed an ablation study to analyze the impact of data volume and annotation characteristics on model performance. Our results demonstrated that DrLongformer achieved superior performance in classifying AHF hospitalizations, with an F1 score of 0.878 compared to 0.80 for Qwen2-7B, and similarly outperformed it in extracting most of the clinical information. However, Qwen2-7B performed better at extracting quantitative outcomes (e.g., weight and body mass index) when fine-tuned on the training set. Our ablation study revealed that the number of clinical notes used in training is a significant factor influencing model performance, but that improvements plateaued after 250 documents. Additionally, we observed that longer annotations negatively impact model training and downstream performance. These findings highlight the potential of small language models, which can be hosted on-premise in hospitals and integrated with EHRs, to improve real-world data collection and identify complex medical conditions such as acute heart failure.