Cui Wanting, Finkelstein Joseph
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT.
Stud Health Technol Inform. 2025 Aug 7;329:1840-1841. doi: 10.3233/SHTI251241.
This study explored whether large language models (LLMs), such as ClinicalBERT and GPT-4, can identify likely diagnoses from early clinical notes in the MIMIC-III dataset. We compared these models across four conditions: circulatory system diseases, respiratory system diseases, septicemia, and pneumonia. ClinicalBERT consistently outperformed the GPT models, reaching its highest F1-score, 0.952, for respiratory system diseases. The GPT models showed high recall but lower precision; their highest F1-score, 0.784, was achieved by the GPT binary voting method. Overall, ClinicalBERT demonstrated strong precision and F1-scores, while GPT-4 excelled in recall.
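As a rough illustration of the metrics reported above, the following minimal sketch shows how precision, recall, and F1-score relate, and why a high-recall, lower-precision classifier ends up with a moderate F1-score. The abstract does not describe how the binary voting method works; the majority vote over repeated yes/no answers used here is an assumption, and the function names and toy data are hypothetical and are not taken from the study.

```python
def majority_vote(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0.

    Assumption: the abstract does not specify the voting rule; a simple
    majority over repeated yes/no model answers is used for illustration.
    """
    return 1 if sum(votes) * 2 > len(votes) else 0


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = condition present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: six notes, three yes/no answers per note.
y_true = [1, 0, 1, 0, 0, 1]
votes_per_note = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]

y_pred = [majority_vote(v) for v in votes_per_note]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every true positive is found (recall 1.0), but two false positives pull precision down to 0.6, giving an F1-score of 0.75. This mirrors the general pattern the abstract reports for the GPT models, not their actual numbers.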