Menezes Maria Clara Saad, Hoffmann Alexander F, Tan Amelia L M, Nalbandyan Mariné, Omenn Gilbert S, Mazzotti Diego R, Hernández-Arango Alejandro, Visweswaran Shyam, Venkatesh Shruthi, Mandl Kenneth D, Bourgeois Florence T, Lee James W K, Makmur Andrew, Hanauer David A, Semanik Michael G, Kerivan Lauren T, Hill Terra, Forero Julian, Restrepo Carlos, Vigna Matteo, Ceriana Piero, Abu-El-Rub Noor, Avillach Paul, Bellazzi Riccardo, Callaci Thomas, Gutiérrez-Sacristán Alba, Malovini Alberto, Mathew Jomol P, Morris Michele, Murthy Venkatesh L, Buonocore Tommaso M, Parimbelli Enea, Patel Lav P, Sáez Carlos, Samayamuthu Malarkodi Jebathilagam, Thompson Jeffrey A, Tibollo Valentina, Xia Zongqi, Kohane Isaac S
Department of Biomedical Informatics, Medical School, Harvard University, Boston, MA, USA; Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, USA.
Department of Biomedical Informatics, Medical School, Harvard University, Boston, MA, USA.
Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.
Patient notes contain substantial information but are difficult for computers to analyse because of their unstructured format. Large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4), have changed our ability to process text, but we do not know how effectively they handle medical notes. We aimed to assess the ability of GPT-4 to answer predefined questions after reading medical notes in three different languages.
For this retrospective model-evaluation study, we included eight university hospitals from four countries (ie, the USA, Colombia, Singapore, and Italy). Each site submitted seven de-identified medical notes, each relating to a separate patient, to the coordinating centre between June 1, 2023, and Feb 28, 2024. Medical notes were written between Feb 1, 2020, and June 1, 2023. One site provided medical notes in Spanish, one site provided notes in Italian, and the remaining six sites provided notes in English. We included admission notes, progress notes, and consultation notes; no discharge summaries were included in this study. We advised participating sites to choose notes for patients who, at the time of hospital admission, were male or female, aged 18-65 years, had a diagnosis of obesity, had a diagnosis of COVID-19, and had an admission note available. Adherence to these criteria was optional, and participating sites randomly chose which medical notes to submit. When entering information into GPT-4, we prepended each medical note with an instruction prompt and a list of 14 questions that had been chosen a priori. Each medical note was given to GPT-4 individually, in its original language, and in a separate session; the questions were always given in English. At each site, two physicians independently validated the responses from GPT-4 and answered all 14 questions themselves. Each pair of physicians evaluated only the responses from GPT-4 to the seven medical notes from their own site. Physicians were not masked to the responses from GPT-4 before providing their own answers, but were masked to the responses of the other physician.
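The prompting setup described above can be sketched as follows. The abstract does not reproduce the actual instruction prompt or the 14 predefined questions, so the instruction wording and question list below are illustrative placeholders only:

```python
# Sketch of the prompt construction described in the Methods: each medical
# note is prepended with an instruction prompt and the fixed, numbered list
# of questions (always in English), then sent to the model in its own
# session. INSTRUCTION and QUESTIONS are hypothetical stand-ins; the study's
# actual prompts are not given in the abstract.

INSTRUCTION = (
    "Read the medical note below and answer each numbered question in "
    "English. If the note does not contain the information, answer "
    "'not reported'."
)

# Hypothetical stand-ins for the 14 predefined questions.
QUESTIONS = [
    "What is the patient's age?",
    "Does the patient have a diagnosis of obesity?",
    "Does the patient have a diagnosis of COVID-19?",
    # ... the study used 14 questions in total
]

def build_prompt(note_text: str, questions: list[str]) -> str:
    """Prepend the instruction and numbered question list to one note."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return (
        f"{INSTRUCTION}\n\n"
        f"Questions:\n{numbered}\n\n"
        f"Medical note:\n{note_text}"
    )
```

Each assembled prompt would then be submitted to GPT-4 in a fresh session, one note per session, so that answers to one note cannot leak into another; the note itself stays in its original language while the instruction and questions remain in English.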
We collected 56 medical notes, of which 42 (75%) were in English, seven (13%) were in Italian, and seven (13%) were in Spanish. GPT-4 answered 14 questions for each medical note, giving 784 responses in total. In 622 (79%, 95% CI 76-82) of 784 responses, both physicians agreed with GPT-4; in 82 (11%, 8-13) responses, only one physician agreed with GPT-4; and in the remaining 80 (10%, 8-13) responses, neither physician agreed with GPT-4. Both physicians agreed with GPT-4 more often for medical notes written in Spanish (86 [88%, 95% CI 79-93] of 98 responses) and Italian (82 [84%, 75-90] of 98 responses) than for those written in English (454 [77%, 74-80] of 588 responses).
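The headline agreement figure can be recomputed from the counts given above. The abstract does not state which 95% CI method was used; a Wilson score interval, a common choice for binomial proportions, is assumed here, and for the overall 622-of-784 figure it reproduces the reported 76-82 interval after rounding:

```python
# Recompute the overall agreement proportion and a 95% CI for it.
# The Wilson score interval is an assumption: the abstract does not
# name the CI method, but Wilson matches the reported overall interval
# (76-82) once rounded to whole percentages.
import math

def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom, (centre + margin) / denom

low, high = wilson_ci(622, 784)
print(f"agreement {622/784:.0%}, 95% CI {low:.1%}-{high:.1%}")
# → agreement 79%, 95% CI 76.4%-82.0%
```

The same function applied to the per-language counts (86 of 98, 82 of 98, 454 of 588) gives intervals close to, but not exactly matching, the reported per-language CIs, so the paper may have used a different interval or rounding convention for those.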
The results of our model-evaluation study suggest that GPT-4 is accurate when analysing medical notes in three different languages. In the future, research should explore how LLMs can be integrated into clinical workflows to maximise their use in health care.
Funding: None.