Bhayana Rajesh, Alwahbi Omar, Ladak Aly Muhammad, Deng Yangqing, Basso Dias Adriano, Elbanna Khaled, Abreu Gomez Jorge, Jajodia Ankush, Jhaveri Kartik, Johnson Sarah, Kajal Dilkash, Wang David, Soong Christine, Kielar Ania, Krishna Satheesh
From the Joint Department of Medical Imaging, University Medical Imaging Toronto, Princess Margaret Cancer Centre, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Building, 1st Fl, Toronto, ON, Canada M5G 2C4 (R.B., O.A., A.B.D., K.E., J.A.G., A.J., K.J., S.J., D.K., D.W., A.K., S.K.); Department of Medicine, University of Toronto, Toronto, Canada (A.M.L.); Department of Biostatistics, University Health Network, Toronto, Canada (Y.D.); and Department of General Internal Medicine, Mount Sinai Hospital, Toronto, Canada (C.S.).
Radiology. 2025 Feb;314(2):e242134. doi: 10.1148/radiol.242134.
Background Clinical information improves imaging interpretation, but physician-provided histories on requisitions for oncologic imaging often lack key details. Purpose To evaluate large language models (LLMs) for automatically generating clinical histories for oncologic imaging requisitions from clinical notes and to compare them with original requisition histories. Materials and Methods In total, 207 patients who underwent CT at a cancer center from January to November 2023 and who had an electronic health record clinical note coinciding with the ordering date were randomly selected. A multidisciplinary team informed selection of 10 parameters important for oncologic imaging history, including primary oncologic diagnosis, treatment history, and acute symptoms. Clinical notes were independently reviewed to establish the reference standard regarding the presence of each parameter. After prompt engineering with seven patients, GPT-4 (version 0613; OpenAI) was prompted on April 9, 2024, to automatically generate structured clinical histories for the 200 remaining patients. Using the reference standard, LLM extraction performance was calculated (recall, precision, F1 score). LLM-generated and original requisition histories were compared for completeness (proportion including each parameter), and 10 radiologists performed pairwise comparison for quality, preference, and subjective likelihood of harm. Results For the 200 LLM-generated histories, GPT-4 performed well, extracting oncologic parameters from clinical notes (F1 = 0.983). Compared with original requisition histories, LLM-generated histories more frequently included parameters critical for radiologist interpretation, including primary oncologic diagnosis (99.5% vs 89% [199 and 178 of 200 histories, respectively]; P < .001), acute or worsening symptoms (15% vs 4% [29 and 7 of 200]; P < .001), and relevant surgery (61% vs 12% [122 and 23 of 200]; P < .001).
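The extraction metrics reported above (recall, precision, F1) can be computed by pooling, across patients, the parameters correctly extracted, spuriously extracted, and missed relative to the reference standard. A minimal sketch follows; the parameter names and counts are illustrative assumptions, not the study's data, and the paper does not specify this exact pooling scheme.

```python
# Micro-averaged recall, precision, and F1 for multi-parameter extraction.
# Illustrative sketch only; not the study's actual scoring code or data.

def extraction_metrics(true_sets, pred_sets):
    """Compare extracted parameter sets against reference sets per patient."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # parameters correctly extracted
        fp += len(pred - truth)   # parameters extracted but not in the note
        fn += len(truth - pred)   # parameters in the note but missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

# Hypothetical example: two patients, parameter names chosen for illustration.
truth = [{"diagnosis", "surgery", "symptoms"}, {"diagnosis", "treatment"}]
pred = [{"diagnosis", "surgery"}, {"diagnosis", "treatment", "symptoms"}]
r, p, f1 = extraction_metrics(truth, pred)  # each 0.8 here
```

With one missed and one spurious parameter across five reference parameters, recall, precision, and F1 all equal 0.8 in this toy example.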
Radiologists preferred LLM-generated histories for imaging interpretation (89% vs 5%, 7% equal; P < .001), indicating they would enable more complete interpretation (86% vs 0%, 15% equal; P < .001) and have a lower likelihood of harm (3% vs 55%, 42% neither; P < .001). Conclusion An LLM enabled accurate automated generation of clinical histories for oncologic imaging from clinical notes. Compared with original requisition histories, LLM-generated histories were more complete and were preferred by radiologists for imaging interpretation and perceived safety. © RSNA, 2025 See also the editorial by Tavakoli and Kim in this issue.
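The completeness comparisons above are paired: each patient contributes both an LLM-generated and an original requisition history, so only discordant pairs (parameter present in one history but not the other) are informative. One standard choice for such data is the exact McNemar test, sketched below; the paper does not state that this specific test was used, and the discordant counts in the example are an illustrative assumption.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar test on discordant pair counts.

    b: pairs where only the LLM history included the parameter;
    c: pairs where only the original requisition history did.
    Sketch of a standard paired-proportion test, not necessarily
    the test used in the study.
    """
    n = b + c
    k = min(b, c)
    # Exact binomial tail under H0: discordance is symmetric (p = 0.5),
    # doubled for a two-sided test and capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Illustrative: relevant surgery appeared in 122 vs 23 of 200 paired
# histories. If (hypothetically) every original mention is also an LLM
# mention, there are b = 99 discordant pairs favoring the LLM and c = 0.
p_value = mcnemar_exact_p(99, 0)  # far below .001
```

With all 99 discordant pairs favoring one method, the resulting P value is vanishingly small, consistent with the P < .001 results reported.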