Masayoshi Kanato, Masahiro Hashimoto, Naoki Toda, Hirozumi Mori, Goh Kobayashi, Hasnine Haque, Mizuki So, Masahiro Jinzaki
Department of Radiology, School of Medicine, Keio University, 35 Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan, +81 3-3353-1211 ext 62477.
GE Healthcare Japan, 4-7-127, Asahigaoka, Hino, Tokyo, Japan.
JMIR AI. 2025 Jul 22;4:e68020. doi: 10.2196/68020.
Ultrasound examinations, while valuable, are time-consuming and often limited in availability. Many hospitals therefore implement reservation systems; however, these systems typically do not prioritize requests by examination purpose. Our hospital instead uses a waitlist system that prioritizes examination requests by clinical value when slots open up due to cancellations. This system, however, requires manual review of examination purposes, which are recorded as free-form text. We hypothesized that artificial intelligence language models could preliminarily estimate the priority of requests before manual review.
This study aimed to investigate potential challenges associated with using language models for estimating the priority of medical examination requests and to evaluate the performance of language models in processing Japanese medical texts.
We retrospectively collected ultrasound examination requests from the waitlist system at Keio University Hospital, spanning January 2020 to March 2023. Each request comprised an examination purpose documented by the requesting physician and a 6-tier priority level assigned by a radiologist during the clinical workflow. We fine-tuned JMedRoBERTa, Luke, OpenCalm, and LLaMA2 under two conditions: (1) tuning only the final layer and (2) tuning all layers using either standard backpropagation or low-rank adaptation.
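The second tuning condition mentions low-rank adaptation (LoRA). As a minimal numerical sketch of how a LoRA update works, the following is illustrative only, not the authors' training code: the layer sizes, rank, and scaling factor are hypothetical, and the frozen pretrained weight stands in for a layer of one of the models above.

```python
import numpy as np

# Minimal LoRA sketch (hypothetical sizes): the pretrained weight W is frozen,
# and only the low-rank factors A and B would be trained.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection; zero init
                                       # makes the adapter a no-op at start

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  -- only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Before any training, the adapted layer matches the frozen layer exactly.
print(np.allclose(lora_forward(x), W @ x))  # True
# LoRA trains far fewer parameters than full backpropagation over W:
print(A.size + B.size, "vs", W.size)        # 32 vs 64
```

The zero initialization of B ensures fine-tuning starts from the pretrained model's behavior; the parameter count gap (here 32 vs 64, and far larger at realistic dimensions) is what makes full-layer adaptation of models such as LLaMA2 tractable.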
After data cleaning, the training and test datasets contained 2335 and 204 requests, respectively. When only the final layer was tuned, JMedRoBERTa outperformed the other models (Kendall coefficient=0.225). With full fine-tuning, JMedRoBERTa still performed best (Kendall coefficient=0.254), although its margin over the other models narrowed. A radiologist's retrospective re-evaluation yielded a Kendall coefficient of 0.221.
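The Kendall coefficient used above can be sketched in pure Python. This assumes the tau-b variant, which corrects for tied pairs (plausible for 6-tier priorities, where ties are common, though the abstract does not specify the variant); the example priority lists are hypothetical, not the study's data.

```python
import math

def kendall_tau_b(x, y):
    """Kendall tau-b: rank correlation with a correction for tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both rankings: excluded
            if dx == 0:
                ties_x += 1           # tied only in x
            elif dy == 0:
                ties_y += 1           # tied only in y
            elif dx * dy > 0:
                concordant += 1       # pair ordered the same way
            else:
                discordant += 1       # pair ordered oppositely
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# Hypothetical 6-tier priorities: radiologist-assigned vs model-predicted.
assigned = [1, 2, 2, 3, 4, 5, 6, 3]
predicted = [1, 3, 2, 3, 5, 4, 6, 2]
print(round(kendall_tau_b(assigned, predicted), 3))
```

A coefficient of 1.0 means identical orderings, -1.0 a fully reversed ordering, and values near 0.22-0.25, as reported above, indicate modest but positive agreement with the assigned priorities.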
Language models can estimate the priority of examination requests with accuracy comparable with that of human radiologists. The fine-tuning results indicate that general-purpose language models can be adapted to domain-specific texts (ie, Japanese medical texts) with sufficient fine-tuning. Further research is required to address priority rank ambiguity, expand the dataset across multiple institutions, and explore more recent language models with potentially higher performance or better suitability for this task.