Yamagishi Yosuke, Nakamura Yuta, Kikuchi Tomohiro, Sonoda Yuki, Hirakawa Hiroshi, Kano Shintaro, Nakamura Satoshi, Hanaoka Shouhei, Yoshikawa Takeharu, Abe Osamu
Division of Radiology and Biomedical Engineering, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan, 81 3-3815-5411.
Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Tokyo, Japan.
JMIR Med Inform. 2025 Aug 28;13:e71137. doi: 10.2196/71137.
Recent advances in large language models have highlighted the need for high-quality multilingual medical datasets. Although Japan is a global leader in computed tomography (CT) scanner deployment and use, the absence of large-scale Japanese radiology datasets has hindered the development of specialized language models for medical imaging analysis. Despite the emergence of multilingual models and language-specific adaptations, the development of Japanese-specific medical language models has been constrained by a lack of comprehensive datasets, particularly in radiology.
This study aims to address this critical gap in Japanese medical natural language processing resources by developing a comprehensive Japanese CT report dataset through machine translation and by building a specialized language model for structured classification of findings. In addition, a rigorously validated evaluation dataset was created through expert radiologist refinement to ensure reliable assessment of model performance.
We translated the CT-RATE dataset (24,283 CT reports from 21,304 patients) into Japanese using GPT-4o mini. The training dataset consisted of 22,778 machine-translated reports, and the validation dataset included 150 reports carefully revised by radiologists. We developed CT-BERT-JPN, a specialized Bidirectional Encoder Representations from Transformers (BERT) model for Japanese radiology text based on the "tohoku-nlp/bert-base-japanese-v3" architecture, to extract 18 structured findings from reports. Translation quality was assessed with Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores and further evaluated by radiologists in a dedicated human-in-the-loop experiment. In that experiment, each report in a randomly selected subset was independently reviewed by 2 radiologists, 1 senior (postgraduate year [PGY] 6-11) and 1 junior (PGY 4-5), who used a 5-point Likert scale to rate (1) grammatical correctness, (2) medical terminology accuracy, and (3) overall readability. Inter-rater reliability was measured with the quadratic weighted kappa (QWK). Model performance was benchmarked against GPT-4o using accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (ROC-AUC), and average precision.
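As a minimal sketch (not the authors' released code), the CT-BERT-JPN setup described above can be approximated as a multi-label text classifier built on the "tohoku-nlp/bert-base-japanese-v3" checkpoint with Hugging Face Transformers; the function name, 0.5 decision threshold, and maximum sequence length below are illustrative assumptions.

```python
# Hedged sketch: multi-label classifier over 18 structured findings for Japanese
# CT report text. The bert-base-japanese-v3 tokenizer requires the fugashi and
# unidic-lite packages for MeCab-based word segmentation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "tohoku-nlp/bert-base-japanese-v3"
NUM_FINDINGS = 18  # one binary label per structured finding

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_FINDINGS,
    problem_type="multi_label_classification",  # trains with BCEWithLogitsLoss
)

def predict_findings(report_text: str, threshold: float = 0.5) -> list[int]:
    """Return a 0/1 vector over the 18 findings for one Japanese CT report."""
    inputs = tokenizer(
        report_text, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return (probs >= threshold).int().tolist()
```

In practice, the classification head would first be fine-tuned on the 22,778 machine-translated training reports (for example, with the Transformers Trainer) before inference; the snippet only shows the model wiring and thresholded prediction.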
General text structure was preserved (BLEU: 0.731 findings, 0.690 impression; ROUGE: 0.770-0.876 findings, 0.748-0.857 impression), though expert review identified 3 categories of necessary refinements: contextual adjustment of technical terms, completion of incomplete translations, and localization of Japanese medical terminology. The radiologist-revised translations scored significantly higher than the raw machine translations across all 3 dimensions (P<.001). CT-BERT-JPN outperformed GPT-4o on 11 of 18 findings (61%), achieving perfect F1-scores for 4 conditions and F1-scores above 0.95 for 14 conditions, despite varied sample sizes (7-82 cases).
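The per-finding benchmark against GPT-4o and the inter-rater agreement rest on standard metrics; as a hedged sketch (not the paper's evaluation code), they could be computed with scikit-learn as follows, where y_true, y_prob, and the two raters' Likert score lists are placeholder inputs.

```python
# Hedged sketch: per-finding classification metrics and quadratic weighted kappa.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    cohen_kappa_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

def finding_metrics(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Metrics for one finding: binary ground truth vs. predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc_score(y_true, y_prob),  # requires both classes present
        "average_precision": average_precision_score(y_true, y_prob),
    }

def quadratic_weighted_kappa(senior: list[int], junior: list[int]) -> float:
    """Inter-rater reliability between the senior and junior radiologists' 5-point ratings."""
    return cohen_kappa_score(senior, junior, weights="quadratic")
```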
Our study established a robust Japanese CT report dataset and demonstrated the effectiveness of a specialized language model for structured classification of findings. This hybrid approach of machine translation and expert validation enabled the creation of a large-scale dataset while maintaining high quality. This study provides essential resources for advancing medical artificial intelligence research in Japanese health care settings; the datasets and models are publicly available for research to facilitate further advancement in the field.