Aichi Chien, Hubert Tang, Bhavita Jagessar, Kai-Wei Chang, Nanyun Peng, Kambiz Nael, Noriko Salamon
From the Department of Radiological Science (A.C., H.T., B.J., K.N., N.S.), David Geffen School of Medicine at UCLA, Los Angeles, California.
AJNR Am J Neuroradiol. 2024 Feb 7;45(2):244-248. doi: 10.3174/ajnr.A8102.
Review of clinical reports is an essential part of monitoring disease progression, and synthesizing multiple imaging reports is important for clinical decision-making. Aggregating this information quickly and accurately is critical. Machine learning natural language processing (NLP) models hold promise to address the unmet need for automated report summarization.
We evaluated NLP methods for summarizing longitudinal aneurysm reports. A total of 137 clinical reports and 100 PubMed case reports were used in this study. Models were 1) compared against expert-generated summaries of longitudinal imaging notes collected at our institution and 2) compared using publicly accessible PubMed case reports. Five AI models were used to summarize the clinical reports, and a sixth, the online GPT3davinci NLP large language model (LLM), was added for summarization of the PubMed case reports. We assessed summary quality by comparison with expert summaries using quantitative metrics and by expert quality review.
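As a minimal sketch of the abstractive summarization step, the snippet below runs a publicly available BART checkpoint fine-tuned on CNN/DailyMail (a reasonable stand-in for the BARTcnn model named above; the exact checkpoints and preprocessing used in the study are not specified here) on a hypothetical longitudinal report via the Hugging Face `transformers` pipeline.

```python
# Sketch only: summarize a longitudinal aneurysm report with a BART-CNN checkpoint.
# The report text is a hypothetical placeholder, not data from the study.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

longitudinal_report = (
    "Baseline CTA: 4 mm saccular aneurysm of the right MCA bifurcation. "
    "Follow-up MRA at 12 months: aneurysm measures 5 mm, interval growth. "
    "Follow-up MRA at 24 months: aneurysm stable at 5 mm, no new aneurysm identified."
)

# Generate a short abstractive summary of the concatenated report text.
result = summarizer(longitudinal_report, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])
```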
In clinical summarization, BARTcnn had the best performance (BERTscore = 0.8371), followed by LongT5Booksum and LEDlegal. In the analysis using PubMed case reports, GPT3davinci demonstrated the best performance, followed by BARTcnn and LEDbooksum (BERTscore = 0.894, 0.872, and 0.867, respectively).
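For the quantitative comparison against expert summaries, the BERTscore values reported above can be reproduced in principle with the `bert_score` Python package; the candidate and reference strings below are hypothetical placeholders rather than study data.

```python
# Sketch only: score a model-generated summary against an expert reference with BERTScore.
from bert_score import score

candidates = ["Right MCA aneurysm grew from 4 mm to 5 mm and is now stable."]
references = ["Right MCA bifurcation aneurysm enlarged from 4 to 5 mm and has remained stable."]

# Returns precision, recall, and F1 tensors; the F1 mean is the value typically reported.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```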
AI NLP summarization models demonstrated great potential for summarizing longitudinal aneurysm reports, though none yet reached the quality required for clinical use. We found that the online GPT LLM outperformed the others; however, the BARTcnn model is potentially more useful because it can be implemented on-site. Future work to improve summarization, address other types of neuroimaging reports, and develop structured reports may allow NLP models to ease clinical workflow.