Programming Chatbots Using Natural Language: Generating Cervical Spine MRI Impressions.

Author information

Javan Ramin, Kim Theodore, Abdelmonem Ahmed, Ismail Ahmed, Jaamour Farris, Melnyk Oleksiy, Heekin Mary

Affiliations

Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington, D.C., USA.

Department of Research, California Institute of Behavioral Neurosciences & Psychology, Fairfield, USA.

Publication information

Cureus. 2024 Sep 14;16(9):e69410. doi: 10.7759/cureus.69410. eCollection 2024 Sep.

Abstract

PURPOSE

The utility of machine learning, specifically large language models (LLMs), in the medical field has gained considerable attention. However, there is a scarcity of studies that focus on the application of LLMs in generating custom subspecialty radiology impressions. The primary objective of this study is to evaluate and compare the performance of multiple LLMs in generating specialized, accurate, and clinically useful radiology impressions for degenerative cervical spine MRI reports.

MATERIALS AND METHODS

The study employed a comparative analysis of multiple LLMs, including OpenAI's ChatGPT-3.5 and GPT-4 (OpenAI, San Francisco, CA), Anthropic's Claude 2 (Anthropic PBC, San Francisco, CA), Google's Bard (Google Inc., Mountain View, CA), and Meta's Llama 2 (Meta Platforms, Inc., Menlo Park, CA). The evaluation was performed during January-February 2024. These models were evaluated using a few-shot learning approach on a dataset consisting of 10 examples from 50 synthetically generated MRI reports. The performance metrics evaluated were diagnostic accuracy, stylistic accuracy, and redundancy.
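
The abstract does not reproduce the prompts used, so the following is only a minimal sketch of how a few-shot prompt of this kind might be assembled. The `Report` class, the `split_reports` and `build_few_shot_prompt` helpers, the instruction wording, and the placeholder data are all hypothetical. The 10/40 split is one reading of the abstract (10 examples drawn from 50 reports, with results reported across 40 cases), not a confirmed detail of the protocol.

```python
from dataclasses import dataclass


@dataclass
class Report:
    findings: str    # body of the MRI report
    impression: str  # reference impression written in the target style


def split_reports(reports, n_examples=10):
    """Split reports into few-shot examples and held-out evaluation cases."""
    return reports[:n_examples], reports[n_examples:]


def build_few_shot_prompt(examples, new_findings):
    """Assemble a plain-text prompt: instruction, worked examples, then the new case."""
    parts = [
        "Write the impression for a degenerative cervical spine MRI report, "
        "matching the style of the examples."
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nFINDINGS: {ex.findings}\nIMPRESSION: {ex.impression}"
        )
    # The new case ends with an open IMPRESSION field for the model to complete.
    parts.append(f"New report\nFINDINGS: {new_findings}\nIMPRESSION:")
    return "\n\n".join(parts)


# Placeholder data only; these are not reports from the study.
reports = [
    Report(f"placeholder findings {i}", f"placeholder impression {i}")
    for i in range(50)
]
examples, eval_cases = split_reports(reports)
prompt = build_few_shot_prompt(examples, eval_cases[0].findings)
```

In this sketch the same 10 examples would precede every held-out report, and the resulting text would be pasted into each chatbot in turn; how the study actually supplied its examples and feedback is not stated in the abstract.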

RESULTS

While Claude 2 maintained consistent high performance across 40 cases, GPT-4 required midway re-training to improve its declining scores. Both Claude 2 and GPT-4 demonstrated the ability to generate structured impressions, but Claude 2's specialized summarization capabilities provided an edge in maintaining accuracy without continuous feedback. The other LLMs' performance was subpar.

CONCLUSION

The findings of this study suggest that LLMs can be a valuable tool in automating the generation of radiology impressions. Claude 2, in particular, exhibited promising results, indicating its potential for clinical implementation. However, the study also points to the necessity for further research, especially in optimizing model performance and evaluating real-world applicability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/058b/11472864/dfe55e7f943d/cureus-0016-00000069410-i01.jpg
