Sharma Brihat, Gao Yanjun, Miller Timothy, Churpek Matthew M, Afshar Majid, Dligach Dmitriy
University of Wisconsin-Madison.
Boston Children's Hospital and Harvard Medical School.
Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023(ClinicalNLP):78-85.
Generative artificial intelligence (AI) is a promising direction for augmenting clinical diagnostic decision support and reducing diagnostic errors, which are a leading contributor to medical error. To further the development of clinical AI systems, the Diagnostic Reasoning Benchmark (DR.BENCH) was introduced as a comprehensive generative AI framework comprising six tasks that represent key components of clinical reasoning. We present a comparative analysis of in-domain versus out-of-domain language models, as well as multi-task versus single-task training, with a focus on the problem summarization task in DR.BENCH (Gao et al., 2023). We demonstrate that a multi-task, clinically trained language model outperforms its general-domain counterpart by a large margin, establishing new state-of-the-art performance with a ROUGE-L score of 28.55. This research underscores the value of domain-specific training for optimizing clinical diagnostic reasoning tasks.
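Problem summarization quality is reported here with ROUGE-L. As a minimal sketch of how such a score can be computed, assuming the Google rouge_score Python package and purely hypothetical reference/prediction strings (not drawn from DR.BENCH data):

# Minimal sketch: scoring a model-generated problem summary against a
# clinician-written reference with ROUGE-L (rouge_score package assumed).
from rouge_score import rouge_scorer

# Hypothetical example strings; real references would come from clinical notes.
reference = "Acute hypoxic respiratory failure secondary to community-acquired pneumonia."
prediction = "Hypoxic respiratory failure likely due to pneumonia."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, prediction)["rougeL"]

# Reported ROUGE-L is typically the F-measure, averaged over the test set.
print(f"ROUGE-L F1: {score.fmeasure:.4f}")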