
The DRAGON benchmark for clinical NLP.

Authors

Bosma Joeran S, Dercksen Koen, Builtjes Luc, André Romain, Roest Christian, Fransen Stefan J, Noordman Constant R, Navarro-Padilla Mar, Lefkes Judith, Alves Natália, de Grauw Max J J, van Eekelen Leander, Spronck Joey M A, Schuurmans Megan, de Wilde Bram, Hendrix Ward, Aswolinskiy Witali, Saha Anindo, Twilt Jasper J, Geijs Daan, Veltman Jeroen, Yakar Derya, de Rooij Maarten, Ciompi Francesco, Hering Alessa, Geerdink Jeroen, Huisman Henkjan

Affiliations

Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center, Nijmegen, The Netherlands.

Department of Health & Information Technology, Ziekenhuisgroep Twente, Almelo, The Netherlands.

Publication information

NPJ Digit Med. 2025 May 17;8(1):289. doi: 10.1038/s41746-025-01626-x.

DOI: 10.1038/s41746-025-01626-x
PMID: 40379835
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12084576/

Abstract

Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.
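
To put the reported scores in perspective, the short sketch below computes the absolute and relative gain of each pretraining strategy over the general-domain baseline, along with the share of tasks with strong performance. All numbers are copied from the abstract; the variable and function names are illustrative assumptions, and the calculation is purely descriptive (it does not reproduce the reported significance test).

```python
# Scores and task counts are taken from the abstract above.
# Names such as `scores` and `gain_over_baseline` are illustrative only.
scores = {
    "domain-specific": 0.770,
    "mixed-domain": 0.756,
    "general-domain": 0.734,
}
baseline = scores["general-domain"]


def gain_over_baseline(score, base=baseline):
    """Return (absolute gain, relative gain in percent) over the baseline."""
    absolute = round(score - base, 3)
    relative_pct = round(100 * (score - base) / base, 1)
    return absolute, relative_pct


gains = {
    name: gain_over_baseline(score)
    for name, score in scores.items()
    if name != "general-domain"
}

# Task breakdown reported in the abstract: 18 strong, 10 subpar, 28 total.
strong, subpar, total = 18, 10, 28
strong_share = round(100 * strong / total, 1)
subpar_share = round(100 * subpar / total, 1)

for name, (absolute, relative_pct) in gains.items():
    print(f"{name}: +{absolute} absolute, +{relative_pct}% relative")
print(f"strong: {strong_share}% of tasks, subpar: {subpar_share}% of tasks")
```

Under these figures, domain-specific pretraining gains 0.036 points (about 4.9% relative) over general-domain pretraining, and mixed-domain pretraining gains 0.022 points (about 3.0% relative).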

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be5f/12084576/0fd4de0cbae6/41746_2025_1626_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be5f/12084576/4c73f65ee773/41746_2025_1626_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be5f/12084576/ab836f9717f7/41746_2025_1626_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be5f/12084576/8462af36c618/41746_2025_1626_Fig3_HTML.jpg
