Large language models and synthetic health data: progress and prospects.

Authors

Smolyak Daniel, Bjarnadóttir Margrét V, Crowley Kenyon, Agarwal Ritu

Affiliations

Department of Computer Science, University of Maryland, College Park, College Park, MD 20742, United States.

Robert H. Smith School of Business, University of Maryland, College Park, College Park, MD 20740, United States.

Publication information

JAMIA Open. 2024 Oct 26;7(4):ooae114. doi: 10.1093/jamiaopen/ooae114. eCollection 2024 Dec.

DOI:10.1093/jamiaopen/ooae114

PMID:39464796

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11512648/

Abstract

OBJECTIVES

Given substantial obstacles surrounding health data acquisition, high-quality synthetic health data are needed to meet a growing demand for the application of advanced analytics for clinical discovery, prediction, and operational excellence. We highlight how recent advances in large language models (LLMs) present new opportunities for progress, as well as new risks, in synthetic health data generation (SHDG).

MATERIALS AND METHODS

We synthesized systematic scoping reviews in the SHDG domain, recent LLM methods for SHDG, and papers investigating the capabilities and limits of LLMs.

RESULTS

We summarize the current landscape of generative machine learning models (eg, Generative Adversarial Networks) for SHDG, describe remaining challenges and limitations, and identify how recent LLM approaches can potentially help mitigate them.

DISCUSSION

Six research directions are outlined for further investigation of LLMs for SHDG: evaluation metrics, LLM adoption, data efficiency, generalization, health equity, and regulatory challenges.

CONCLUSION

LLMs have already demonstrated both high potential and risks in the health domain, and it is important to study their advantages and disadvantages for SHDG.
