Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions.

Affiliation

Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, 10 Center Dr, Bethesda, MD 20892, USA.

Publication information

Am J Hum Genet. 2024 Sep 5;111(9):1819-1833. doi: 10.1016/j.ajhg.2024.07.011. Epub 2024 Aug 14.

DOI:10.1016/j.ajhg.2024.07.011
PMID:39146935
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11393706/

Abstract

Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, there are many questions to explore, such as evaluating differences between open- and closed-source LLMs as well as LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer) regarding their ability to identify genetic conditions from textbook-like clinician questions and their corresponding layperson translations related to 63 genetic conditions. For open-source LLMs, larger models were more accurate than smaller LLMs: 7b, 13b, and larger than 33b parameter models obtained accuracy ranges from 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of 11 LLMs and Google search had significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with 2 types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed removal of rare terms from descriptions, which decreased accuracy for 5 of 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with a maximum 21% accuracy.
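The comparison between clinician and layperson prompts can be illustrated with a minimal sketch. The abstract does not say which statistical test the authors used, so the exact McNemar test below is an assumption, chosen because each condition yields a paired correct/incorrect outcome. The function names and the ten-condition outcome lists are hypothetical and are not data from the study.

```python
from math import comb


def accuracy(correct):
    """Fraction of prompts answered correctly (list of booleans)."""
    return sum(correct) / len(correct)


def mcnemar_exact(clinician, layperson):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Assumption: this is one reasonable test for a paired gap; the source
    does not state which test the authors applied.
    """
    # Discordant pairs: the model is correct on only one prompt type.
    only_clin = sum(c and not l for c, l in zip(clinician, layperson))
    only_lay = sum(l and not c for c, l in zip(clinician, layperson))
    n = only_clin + only_lay
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a gap
    k = min(only_clin, only_lay)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes for 10 conditions (True = condition identified).
clinician = [True, True, True, True, True, True, True, True, False, False]
layperson = [True, True, True, False, False, False, False, True, False, True]

print(accuracy(clinician))                  # 0.8
print(accuracy(layperson))                  # 0.5
print(mcnemar_exact(clinician, layperson))  # 0.375
```

With only ten toy pairs the gap is not significant; the study's 63 conditions would give a test of this kind considerably more power.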
