Othman Amna A, Flaharty Kendall A, Ledgister Hanchard Suzanna E, Hu Ping, Duong Dat, Waikel Rebekah L, Solomon Benjamin D
Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
NPJ Aging. 2025 May 3;11(1):33. doi: 10.1038/s41514-025-00226-z.
Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized by age-related characteristics). Based on clinician-graded Correctness and Completeness scores, the LLMs provided age-appropriate responses in both child and adult outputs. A sub-analysis of metabolic conditions, including those that typically present neonatally with crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained low Correctness and Completeness scores for producing plausible management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.