Comparative Analysis of LLMs in Dry Eye Syndrome Healthcare Information

Author information

Wu Gloria, Paliath-Pathiyal Hrishi, Khan Obaid, Wang Margaret C

Affiliations

Department of Ophthalmology, School of Medicine, University of California, San Francisco, CA 94143, USA.

Department of Biological Sciences, Halmos College of Arts and Sciences, Nova Southeastern University, Fort Lauderdale, FL 33328, USA.

Publication information

Diagnostics (Basel). 2025 Jul 30;15(15):1913. doi: 10.3390/diagnostics15151913.

DOI:10.3390/diagnostics15151913

PMID:40804875

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12346532/

Abstract

Dry eye syndrome affects 16 million Americans, with USD 52 billion in annual healthcare costs. With large language models (LLMs) increasingly used for healthcare information, understanding their performance in delivering equitable dry eye guidance across diverse populations is critical. This study aims to evaluate and compare five major LLMs (Grok, ChatGPT, Gemini, Claude.ai, and Meta AI) regarding dry eye syndrome information delivery across different demographic groups. LLMs were queried using standardized prompts simulating a 62-year-old patient with dry eye symptoms across four demographic categories (White, Black, East Asian, and Hispanic males and females). Responses were analyzed for word count, readability, cultural sensitivity scores (0-3 scale), keyword coverage, and response times. Significant variations existed across LLMs. Word counts ranged from 32 to 346 words, with Gemini being the most comprehensive (653.8 ± 96.2 words) and Claude.ai being the most concise (207.6 ± 10.8 words). Cultural sensitivity scores revealed that Grok demonstrated the highest awareness for minority populations (scoring 3 for Black and Hispanic demographics), while Meta AI showed minimal cultural tailoring (0.5 ± 0.5). All models recommended specialist consultation, but medical term coverage varied significantly. Response times ranged from 7.41 s (Meta AI) to 25.32 s (Gemini). While all LLMs provided appropriate referral recommendations, substantial disparities exist in cultural sensitivity, content depth, and information delivery across demographic groups. No LLM consistently addressed the full spectrum of dry eye causes across all demographics. These findings underscore the importance of physician oversight and standardization in AI-generated healthcare information to ensure equitable access and prevent care delays.
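
The abstract reports per-model word counts and keyword coverage as mean ± standard deviation. As an illustration only, the following minimal Python sketch shows one way such metrics could be computed. The keyword list, the sample responses, and the function names are hypothetical placeholders; the abstract does not specify the study's actual keyword set, tokenization, or tooling.

```python
from statistics import mean, stdev

# Hypothetical keyword list; the study's actual keyword set is not given in the abstract.
KEYWORDS = ["artificial tears", "meibomian", "ophthalmologist", "blepharitis"]


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in a response."""
    return len(text.split())


def keyword_coverage(text: str, keywords=KEYWORDS) -> float:
    """Fraction of keywords that appear at least once (case-insensitive substring match)."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw in lowered)
    return hits / len(keywords)


def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return mean(values), stdev(values)


# Placeholder responses written for this sketch; they are NOT data from the study.
responses = [
    "Try artificial tears and see an ophthalmologist if symptoms persist.",
    "Blepharitis or meibomian gland dysfunction can cause dry eye; see an ophthalmologist.",
]

counts = [word_count(r) for r in responses]
coverages = [keyword_coverage(r) for r in responses]
avg, sd = summarize(counts)
print(f"word count: {avg:.1f} ± {sd:.1f}")
print("keyword coverage per response:", coverages)
```

Substring matching is a deliberately simple choice here; a real analysis would likely need stemming or a synonym list so that, for example, "eye doctor" and "ophthalmologist" are counted together.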
