Urbina Jacob T, Vu Peter D, Nguyen Michael V
Department of Physical Medicine and Rehabilitation, McGovern Medical School, UTHealth Houston, Houston, TX.
Arch Phys Med Rehabil. 2025 Jan;106(1):14-19. doi: 10.1016/j.apmr.2024.08.014. Epub 2024 Aug 30.
OBJECTIVE: To identify and quantify ability bias in generative artificial intelligence large language model chatbots, specifically OpenAI's ChatGPT and Google's Gemini.
DESIGN: Observational study of language usage in generative artificial intelligence models.
SETTING: Investigation-only browser profile restricted to ChatGPT and Gemini.
PARTICIPANTS: Each chatbot generated 60 descriptions of people in response to prompts that did not specify functional status, 30 descriptions of people with a disability, 30 descriptions of patients with a disability, and 30 descriptions of athletes with a disability (N=300).
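A minimal sketch of how such a prompting protocol could be reproduced programmatically is shown below. The prompt wordings and the chatbots mapping of chatbot names to text-generation callables are assumptions for illustration; only the repeat counts come from the abstract, and this is not the authors' actual procedure.

    # Hypothetical prompt wordings; the repeat counts are taken from the abstract.
    PROMPT_TEMPLATES = {
        "unspecified": ("Describe a person.", 60),  # no functional status given
        "person":      ("Describe a person with a disability.", 30),
        "patient":     ("Describe a patient with a disability.", 30),
        "athlete":     ("Describe an athlete with a disability.", 30),
    }

    def collect_descriptions(chatbots):
        """Collect every generated description, keyed by chatbot and prompt condition.

        `chatbots` maps a chatbot name to a callable that takes a prompt string and
        returns generated text (an assumption; the study used browser sessions).
        """
        records = []
        for bot_name, generate in chatbots.items():
            for condition, (prompt, repeats) in PROMPT_TEMPLATES.items():
                for _ in range(repeats):
                    records.append({"bot": bot_name, "condition": condition,
                                    "text": generate(prompt)})
        return records  # 150 records per chatbot; N=300 across the two chatbots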
INTERVENTIONS: Not applicable.
MAIN OUTCOME MEASURES: Generated descriptions were parsed into words, which were linguistically analyzed and categorized as conveying favorable qualities or limiting qualities.
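The abstract does not specify the linguistic coding scheme, so the sketch below illustrates one plausible word-level tally; the FAVORABLE and LIMITING word lists are hypothetical placeholders, not the study's instrument.

    import re
    from collections import Counter

    # Illustrative word lists only; the abstract does not specify the coding scheme.
    FAVORABLE = {"resilient", "determined", "skilled", "independent", "creative"}
    LIMITING = {"unable", "struggles", "dependent", "confined", "suffers"}

    def score_description(text):
        """Tally favorable and limiting words in one generated description."""
        words = Counter(re.findall(r"[a-z']+", text.lower()))
        favorable = sum(words[w] for w in FAVORABLE)
        limiting = sum(words[w] for w in LIMITING)
        return favorable, limiting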
RESULTS: Both large language models significantly underestimated the prevalence of disability when describing a general population of people. In both ChatGPT and Gemini, linguistic analysis showed that generated descriptions of people, patients, and athletes with a disability contained significantly fewer favorable qualities and significantly more limiting qualities than descriptions of people without a disability.
CONCLUSIONS: Generative artificial intelligence chatbots demonstrate quantifiable ability bias and often exclude people with disabilities from their responses. Ethical use of generative large language model chatbots in medical systems should recognize this limitation, and further consideration should be given to developing equitable artificial intelligence technologies.