Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, 11942, Jordan.
Department of Translational Medicine, Faculty of Medicine, Lund University, Malmö, 22184, Sweden.
BMC Infect Dis. 2024 Aug 8;24(1):799. doi: 10.1186/s12879-024-09725-y.
Assessment of artificial intelligence (AI)-based models across languages is crucial to ensure equitable access to accurate information in multilingual contexts. This study aimed to compare the performance of AI models in English and Arabic for infectious disease queries.
The study employed the METRICS checklist for the design and reporting of AI-based studies in healthcare. The AI models tested included ChatGPT-3.5, ChatGPT-4, Bing, and Bard. The queries comprised 15 questions on HIV/AIDS, tuberculosis, malaria, COVID-19, and influenza. The AI-generated content was assessed by two bilingual experts using the validated CLEAR tool.
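A minimal sketch of how such a query-and-scoring workflow could be organized is shown below. This is not the authors' code: the model calls are omitted because the abstract does not describe the interfaces used, the 1-5 Likert scale and the CLEAR item names (derived from the tool's acronym) are assumptions, and all identifiers are illustrative placeholders.

```python
# Hypothetical sketch of the study's data structure: each query is answered by
# one model in one language and rated by two experts on the CLEAR items.
from dataclasses import dataclass, field
from statistics import mean

MODELS = ["ChatGPT-3.5", "ChatGPT-4", "Bing", "Bard"]
TOPICS = ["HIV/AIDS", "tuberculosis", "malaria", "COVID-19", "influenza"]
LANGUAGES = ["English", "Arabic"]

@dataclass
class CLEARScore:
    """One rater's scores on the five CLEAR items (1-5 Likert scale assumed)."""
    completeness: int
    lack_of_false_information: int  # i.e., accuracy
    evidence: int
    appropriateness: int
    relevance: int

    def total(self) -> float:
        # Average of the five items for this rater.
        return mean([self.completeness, self.lack_of_false_information,
                     self.evidence, self.appropriateness, self.relevance])

@dataclass
class Record:
    """One AI-generated response and its ratings by the bilingual experts."""
    model: str
    language: str
    topic: str
    question: str
    response: str
    scores: list = field(default_factory=list)  # one CLEARScore per rater

    def mean_clear(self) -> float:
        # Mean CLEAR score across raters for this response.
        return mean(s.total() for s in self.scores)
```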
In comparing AI models' performance in English and Arabic for infectious disease queries, variability was noted. English queries showed consistently superior performance, with Bard leading, followed by Bing, ChatGPT-4, and ChatGPT-3.5 (P = .012). The same trend was observed in Arabic, albeit without statistical significance (P = .082). Stratified analysis revealed higher scores for English in most CLEAR components, notably completeness, accuracy, appropriateness, and relevance, especially with ChatGPT-3.5 and Bard. Across the five infectious disease topics, English outperformed Arabic, except for influenza queries in Bing and Bard. Overall, the four AI models' English performance was rated as "excellent", significantly outperforming their "above-average" Arabic performance (P = .002).
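The abstract does not name the statistical tests behind these P values; the sketch below assumes the common nonparametric choices for this design, a Kruskal-Wallis test across the four models within one language and a Mann-Whitney U test for English versus Arabic, purely as an illustration of the comparisons reported.

```python
# Illustrative nonparametric comparisons; not necessarily the tests used in the study.
from typing import Dict, List
from scipy.stats import kruskal, mannwhitneyu

def compare_models(scores_by_model: Dict[str, List[float]]) -> float:
    """Kruskal-Wallis test across the four models' per-query CLEAR scores
    within one language; returns the P value."""
    _, p = kruskal(*scores_by_model.values())
    return p

def compare_languages(english: List[float], arabic: List[float]) -> float:
    """Mann-Whitney U test of English vs. Arabic per-query CLEAR scores;
    returns the P value."""
    _, p = mannwhitneyu(english, arabic, alternative="two-sided")
    return p
```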
A disparity in AI model performance was noted between English and Arabic in response to infectious disease queries. This language variation can negatively impact the quality of health content delivered by AI models to native speakers of Arabic. AI developers should address this issue, with the ultimate goal of enhancing health outcomes.