Evaluation of artificial intelligence (AI) chatbots for providing sexual health information: a consensus study using real-world clinical queries.

Author information

Latt Phyu M, Aung Ei T, Htaik Kay, Soe Nyi N, Lee David, King Alicia J, Fortune Ria, Ong Jason J, Chow Eric P F, Bradshaw Catriona S, Rahman Rashidur, Deneen Matthew, Dobinson Sheranne, Randall Claire, Zhang Lei, Fairley Christopher K

Affiliations

Melbourne Sexual Health Centre, Alfred Health, Melbourne, Australia.

School of Translational Medicine, Faculty of Medicine, Nursing and Health Sciences, Monash University, Melbourne, Australia.

Publication information

BMC Public Health. 2025 May 15;25(1):1788. doi: 10.1186/s12889-025-22933-8.

Abstract

INTRODUCTION

Artificial intelligence (AI) chatbots could provide the public with information on sensitive topics, including sexual health. However, how their performance compares with that of nurses, and how it varies across different AI chatbots, remains understudied, particularly in the field of sexual health. This study evaluated the performance of three AI chatbots (two prompt-tuned chatbots, Alice and Azure, and one standard chatbot, ChatGPT by OpenAI) in providing sexual health information on questions that experienced sexual health nurses could answer correctly.

METHODS

We analysed 195 anonymised sexual health questions received by the Melbourne Sexual Health Centre phone line. Using a consensus-based approach, a panel of experts evaluated, in blinded order, the responses to these questions from nurses and from the three AI chatbots. Performance was assessed on overall correctness and on five specific measures: guidance, accuracy, safety, ease of access, and provision of necessary information. We conducted subgroup analyses for clinic-specific questions (e.g., opening hours) and general sexual health questions, and a sensitivity analysis excluding questions that Azure could not answer.

RESULTS

Alice demonstrated the highest overall correctness (85.2%; 95% confidence interval (CI), 82.1-88.0%), followed by Azure (69.3%; 95% CI, 65.3-73.0%) and ChatGPT (64.8%; 95% CI, 60.7-68.7%). Prompt-tuned chatbots outperformed the base ChatGPT across all measures. Among all outcome measures, all chatbots performed best on safety, with Azure achieving the highest safety score (97.9%; 95% CI, 96.4-98.9%), indicating the lowest risk of providing potentially harmful advice. In subgroup analysis, all chatbots performed better on general sexual health questions compared to clinic-specific queries. Sensitivity analysis showed a narrower performance gap between Alice and Azure when excluding questions Azure could not answer.
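Each result above is a proportion reported with a 95% CI, but the abstract does not state which interval method or which denominators were used. As a hedged illustration only, the sketch below computes a Wilson score interval for a binomial proportion, one common choice for this kind of figure. The function name `wilson_ci` and the counts in the usage line are hypothetical, and the sketch is not intended to reproduce the study's reported intervals.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z = 1.96 gives an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n          # observed proportion, e.g. share of correct answers
    z2 = z * z
    denom = 1 + z2 / n
    # Centre is pulled slightly towards 0.5 compared with p_hat.
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical counts: 150 responses rated correct out of 195 questions.
    low, high = wilson_ci(150, 195)
    print(f"{150 / 195:.1%} (95% CI {low:.1%}-{high:.1%})")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 to 1 and behaves sensibly for proportions near the extremes, such as the safety scores close to 100%.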

CONCLUSIONS

Prompt-tuned AI chatbots outperformed base ChatGPT in providing sexual health information, with notably high safety scores. However, all AI chatbots were susceptible to generating incorrect information. These findings suggest that AI chatbots could serve as adjuncts to human healthcare providers in providing sexual health information, while highlighting the need for continued refinement and human oversight. Future research should focus on larger-scale evaluations and real-world implementations.

[Figure 1] https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3864/12080182/d1adac22632c/12889_2025_22933_Fig1_HTML.jpg